Implement phases for skill iteration tuning: Evaluation, Improvement, and Reporting

- Added Phase 3: Evaluate Quality with steps for preparing context, constructing evaluation prompts, executing evaluation via CLI, parsing scores, and checking termination conditions.
- Introduced Phase 4: Apply Improvements to implement targeted changes based on evaluation suggestions, including agent execution and change documentation.
- Created Phase 5: Final Report to generate a comprehensive report of the iteration process, including score progression and remaining weaknesses.
- Established evaluation criteria in a new document to guide the evaluation process.
- Developed templates for evaluation and execution prompts to standardize input for the evaluation and execution phases.
2026-03-12 17:21:19 +08:00 · 2026-03-10 21:42:58 +08:00
parent b4ad8c7b80
commit 9fb13ed6b0
10 changed files with 1777 additions and 0 deletions
--- a/.claude/skills/skill-iter-tune/phases/01-setup.md
+++ b/.claude/skills/skill-iter-tune/phases/01-setup.md
@@ -0,0 +1,144 @@
+# Phase 1: Setup
+
+Initialize workspace, back up skills, parse inputs.
+
+## Objective
+
+- Parse skill path(s) and test scenario from user input
+- Validate all skill paths exist and contain SKILL.md
+- Create isolated workspace directory structure
+- Back up original skill files
+- Initialize iteration-state.json
+
+## Execution
+
+### Step 1.1: Parse Input
+
+Parse `$ARGUMENTS` to extract skill paths and test scenario.
+
+```javascript
+// Parse skill paths (first argument or comma-separated)
+const args = $ARGUMENTS.trim();
+const pathMatch = args.match(/^([^\s]+)/);
+const rawPaths = pathMatch ? pathMatch[1].split(',') : [];
+
+// Parse test scenario
+const scenarioMatch = args.match(/(?:--scenario|--test)\s+"([^"]+)"/);
+let scenarioText = scenarioMatch ? scenarioMatch[1] : args.replace(rawPaths.join(','), '').trim();
+
+// Record chain order (preserves input order for chain mode)
+const chainOrder = rawPaths.map(p => p.startsWith('.claude/') ? p.split('/').pop() : p);
+
+// If no scenario, ask user
+if (!scenarioText) {
+  const response = AskUserQuestion({
+    questions: [{
+      question: "Please describe the test scenario for evaluating this skill:",
+      header: "Test Scenario",
+      multiSelect: false,
+      options: [
+        { label: "General quality test", description: "Evaluate overall skill quality with a generic task" },
+        { label: "Specific scenario", description: "I'll describe a specific test case" }
+      ]
+    }]
+  });
+  // Use the response to fill in scenarioText (stored as test_scenario.description in Step 1.5)
+}
+```
+
+### Step 1.2: Validate Skill Paths
+
+```javascript
+const targetSkills = [];
+for (const rawPath of rawPaths) {
+  const skillPath = rawPath.startsWith('.claude/') ? rawPath : `.claude/skills/${rawPath}`;
+
+  // Validate SKILL.md exists
+  const skillFiles = Glob(`${skillPath}/SKILL.md`);
+  if (skillFiles.length === 0) {
+    throw new Error(`Skill not found at: ${skillPath} -- SKILL.md missing`);
+  }
+
+  // Collect all skill files
+  const allFiles = Glob(`${skillPath}/**/*.md`);
+  targetSkills.push({
+    name: skillPath.split('/').pop(),
+    path: skillPath,
+    files: allFiles.map(f => f.replace(skillPath + '/', '')),
+    primary_file: 'SKILL.md'
+  });
+}
+```
+
+### Step 1.3: Create Workspace
+
+```javascript
+const ts = Date.now();
+const workDir = `.workflow/.scratchpad/skill-iter-tune-${ts}`;
+
+Bash(`mkdir -p "${workDir}/backups" "${workDir}/iterations"`);
+```
+
+### Step 1.4: Backup Original Skills
+
+```javascript
+for (const skill of targetSkills) {
+  Bash(`cp -r "${skill.path}" "${workDir}/backups/${skill.name}"`);
+}
+```
+
+### Step 1.5: Initialize State
+
+Write `iteration-state.json` with initial state:
+
+```javascript
+const initialState = {
+  status: 'running',
+  started_at: new Date().toISOString(),
+  updated_at: new Date().toISOString(),
+  target_skills: targetSkills,
+  test_scenario: {
+    description: scenarioText,
+    // Parse --requirements and --input-args from $ARGUMENTS if provided
+    // e.g., --requirements "clear output,no errors" --input-args "my-skill --scenario test"
+    requirements: parseListArg(args, '--requirements') || [],
+    input_args: parseStringArg(args, '--input-args') || '',
+    success_criteria: parseStringArg(args, '--success-criteria') || 'Produces correct, high-quality output'
+  },
+  execution_mode: workflowPreferences.executionMode || 'single',
+  chain_order: workflowPreferences.executionMode === 'chain'
+    ? targetSkills.map(s => s.name)
+    : [],
+  current_iteration: 0,
+  max_iterations: workflowPreferences.maxIterations,
+  quality_threshold: workflowPreferences.qualityThreshold,
+  latest_score: 0,
+  score_trend: [],
+  converged: false,
+  iterations: [],
+  errors: [],
+  error_count: 0,
+  max_errors: 3,
+  work_dir: workDir,
+  backup_dir: `${workDir}/backups`
+};
+
+Write(`${workDir}/iteration-state.json`, JSON.stringify(initialState, null, 2));
+
+// Chain mode: create per-skill tracking tasks
+if (initialState.execution_mode === 'chain') {
+  for (const skill of targetSkills) {
+    TaskCreate({
+      subject: `Chain: ${skill.name}`,
+      activeForm: `Tracking ${skill.name}`,
+      description: `Skill chain member: ${skill.path} | Position: ${targetSkills.indexOf(skill) + 1}/${targetSkills.length}`
+    });
+  }
+}
+```
+
+## Output
+
+- **Variables**: `workDir`, `targetSkills[]`, `testScenario`, `chainOrder` (chain mode)
+- **Files**: `iteration-state.json`, `backups/` directory with skill copies
+- **TodoWrite**: Mark Phase 1 completed, start Iteration Loop. Chain mode: per-skill tracking tasks created
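Step 1.5 calls `parseListArg` and `parseStringArg`, but neither is defined in the added files. The following is a minimal sketch of what such helpers might look like. It is hypothetical, not the author's implementation: it assumes every flag value is wrapped in double quotes (as in the Step 1.5 example) and that list values are comma-separated. `escapeRegExp` is an illustrative helper name as well.

```javascript
// Hypothetical helpers for Step 1.5; not part of the commit.
// Assumption: flag values are always double-quoted, e.g. --requirements "a,b".

// Escape regex metacharacters so the flag name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the quoted value following `flag`, or null when the flag is absent.
function parseStringArg(args, flag) {
  const pattern = new RegExp(`(?:^|\\s)${escapeRegExp(flag)}\\s+"([^"]*)"`);
  const match = args.match(pattern);
  return match ? match[1] : null;
}

// Return the flag value split on commas, trimmed, with empty items dropped;
// null when the flag is absent so that `|| []` in Step 1.5 supplies the default.
function parseListArg(args, flag) {
  const raw = parseStringArg(args, flag);
  if (raw === null) return null;
  return raw.split(',').map(item => item.trim()).filter(item => item.length > 0);
}

const sample = 'my-skill --requirements "clear output, no errors" --success-criteria "all good"';
console.log(parseListArg(sample, '--requirements'));       // [ 'clear output', 'no errors' ]
console.log(parseStringArg(sample, '--success-criteria')); // all good
console.log(parseStringArg(sample, '--input-args'));       // null
```

Returning `null` for a missing flag keeps the `|| []` and `|| ''` fallbacks in Step 1.5 meaningful. Unquoted flag values are not handled under these assumptions.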
--- a/.claude/skills/skill-iter-tune/phases/02-execute.md
+++ b/.claude/skills/skill-iter-tune/phases/02-execute.md
@@ -0,0 +1,292 @@
+# Phase 2: Execute Skill
+
+> **COMPACT SENTINEL [Phase 2: Execute]**
+> This phase contains 4 execution steps (Step 2.1 -- 2.4), plus the chain-mode variant Step 2.2b.
+> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
+> Recovery: `Read("phases/02-execute.md")`
+
+Execute the target skill against the test scenario using `ccw cli --tool claude --mode write`. Claude receives the full skill definition and simulates producing its expected output artifacts.
+
+## Objective
+
+- Snapshot current skill version before execution
+- Construct execution prompt with full skill content + test scenario
+- Execute via ccw cli Claude
+- Collect output artifacts
+
+## Execution
+
+### Step 2.1: Snapshot Current Skill
+
+```javascript
+const N = state.current_iteration;
+const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
+Bash(`mkdir -p "${iterDir}/skill-snapshot" "${iterDir}/artifacts"`);
+
+// Chain mode: create per-skill artifact directories
+if (state.execution_mode === 'chain') {
+  for (const skillName of state.chain_order) {
+    Bash(`mkdir -p "${iterDir}/artifacts/${skillName}"`);
+  }
+}
+
+// Snapshot current skill state (so we can compare/rollback)
+for (const skill of state.target_skills) {
+  Bash(`cp -r "${skill.path}" "${iterDir}/skill-snapshot/${skill.name}"`);
+}
+```
+
+### Step 2.2: Construct Execution Prompt (Single Mode)
+
+Read the execute-prompt template and substitute variables.
+
+> Skip to Step 2.2b if `state.execution_mode === 'chain'`.
+
+```javascript
+// Ref: templates/execute-prompt.md
+
+// Build skillContent by reading only executable skill files (SKILL.md, phases/, specs/)
+// Exclude README.md, docs/, and other non-executable files to save tokens
+const skillContent = state.target_skills.map(skill => {
+  const skillMd = Read(`${skill.path}/SKILL.md`);
+  const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+  const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+
+  return `### File: SKILL.md\n${skillMd}\n\n` +
+    phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
+    (specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
+}).join('\n\n---\n\n');
+
+// Construct full prompt using template
+const executePrompt = `PURPOSE: Simulate executing the following workflow skill against a test scenario. Produce all expected output artifacts as if the skill were invoked with the given input.
+
+SKILL CONTENT:
+${skillContent}
+
+TEST SCENARIO:
+Description: ${state.test_scenario.description}
+Input Arguments: ${state.test_scenario.input_args}
+Requirements: ${state.test_scenario.requirements.join('; ')}
+Success Criteria: ${state.test_scenario.success_criteria}
+
+TASK:
+1. Study the complete skill structure (SKILL.md + all phase files)
+2. Follow the skill execution flow sequentially
+3. For each phase, produce the artifacts that phase would generate
+4. Write all output artifacts to the current working directory
+5. Create a manifest.json listing all produced artifacts
+
+MODE: write
+CONTEXT: @**/*
+EXPECTED: All artifacts written to disk + manifest.json
+CONSTRAINTS: Follow skill flow exactly, produce realistic output, not placeholders`;
+```
+
+### Step 2.3: Execute via ccw cli
+
+> **CHECKPOINT**: Before executing CLI, verify:
+> 1. This phase is TodoWrite `in_progress`
+> 2. `iterDir/artifacts/` directory exists
+> 3. Prompt is properly escaped
+
+```javascript
+function escapeForShell(str) {
+  return str.replace(/[\\"$`]/g, '\\$&');  // escapes \ " $ and ` for use inside double quotes
+}
+
+const cliCommand = `ccw cli -p "${escapeForShell(executePrompt)}" --tool claude --mode write --cd "${iterDir}/artifacts"`;
+
+// Execute in background, wait for hook callback
+Bash({
+  command: cliCommand,
+  run_in_background: true,
+  timeout: 600000  // 10 minutes max
+});
+
+// STOP HERE -- wait for hook callback to resume
+// After callback, verify artifacts were produced
+```
+
+### Step 2.2b: Chain Execution Path
+
+> Skip this step if `state.execution_mode === 'single'`.
+
+In chain mode, execute each skill sequentially. Each skill receives the previous skill's artifacts as input context.
+
+```javascript
+// Chain execution: iterate through chain_order
+let previousArtifacts = '';  // Accumulates upstream output
+
+for (let i = 0; i < state.chain_order.length; i++) {
+  const skillName = state.chain_order[i];
+  const skill = state.target_skills.find(s => s.name === skillName);
+  const skillArtifactDir = `${iterDir}/artifacts/${skillName}`;
+
+  // Build this skill's content
+  const skillMd = Read(`${skill.path}/SKILL.md`);
+  const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+  const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+
+  const singleSkillContent = `### File: SKILL.md\n${skillMd}\n\n` +
+    phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
+    (specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
+
+  // Build chain context from previous skill's artifacts
+  const chainInputContext = previousArtifacts
+    ? `\nPREVIOUS CHAIN OUTPUT (from upstream skill "${state.chain_order[i - 1]}"):\n${previousArtifacts}\n\nIMPORTANT: Use the above output as input context for this skill's execution.\n`
+    : '';
+
+  // Construct per-skill execution prompt
+  // Ref: templates/execute-prompt.md
+  const chainPrompt = `PURPOSE: Simulate executing the following workflow skill against a test scenario. Produce all expected output artifacts.
+
+SKILL CONTENT (${skillName} — chain position ${i + 1}/${state.chain_order.length}):
+${singleSkillContent}
+${chainInputContext}
+TEST SCENARIO:
+Description: ${state.test_scenario.description}
+Input Arguments: ${state.test_scenario.input_args}
+Requirements: ${state.test_scenario.requirements.join('; ')}
+Success Criteria: ${state.test_scenario.success_criteria}
+
+TASK:
+1. Study the complete skill structure
+2. Follow the skill execution flow sequentially
+3. Produce all expected artifacts
+4. Write output to the current working directory
+5. Create manifest.json listing all produced artifacts
+
+MODE: write
+CONTEXT: @**/*
+CONSTRAINTS: Follow skill flow exactly, produce realistic output`;
+
+  function escapeForShell(str) {
+    return str.replace(/[\\"$`]/g, '\\$&');  // escapes \ " $ and ` for use inside double quotes
+  }
+
+  const cliCommand = `ccw cli -p "${escapeForShell(chainPrompt)}" --tool claude --mode write --cd "${skillArtifactDir}"`;
+
+  // Execute in background
+  Bash({
+    command: cliCommand,
+    run_in_background: true,
+    timeout: 600000
+  });
+
+  // STOP -- wait for hook callback
+
+  // After callback: collect artifacts for next skill in chain
+  const artifacts = Glob(`${skillArtifactDir}/**/*`);
+  const skillSuccess = artifacts.length > 0;
+
+  if (skillSuccess) {
+    previousArtifacts = artifacts.slice(0, 10).map(f => {
+      const relPath = f.replace(skillArtifactDir + '/', '');
+      const content = Read(f, { limit: 100 });
+      return `--- ${relPath} ---\n${content}`;
+    }).join('\n\n');
+  } else {
+    // Mid-chain failure: keep previous artifacts for downstream skills
+    // Log warning but continue chain — downstream skills receive last successful output
+    state.errors.push({
+      phase: 'execute',
+      message: `Chain skill "${skillName}" (position ${i + 1}) produced no artifacts. Downstream skills will receive upstream output from "${state.chain_order[i - 1] || 'none'}" instead.`,
+      timestamp: new Date().toISOString()
+    });
+    state.error_count++;
+    // previousArtifacts remains from last successful skill (or empty if first)
+  }
+
+  // Update per-skill TodoWrite
+  // TaskUpdate chain skill task with execution status
+
+  // Record per-skill execution
+  // Create the execution record on first use so the push below cannot throw
+  state.iterations[N - 1].execution ??= {};
+  state.iterations[N - 1].execution.chain_executions ??= [];
+  state.iterations[N - 1].execution.chain_executions.push({
+    skill_name: skillName,
+    cli_command: cliCommand,
+    artifacts_dir: skillArtifactDir,
+    success: skillSuccess
+  });
+
+  // Check error budget: abort chain once the cumulative error count reaches the limit
+  if (state.error_count >= state.max_errors) {
+    state.errors.push({
+      phase: 'execute',
+      message: `Chain execution aborted at skill "${skillName}" — error limit reached (${state.error_count} errors).`,
+      timestamp: new Date().toISOString()
+    });
+    break;
+  }
+}
+```
+
+### Step 2.4: Collect Artifacts
+
+After CLI completes (hook callback received):
+
+```javascript
+// List produced artifacts
+const artifactFiles = Glob(`${iterDir}/artifacts/**/*`);
+
+// Chain mode: check per-skill artifacts
+if (state.execution_mode === 'chain') {
+  const chainSuccess = state.iterations[N - 1].execution.chain_executions?.every(e => e.success) ?? false;
+  state.iterations[N - 1].execution.success = chainSuccess;
+  state.iterations[N - 1].execution.artifacts_dir = `${iterDir}/artifacts`;
+} else {
+
+  if (artifactFiles.length === 0) {
+    // Execution produced nothing -- record error
+    state.iterations[N - 1].execution = {
+      cli_command: cliCommand,
+      started_at: new Date().toISOString(),
+      completed_at: new Date().toISOString(),
+      artifacts_dir: `${iterDir}/artifacts`,
+      success: false
+    };
+    state.error_count++;
+    // Continue to Phase 3 anyway -- Gemini can evaluate the skill even without artifacts
+  } else {
+    state.iterations[N - 1].execution = {
+      cli_command: cliCommand,
+      started_at: new Date().toISOString(),
+      completed_at: new Date().toISOString(),
+      artifacts_dir: `${iterDir}/artifacts`,
+      success: true
+    };
+  }
+
+} // end single mode branch
+
+// Update state
+Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+```
+
+## Error Handling
+
+| Error | Recovery |
+|-------|----------|
+| CLI timeout (10min) | Record failure, continue to Phase 3 without artifacts |
+| CLI crash | Retry once with simplified prompt (SKILL.md only, no phase files) |
+| No artifacts produced | Continue to Phase 3, evaluation focuses on skill definition quality |
+
+## Output
+
+- **Files**: `iteration-{N}/skill-snapshot/`, `iteration-{N}/artifacts/`
+- **State**: `iterations[N-1].execution` updated
+- **Next**: Phase 3 (Evaluate)
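Steps 2.3 and 2.2b interpolate the prompt into a double-quoted shell argument. Inside double quotes a POSIX shell still treats four characters as special: backslash, double quote, dollar sign, and backtick. The sketch below shows two ways to make an arbitrary prompt safe under that assumption. The function names `escapeForDoubleQuotes` and `quoteForShell` are illustrative, and whether `ccw cli` imposes further constraints on `-p` is not covered here.

```javascript
// Sketch only: two quoting strategies for passing arbitrary text to a POSIX shell.

// Strategy 1: keep the "..." wrapper and backslash-escape the four characters
// that remain special inside double quotes: \ " $ `
function escapeForDoubleQuotes(str) {
  return str.replace(/[\\"$`]/g, '\\$&');
}

// Strategy 2: wrap in single quotes, where nothing is special except the
// single quote itself, which is written as '\'' (close, escaped quote, reopen).
function quoteForShell(str) {
  return `'${str.replace(/'/g, `'\\''`)}'`;
}

const prompt = 'Print "$HOME" and `date`, then C:\\temp';
console.log(`ccw cli -p "${escapeForDoubleQuotes(prompt)}"`);
console.log(`ccw cli -p ${quoteForShell("it's a prompt")}`);
```

Single-quote wrapping has fewer special cases, so it may be the simpler choice when the prompt is long and contains markdown code fences.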
--- a/.claude/skills/skill-iter-tune/phases/03-evaluate.md
+++ b/.claude/skills/skill-iter-tune/phases/03-evaluate.md
@@ -0,0 +1,312 @@
+# Phase 3: Evaluate Quality
+
+> **COMPACT SENTINEL [Phase 3: Evaluate]**
+> This phase contains 5 execution steps (Step 3.1 -- 3.5).
+> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
+> Recovery: `Read("phases/03-evaluate.md")`
+
+Evaluate skill quality using `ccw cli --tool gemini --mode analysis`. Gemini scores the skill across 5 dimensions and provides improvement suggestions.
+
+## Objective
+
+- Construct evaluation prompt with skill + artifacts + criteria
+- Execute via ccw cli Gemini
+- Parse multi-dimensional score
+- Write iteration-{N}-eval.md
+- Check termination conditions
+
+## Execution
+
+### Step 3.1: Prepare Evaluation Context
+
+```javascript
+const N = state.current_iteration;
+const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
+
+// Read evaluation criteria
+// Ref: specs/evaluation-criteria.md
+const evaluationCriteria = Read('.claude/skills/skill-iter-tune/specs/evaluation-criteria.md');
+
+// Build skillContent (same pattern as Phase 02 — only executable files)
+const skillContent = state.target_skills.map(skill => {
+  const skillMd = Read(`${skill.path}/SKILL.md`);
+  const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+  const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
+    relativePath: f.replace(skill.path + '/', ''),
+    content: Read(f)
+  }));
+  return `### File: SKILL.md\n${skillMd}\n\n` +
+    phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
+    (specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
+}).join('\n\n---\n\n');
+
+// Build artifacts summary
+let artifactsSummary = 'No artifacts produced (execution may have failed)';
+
+if (state.execution_mode === 'chain') {
+  // Chain mode: group artifacts by skill
+  const chainSummaries = state.chain_order.map(skillName => {
+    const skillArtifactDir = `${iterDir}/artifacts/${skillName}`;
+    const files = Glob(`${skillArtifactDir}/**/*`);
+    if (files.length === 0) return `### ${skillName} (no artifacts)`;
+    const filesSummary = files.map(f => {
+      const relPath = f.replace(`${skillArtifactDir}/`, '');
+      const content = Read(f, { limit: 200 });
+      return `--- ${relPath} ---\n${content}`;
+    }).join('\n\n');
+    return `### ${skillName} (chain position ${state.chain_order.indexOf(skillName) + 1})\n${filesSummary}`;
+  });
+  artifactsSummary = chainSummaries.join('\n\n---\n\n');
+} else {
+  // Single mode: one flat artifacts directory
+  const artifactFiles = Glob(`${iterDir}/artifacts/**/*`);
+  if (artifactFiles.length > 0) {
+    artifactsSummary = artifactFiles.map(f => {
+      const relPath = f.replace(`${iterDir}/artifacts/`, '');
+      const content = Read(f, { limit: 200 });
+      return `--- ${relPath} ---\n${content}`;
+    }).join('\n\n');
+  }
+}
+
+// Build previous evaluation context
+const previousEvalContext = state.iterations.filter(i => i.evaluation).length > 0
+  ? `PREVIOUS ITERATIONS:\n` + state.iterations.filter(i => i.evaluation).map(iter =>
+    `Iteration ${iter.round}: Score ${iter.evaluation.score}\n` +
+    `  Applied: ${iter.improvement?.changes_applied?.map(c => c.summary).join('; ') || 'none'}\n` +
+    `  Weaknesses: ${iter.evaluation.weaknesses?.slice(0, 3).join('; ') || 'none'}`
+  ).join('\n') + '\nIMPORTANT: Focus on NEW issues not yet addressed.'
+  : '';
+```
+
+### Step 3.2: Construct Evaluation Prompt
+
+```javascript
+// Ref: templates/eval-prompt.md
+const evalPrompt = `PURPOSE: Evaluate the quality of a workflow skill by examining its definition and produced artifacts.
+
+SKILL DEFINITION:
+${skillContent}
+
+TEST SCENARIO:
+${state.test_scenario.description}
+Requirements: ${state.test_scenario.requirements.join('; ')}
+Success Criteria: ${state.test_scenario.success_criteria}
+
+ARTIFACTS PRODUCED:
+${artifactsSummary}
+
+EVALUATION CRITERIA:
+${evaluationCriteria}
+
+${previousEvalContext}
+
+${state.execution_mode === 'chain' ? `
+CHAIN CONTEXT:
+This skill chain contains ${state.chain_order.length} skills executed in order:
+${state.chain_order.map((s, i) => `${i+1}. ${s}`).join('\n')}
+Current evaluation covers the entire chain output.
+Please provide per-skill quality scores in an additional "chain_scores" field: { "${state.chain_order[0]}": <score>, ... }
+` : ''}
+
+TASK:
+1. Score each dimension (Clarity 0.20, Completeness 0.25, Correctness 0.25, Effectiveness 0.20, Efficiency 0.10) on 0-100
+2. Calculate weighted composite score
+3. List top 3 strengths
+4. List top 3-5 weaknesses with file:section references
+5. Provide 3-5 prioritized improvement suggestions with concrete changes
+
+EXPECTED OUTPUT (strict JSON, no markdown):
+{
+  "composite_score": <0-100>,
+  "dimensions": [
+    {"name":"Clarity","id":"clarity","score":<0-100>,"weight":0.20,"feedback":"..."},
+    {"name":"Completeness","id":"completeness","score":<0-100>,"weight":0.25,"feedback":"..."},
+    {"name":"Correctness","id":"correctness","score":<0-100>,"weight":0.25,"feedback":"..."},
+    {"name":"Effectiveness","id":"effectiveness","score":<0-100>,"weight":0.20,"feedback":"..."},
+    {"name":"Efficiency","id":"efficiency","score":<0-100>,"weight":0.10,"feedback":"..."}
+  ],
+  "strengths": ["...", "...", "..."],
+  "weaknesses": ["...with file:section ref...", "..."],
+  "suggestions": [
+    {"priority":"high|medium|low","target_file":"...","description":"...","rationale":"...","code_snippet":"..."}
+  ]
+}
+
+CONSTRAINTS: Be rigorous, reference exact files, focus on highest-impact changes, output ONLY JSON`;
+```
+
+### Step 3.3: Execute via ccw cli Gemini
+
+> **CHECKPOINT**: Verify evaluation prompt is properly constructed before CLI execution.
+
+```javascript
+// Shell escape utility (same as Phase 02)
+function escapeForShell(str) {
+  return str.replace(/[\\"$`]/g, '\\$&');  // escapes \ " $ and ` for use inside double quotes
+}
+
+const skillPath = state.target_skills[0].path;  // Primary skill for --cd
+
+const cliCommand = `ccw cli -p "${escapeForShell(evalPrompt)}" --tool gemini --mode analysis --cd "${skillPath}"`;
+
+// Execute in background
+Bash({
+  command: cliCommand,
+  run_in_background: true,
+  timeout: 300000  // 5 minutes
+});
+
+// STOP -- wait for hook callback
+```
+
+### Step 3.4: Parse Score and Write Eval File
+
+After CLI completes:
+
+```javascript
+// Parse JSON from Gemini output
+// The output may contain markdown wrapping -- extract JSON
+const rawOutput = /* CLI output from callback */;
+const jsonMatch = rawOutput.match(/\{[\s\S]*\}/);
+let evaluation;
+
+if (jsonMatch) {
+  try {
+    evaluation = JSON.parse(jsonMatch[0]);
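+    // chain_scores (chain mode) is persisted with the rest of the evaluation
+    // in the state update below. Do not write state.iterations[N - 1].evaluation
+    // here: it is not set yet, and the resulting TypeError would be swallowed by
+    // this catch and discard an otherwise valid parse.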
+  } catch (e) {
+    // Fallback: try to extract score heuristically (integer or decimal)
+    const scoreMatch = rawOutput.match(/"composite_score"\s*:\s*(\d+(?:\.\d+)?)/);
+    evaluation = {
+      composite_score: scoreMatch ? parseFloat(scoreMatch[1]) : 50,
+      dimensions: [],
+      strengths: [],
+      weaknesses: ['Evaluation output parsing failed -- raw output saved'],
+      suggestions: []
+    };
+  }
+} else {
+  evaluation = {
+    composite_score: 50,
+    dimensions: [],
+    strengths: [],
+    weaknesses: ['No structured evaluation output -- defaulting to 50'],
+    suggestions: []
+  };
+}
+
+// Write iteration-N-eval.md
+const evalReport = `# Iteration ${N} Evaluation
+
+**Composite Score**: ${evaluation.composite_score}/100
+**Date**: ${new Date().toISOString()}
+
+## Dimension Scores
+
+| Dimension | Score | Weight | Feedback |
+|-----------|-------|--------|----------|
+${(evaluation.dimensions || []).map(d =>
+  `| ${d.name} | ${d.score} | ${d.weight} | ${d.feedback} |`
+).join('\n')}
+
+${(state.execution_mode === 'chain' && evaluation.chain_scores) ? `
+## Chain Scores
+
+| Skill | Score | Chain Position |
+|-------|-------|----------------|
+${state.chain_order.map((s, i) => `| ${s} | ${evaluation.chain_scores[s] ?? '-'} | ${i + 1} |`).join('\n')}
+` : ''}
+
+## Strengths
+${(evaluation.strengths || []).map(s => `- ${s}`).join('\n')}
+
+## Weaknesses
+${(evaluation.weaknesses || []).map(w => `- ${w}`).join('\n')}
+
+## Improvement Suggestions
+${(evaluation.suggestions || []).map((s, i) =>
+  `### ${i + 1}. [${s.priority}] ${s.description}\n- **Target**: ${s.target_file}\n- **Rationale**: ${s.rationale}\n${s.code_snippet ? `- **Suggested**:\n\`\`\`\n${s.code_snippet}\n\`\`\`` : ''}`
+).join('\n\n')}
+`;
+
+Write(`${iterDir}/iteration-${N}-eval.md`, evalReport);
+
+// Update state
+state.iterations[N - 1].evaluation = {
+  score: evaluation.composite_score,
+  dimensions: evaluation.dimensions || [],
+  strengths: evaluation.strengths || [],
+  weaknesses: evaluation.weaknesses || [],
+  suggestions: evaluation.suggestions || [],
+  chain_scores: evaluation.chain_scores || null,
+  eval_file: `${iterDir}/iteration-${N}-eval.md`
+};
+state.latest_score = evaluation.composite_score;
+state.score_trend.push(evaluation.composite_score);
+
+Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+```
+
+### Step 3.5: Check Termination
+
+```javascript
+function shouldTerminate(state) {
+  // 1. Quality threshold met
+  if (state.latest_score >= state.quality_threshold) {
+    return { terminate: true, reason: 'quality_threshold_met' };
+  }
+
+  // 2. Max iterations reached
+  if (state.current_iteration >= state.max_iterations) {
+    return { terminate: true, reason: 'max_iterations_reached' };
+  }
+
+  // 3. Convergence: score gained at most 2 points across the last 3 evaluations
+  if (state.score_trend.length >= 3) {
+    const last3 = state.score_trend.slice(-3);
+    const improvement = last3[2] - last3[0];
+    if (improvement <= 2) {
+      state.converged = true;
+      return { terminate: true, reason: 'convergence_detected' };
+    }
+  }
+
+  // 4. Error limit
+  if (state.error_count >= state.max_errors) {
+    return { terminate: true, reason: 'error_limit_reached' };
+  }
+
+  return { terminate: false };
+}
+
+const termination = shouldTerminate(state);
+if (termination.terminate) {
+  state.termination_reason = termination.reason;
+  Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+  // Skip Phase 4, go directly to Phase 5 (Report)
+} else {
+  // Continue to Phase 4 (Improve)
+}
+```
+
+## Error Handling
+
+| Error | Recovery |
+|-------|----------|
+| CLI timeout | Retry once; if it still fails, use score 50 with a warning |
+| JSON parse failure | Extract score heuristically, save raw output |
+| No output | Default score 50, note in weaknesses |
+
+## Output
+
+- **Files**: `iteration-{N}-eval.md`
+- **State**: `iterations[N-1].evaluation`, `latest_score`, `score_trend` updated
+- **Decision**: terminate -> Phase 5, continue -> Phase 4
+- **TodoWrite**: Update current iteration score display
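Step 3.2 asks the evaluator for a weighted composite but Step 3.4 accepts whatever `composite_score` comes back. As a sanity check, the composite can be recomputed locally from the dimension scores using the weights stated in the prompt (0.20, 0.25, 0.25, 0.20, 0.10). The sketch below is an illustration only: `WEIGHTS` and `compositeScore` are hypothetical names, and how to react to a mismatch is left open.

```javascript
// Sketch: recompute the weighted composite from the dimension scores.
// Weights are taken from the Step 3.2 prompt; the names below are illustrative.
const WEIGHTS = {
  clarity: 0.20,
  completeness: 0.25,
  correctness: 0.25,
  effectiveness: 0.20,
  efficiency: 0.10
};

// Returns the weighted score rounded to one decimal place, or null when any
// dimension is missing or non-numeric (e.g. after a parse fallback).
function compositeScore(dimensions) {
  let total = 0;
  for (const [id, weight] of Object.entries(WEIGHTS)) {
    const dim = dimensions.find(d => d.id === id);
    if (!dim || typeof dim.score !== 'number') return null;
    total += dim.score * weight;
  }
  return Math.round(total * 10) / 10;
}

const dimensions = [
  { id: 'clarity', score: 80 },
  { id: 'completeness', score: 70 },
  { id: 'correctness', score: 90 },
  { id: 'effectiveness', score: 60 },
  { id: 'efficiency', score: 50 }
];
// 16 + 17.5 + 22.5 + 12 + 5
console.log(compositeScore(dimensions)); // 73
```

A `null` result signals that the fallback evaluation (empty `dimensions`) is in use, in which case the reported score of 50 cannot be cross-checked.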
--- a/.claude/skills/skill-iter-tune/phases/04-improve.md
+++ b/.claude/skills/skill-iter-tune/phases/04-improve.md
@@ -0,0 +1,186 @@
+# Phase 4: Apply Improvements
+
+> **COMPACT SENTINEL [Phase 4: Improve]**
+> This phase contains 4 execution steps (Step 4.1 -- 4.4).
+> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
+> Recovery: `Read("phases/04-improve.md")`
+
+Apply targeted improvements to skill files based on evaluation suggestions. Uses a general-purpose Agent to make changes, ensuring only suggested modifications are applied.
+
+## Objective
+
+- Read evaluation suggestions from current iteration
+- Launch Agent to apply improvements in priority order
+- Document all changes made
+- Update iteration state
+
+## Execution
+
+### Step 4.1: Prepare Improvement Context
+
+```javascript
+const N = state.current_iteration;
+const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
+const evaluation = state.iterations[N - 1].evaluation;
+
+// Verify we have suggestions to apply
+if (!evaluation.suggestions || evaluation.suggestions.length === 0) {
+  // No suggestions -- skip improvement, mark iteration complete
+  state.iterations[N - 1].improvement = {
+    changes_applied: [],
+    changes_file: null,
+    improvement_rationale: 'No suggestions provided by evaluation'
+  };
+  state.iterations[N - 1].status = 'completed';
+  Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+  // -> Return to orchestrator for next iteration
+  return;
+}
+
+// Build file inventory for agent context
+const skillFileInventory = state.target_skills.map(skill => {
+  return `Skill: ${skill.name} (${skill.path})\nFiles:\n` +
+    skill.files.map(f => `  - ${f}`).join('\n');
+}).join('\n\n');
+
+// Chain mode: add chain relationship context
+const chainContext = state.execution_mode === 'chain'
+  ? `\nChain Order: ${state.chain_order.join(' -> ')}\n` +
+    `Chain Scores: ${state.chain_order.map(s =>
+      `${s}: ${state.iterations[N-1].evaluation?.chain_scores?.[s] ?? 'N/A'}`
+    ).join(', ')}\n` +
+    `Weakest Link: ${state.chain_order.reduce((min, s) => {
+      const score = state.iterations[N-1].evaluation?.chain_scores?.[s] ?? 100;
+      return score < (state.iterations[N-1].evaluation?.chain_scores?.[min] ?? 100) ? s : min;
+    }, state.chain_order[0])}`
+  : '';
+```
+
+### Step 4.2: Launch Improvement Agent
+
+> **CHECKPOINT**: Before launching agent, verify:
+> 1. evaluation.suggestions is non-empty
+> 2. All target_file paths in suggestions are valid
+
+```javascript
+const suggestionsText = evaluation.suggestions.map((s, i) =>
+  `${i + 1}. [${s.priority.toUpperCase()}] ${s.description}\n` +
+  `   Target: ${s.target_file}\n` +
+  `   Rationale: ${s.rationale}\n` +
+  (s.code_snippet ? `   Suggested change:\n   ${s.code_snippet}\n` : '')
+).join('\n');
+
+Agent({
+  subagent_type: 'general-purpose',
+  run_in_background: false,
+  description: `Apply skill improvements iteration ${N}`,
+  prompt: `## Task: Apply Targeted Improvements to Skill Files
+
+You are improving a workflow skill based on evaluation feedback. Apply ONLY the suggested changes -- do not refactor, add features, or "improve" beyond what is explicitly suggested.
+
+## Current Score: ${evaluation.score}/100
+Dimension breakdown:
+${evaluation.dimensions.map(d => `- ${d.name}: ${d.score}/100`).join('\n')}
+
+## Skill File Inventory
+${skillFileInventory}
+
+${chainContext ? `## Chain Context\n${chainContext}\n\nPrioritize improvements on the weakest skill in the chain. Also consider interface compatibility between adjacent skills in the chain.\n` : ''}
+
+## Improvement Suggestions (apply in priority order)
+${suggestionsText}
+
+## Rules
+1. Read each target file BEFORE modifying it
+2. Apply ONLY the suggested changes -- no unsolicited modifications
+3. If a suggestion's target_file doesn't exist, skip it and note in summary
+4. If a suggestion conflicts with existing patterns, adapt it to fit (note adaptation)
+5. Preserve existing code style, naming conventions, and structure
+6. After all changes, write a change summary to: ${iterDir}/iteration-${N}-changes.md
+
+## Changes Summary Format (write to ${iterDir}/iteration-${N}-changes.md)
+
+# Iteration ${N} Changes
+
+## Applied Suggestions
+- [high] description: what was changed in which file
+- [medium] description: what was changed in which file
+
+## Files Modified
+- path/to/file.md: brief description of changes
+
+## Skipped Suggestions (if any)
+- description: reason for skipping
+
+## Notes
+- Any adaptations or considerations
+
+## Success Criteria
+- All high-priority suggestions applied
+- Medium-priority suggestions applied if feasible
+- Low-priority suggestions applied if trivial
+- Changes summary written to ${iterDir}/iteration-${N}-changes.md
+`
+});
+```
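The prompt asks the agent to apply suggestions in priority order, while `suggestionsText` lists them in whatever order the evaluation returned them. A minimal sketch of sorting before formatting, assuming priorities are limited to `high`, `medium`, and `low` (the values used in the summary format above). `PRIORITY_RANK`, `sortByPriority`, `formatSuggestions`, and the sample data are hypothetical and not part of the skill.

```javascript
// Assumed priority vocabulary; anything else sorts last.
const PRIORITY_RANK = { high: 0, medium: 1, low: 2 };

// Returns a sorted copy. Array.prototype.sort is stable, so suggestions with
// the same priority keep their original relative order.
function sortByPriority(suggestions) {
  return [...suggestions].sort(
    (a, b) => (PRIORITY_RANK[a.priority] ?? 3) - (PRIORITY_RANK[b.priority] ?? 3)
  );
}

// Same numbered layout as suggestionsText, applied to the sorted list.
function formatSuggestions(suggestions) {
  return sortByPriority(suggestions).map((s, i) =>
    `${i + 1}. [${s.priority.toUpperCase()}] ${s.description}\n` +
    `   Target: ${s.target_file}\n` +
    `   Rationale: ${s.rationale}\n`
  ).join('\n');
}

// Illustrative data only.
const sampleSuggestions = [
  { priority: 'low', description: 'Tidy wording', target_file: 'SKILL.md', rationale: 'Readability' },
  { priority: 'high', description: 'Add missing step', target_file: 'phases/01-setup.md', rationale: 'Completeness' },
];
console.log(formatSuggestions(sampleSuggestions));
```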
+
+### Step 4.3: Verify Changes
+
+After agent completes:
+
+```javascript
+// Verify changes summary was written
+const changesFile = `${iterDir}/iteration-${N}-changes.md`;
+const changesExist = Glob(changesFile).length > 0;
+
+if (!changesExist) {
+  // Agent didn't write summary -- create a minimal one
+  Write(changesFile, `# Iteration ${N} Changes\n\n## Notes\nAgent completed but did not produce changes summary.\n`);
+}
+
+// Read changes summary to extract applied changes
+const changesContent = Read(changesFile);
+
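+// Parse applied changes (heuristic: bullet lines of the form "- [priority] ...")
+// The trailing .*$ captures the whole line; without it only "- [priority]" would match.
+const appliedMatches = changesContent.match(/^- \[.+?\].*$/gm) || [];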
+const changes_applied = appliedMatches.map(m => ({
+  summary: m.replace(/^- /, ''),
+  file: '' // Extracted from context
+}));
+```
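The heuristic above leaves `file` empty. One way to recover file names is to read the `## Files Modified` section of the summary. This is a minimal sketch that assumes the agent followed the summary format from Step 4.2 exactly. `sectionBullets`, `parseChangesSummary`, and the sample paths are hypothetical.

```javascript
// Collect the "- ..." bullet lines under a "## <heading>" line, stopping at the next "## " heading.
function sectionBullets(markdown, heading) {
  const lines = markdown.split('\n');
  const start = lines.findIndex(l => l.trim() === `## ${heading}`);
  if (start === -1) return [];
  const bullets = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith('## ')) break;
    if (line.startsWith('- ')) bullets.push(line.slice(2).trim());
  }
  return bullets;
}

// Split the summary into applied suggestions and modified files.
function parseChangesSummary(markdown) {
  const applied = sectionBullets(markdown, 'Applied Suggestions').map(text => {
    const m = text.match(/^\[(\w+)\]\s*(.*)$/);
    return m
      ? { priority: m[1].toLowerCase(), summary: m[2] }
      : { priority: null, summary: text };
  });
  const files = sectionBullets(markdown, 'Files Modified').map(text => {
    const idx = text.indexOf(': ');
    return idx === -1
      ? { file: text, note: '' }
      : { file: text.slice(0, idx), note: text.slice(idx + 2) };
  });
  return { applied, files };
}

// Illustrative summary; the file paths are made up.
const sampleSummary = [
  '# Iteration 1 Changes',
  '',
  '## Applied Suggestions',
  '- [high] Clarify step ordering: reworded Step 2 in phases/02-execute.md',
  '- [medium] Add error table: new section in SKILL.md',
  '',
  '## Files Modified',
  '- phases/02-execute.md: reworded Step 2',
  '- SKILL.md: added error table',
].join('\n');

const parsed = parseChangesSummary(sampleSummary);
console.log(parsed.applied.length, parsed.files.map(f => f.file));
```

If the agent deviates from the format, both lists simply come back shorter or empty, which matches the "create minimal summary, continue" recovery path.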
+
+### Step 4.4: Update State
+
+```javascript
+state.iterations[N - 1].improvement = {
+  changes_applied: changes_applied,
+  changes_file: changesFile,
+  improvement_rationale: `Applied ${changes_applied.length} improvements based on evaluation score ${evaluation.score}`
+};
+state.iterations[N - 1].status = 'completed';
+state.updated_at = new Date().toISOString();
+
+// Also update the skill files list in case new files were created
+for (const skill of state.target_skills) {
+  skill.files = Glob(`${skill.path}/**/*.md`).map(f => f.replace(skill.path + '/', ''));
+}
+
+Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+
+// -> Return to orchestrator for next iteration (Phase 2) or termination check
+```
+
+## Error Handling
+
+| Error | Recovery |
+|-------|----------|
+| Agent fails to complete | Rollback from skill-snapshot: `cp -r "${iterDir}/skill-snapshot/${skill.name}"/* "${skill.path}/"` (the `/*` must sit outside the quotes so the shell expands it) |
+| Agent corrupts files | Same rollback from snapshot |
+| Changes summary missing | Create minimal summary, continue |
+| target_file not found | Agent skips suggestion, notes in summary |
+
+## Output
+
+- **Files**: `iteration-{N}-changes.md`, modified skill files
+- **State**: `iterations[N-1].improvement` and `.status` updated
+- **Next**: Return to orchestrator, begin next iteration (Phase 2) or terminate
--- a/.claude/skills/skill-iter-tune/phases/05-report.md
+++ b/.claude/skills/skill-iter-tune/phases/05-report.md
@@ -0,0 +1,166 @@
+# Phase 5: Final Report
+
+> **COMPACT SENTINEL [Phase 5: Report]**
+> This phase contains 4 execution steps (Step 5.1 -- 5.4).
+> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
+> Recovery: `Read("phases/05-report.md")`
+
+Generate comprehensive iteration history report and display results to user.
+
+## Objective
+
+- Read complete iteration state
+- Generate formatted final report with score progression
+- Write final-report.md
+- Display summary to user
+
+## Execution
+
+### Step 5.1: Read Complete State
+
+```javascript
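+// `state` is already in scope from the orchestrator; refresh it in place from disk.
+// (Redeclaring it with `const` here would read `state.work_dir` before initialization.)
+Object.assign(state, JSON.parse(Read(`${state.work_dir}/iteration-state.json`)));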
+state.status = 'completed';
+state.updated_at = new Date().toISOString();
+```
+
+### Step 5.2: Generate Report
+
+```javascript
+// Determine outcome
+const outcomeMap = {
+  quality_threshold_met: 'PASSED -- Quality threshold reached',
+  max_iterations_reached: 'MAX ITERATIONS -- Threshold not reached',
+  convergence_detected: 'CONVERGED -- Score stopped improving',
+  error_limit_reached: 'FAILED -- Too many errors'
+};
+const outcome = outcomeMap[state.termination_reason] || 'COMPLETED';
+
+// Build score progression table
+const scoreTable = state.iterations
+  .filter(i => i.evaluation)
+  .map(i => {
+    const dims = i.evaluation.dimensions || [];
+    const dimScores = ['clarity', 'completeness', 'correctness', 'effectiveness', 'efficiency']
+      .map(id => {
+        const dim = dims.find(d => d.id === id);
+        return dim ? dim.score : '-';
+      });
+    return `| ${i.round} | ${i.evaluation.score} | ${dimScores.join(' | ')} |`;
+  }).join('\n');
+
+// Build iteration details
+const iterationDetails = state.iterations.map(iter => {
+  const evalSection = iter.evaluation
+    ? `**Score**: ${iter.evaluation.score}/100\n` +
+      `**Strengths**: ${iter.evaluation.strengths?.join(', ') || 'N/A'}\n` +
+      `**Weaknesses**: ${iter.evaluation.weaknesses?.slice(0, 3).join(', ') || 'N/A'}`
+    : '**Evaluation**: Skipped or failed';
+
+  const changesSection = iter.improvement
+    ? `**Changes Applied**: ${iter.improvement.changes_applied?.length || 0}\n` +
+      (iter.improvement.changes_applied?.map(c => `  - ${c.summary}`).join('\n') || '  None')
+    : '**Improvements**: None';
+
+  return `### Iteration ${iter.round}\n${evalSection}\n${changesSection}`;
+}).join('\n\n');
+
+const report = `# Skill Iter Tune -- Final Report
+
+## Summary
+
+| Field | Value |
+|-------|-------|
+| **Target Skills** | ${state.target_skills.map(s => s.name).join(', ')} |
+| **Execution Mode** | ${state.execution_mode} |${state.execution_mode === 'chain' ? `\n| **Chain Order** | ${state.chain_order.join(' -> ')} |` : ''}
+| **Test Scenario** | ${state.test_scenario.description} |
+| **Iterations** | ${state.iterations.length} |
+| **Initial Score** | ${state.score_trend[0] ?? 'N/A'} |
+| **Final Score** | ${state.latest_score}/100 |
+| **Quality Threshold** | ${state.quality_threshold} |
+| **Outcome** | ${outcome} |
+| **Started** | ${state.started_at} |
+| **Completed** | ${state.updated_at} |
+
+## Score Progression
+
+| Iter | Composite | Clarity | Completeness | Correctness | Effectiveness | Efficiency |
+|------|-----------|---------|--------------|-------------|---------------|------------|
+${scoreTable}
+
+**Trend**: ${state.score_trend.join(' -> ')}
+
+${state.execution_mode === 'chain' ? `
+## Chain Score Progression
+
+| Iter | ${state.chain_order.join(' | ')} |
+|------|${state.chain_order.map(() => '------').join('|')}|
+${state.iterations.filter(i => i.evaluation?.chain_scores).map(i => {
+  const scores = state.chain_order.map(s => i.evaluation.chain_scores[s] ?? '-');
+  return `| ${i.round} | ${scores.join(' | ')} |`;
+}).join('\n')}
+` : ''}
+
+## Iteration Details
+
+${iterationDetails}
+
+## Remaining Weaknesses
+
+${state.iterations.length > 0 && state.iterations[state.iterations.length - 1].evaluation
+  ? state.iterations[state.iterations.length - 1].evaluation.weaknesses?.map(w => `- ${w}`).join('\n') || 'None identified'
+  : 'No evaluation data available'}
+
+## Artifact Locations
+
+| Path | Description |
+|------|-------------|
+| \`${state.work_dir}/iteration-state.json\` | Complete state history |
+| \`${state.work_dir}/iterations/iteration-{N}/iteration-{N}-eval.md\` | Per-iteration evaluations |
+| \`${state.work_dir}/iterations/iteration-{N}/iteration-{N}-changes.md\` | Per-iteration change logs |
+| \`${state.work_dir}/final-report.md\` | This report |
+| \`${state.backup_dir}/\` | Original skill backups |
+
+## Restore Original
+
+To revert all changes and restore the original skill files:
+
+\`\`\`bash
+${state.target_skills.map(s => `cp -r "${state.backup_dir}/${s.name}"/* "${s.path}/"`).join('\n')}
+\`\`\`
+`;
+```
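The score progression rows can be checked in isolation with a pure function. This is a sketch under the assumption that each dimension object carries an `id` and a numeric `score`, as the lookup in Step 5.2 implies. `DIMENSION_IDS` and `buildScoreRows` are hypothetical names, and the iteration data is made up.

```javascript
const DIMENSION_IDS = ['clarity', 'completeness', 'correctness', 'effectiveness', 'efficiency'];

// One Markdown table row per evaluated iteration; iterations without an evaluation are skipped.
// `??` keeps a real score of 0 instead of replacing it with '-'.
function buildScoreRows(iterations) {
  return iterations
    .filter(iter => iter.evaluation)
    .map(iter => {
      const dims = iter.evaluation.dimensions || [];
      const cells = DIMENSION_IDS.map(id => dims.find(d => d.id === id)?.score ?? '-');
      return `| ${iter.round} | ${iter.evaluation.score} | ${cells.join(' | ')} |`;
    });
}

// Illustrative data only.
const rows = buildScoreRows([
  {
    round: 1,
    evaluation: {
      score: 62,
      dimensions: [{ id: 'clarity', score: 70 }, { id: 'efficiency', score: 0 }],
    },
  },
  { round: 2 }, // evaluation skipped or failed
]);
console.log(rows.join('\n'));
// | 1 | 62 | 70 | - | - | - | 0 |
```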
+
+### Step 5.3: Write Report and Update State
+
+```javascript
+Write(`${state.work_dir}/final-report.md`, report);
+
+state.status = 'completed';
+Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
+```
+
+### Step 5.4: Display Summary to User
+
+Output to user:
+
+```
+Skill Iter Tune Complete!
+
+Target: {skill names}
+Iterations: {count}
+Score: {initial} -> {final} ({outcome})
+Threshold: {threshold}
+
+Score trend: {score1} -> {score2} -> ... -> {scoreN}
+
+Full report: {workDir}/final-report.md
+Backups: {backupDir}/
+```
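The template above can be filled from the state object roughly as follows. This is a minimal sketch: `renderSummary` is a hypothetical helper, and mapping `{workDir}` and `{backupDir}` to `state.work_dir` and `state.backup_dir` is an assumption based on the fields used in Step 5.2.

```javascript
// Builds the plain-text summary shown to the user.
// `outcome` is the label resolved from the termination reason in Step 5.2.
function renderSummary(state, outcome) {
  const trend = state.score_trend || [];
  const initial = trend.length > 0 ? trend[0] : 'N/A';
  return [
    'Skill Iter Tune Complete!',
    '',
    `Target: ${state.target_skills.map(s => s.name).join(', ')}`,
    `Iterations: ${state.iterations.length}`,
    `Score: ${initial} -> ${state.latest_score} (${outcome})`,
    `Threshold: ${state.quality_threshold}`,
    '',
    `Score trend: ${trend.join(' -> ')}`,
    '',
    `Full report: ${state.work_dir}/final-report.md`,
    `Backups: ${state.backup_dir}/`,
  ].join('\n');
}

// Illustrative state; every value here is made up.
const summary = renderSummary(
  {
    target_skills: [{ name: 'demo-skill' }],
    iterations: [{}, {}],
    score_trend: [62, 85],
    latest_score: 85,
    quality_threshold: 80,
    work_dir: '/tmp/work',
    backup_dir: '/tmp/work/backups',
  },
  'PASSED'
);
console.log(summary);
```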
+
+## Output
+
+- **Files**: `final-report.md`
+- **State**: `status = completed`
+- **Next**: Workflow complete. Return control to user.