Implement phases for skill iteration tuning: Evaluation, Improvement, and Reporting

- Added Phase 3: Evaluate Quality with steps for preparing context, constructing evaluation prompts, executing evaluation via CLI, parsing scores, and checking termination conditions.
- Introduced Phase 4: Apply Improvements to implement targeted changes based on evaluation suggestions, including agent execution and change documentation.
- Created Phase 5: Final Report to generate a comprehensive report of the iteration process, including score progression and remaining weaknesses.
- Established evaluation criteria in a new document to guide the evaluation process.
- Developed templates for evaluation and execution prompts to standardize input for the evaluation and execution phases.
This commit is contained in:
catlog22
2026-03-10 21:42:58 +08:00
parent b4ad8c7b80
commit 9fb13ed6b0
10 changed files with 1777 additions and 0 deletions

View File

@@ -0,0 +1,144 @@
# Phase 1: Setup
Initialize workspace, backup skills, parse inputs.
## Objective
- Parse skill path(s) and test scenario from user input
- Validate all skill paths exist and contain SKILL.md
- Create isolated workspace directory structure
- Backup original skill files
- Initialize iteration-state.json
## Execution
### Step 1.1: Parse Input
Parse `$ARGUMENTS` to extract skill paths and test scenario.
```javascript
// Parse skill paths (first argument or comma-separated)
const args = $ARGUMENTS.trim();
const pathMatch = args.match(/^([^\s]+)/);
const rawPaths = pathMatch ? pathMatch[1].split(',') : [];
// Parse test scenario
const scenarioMatch = args.match(/(?:--scenario|--test)\s+"([^"]+)"/);
const scenarioText = scenarioMatch ? scenarioMatch[1] : args.replace(rawPaths.join(','), '').trim();
// Record chain order (preserves input order for chain mode)
const chainOrder = rawPaths.map(p => p.startsWith('.claude/') ? p.split('/').pop() : p);
// If no scenario, ask user
if (!scenarioText) {
const response = AskUserQuestion({
questions: [{
question: "Please describe the test scenario for evaluating this skill:",
header: "Test Scenario",
multiSelect: false,
options: [
{ label: "General quality test", description: "Evaluate overall skill quality with a generic task" },
{ label: "Specific scenario", description: "I'll describe a specific test case" }
]
}]
});
// Use response to construct testScenario
}
```
### Step 1.2: Validate Skill Paths
```javascript
const targetSkills = [];
for (const rawPath of rawPaths) {
const skillPath = rawPath.startsWith('.claude/') ? rawPath : `.claude/skills/${rawPath}`;
// Validate SKILL.md exists
const skillFiles = Glob(`${skillPath}/SKILL.md`);
if (skillFiles.length === 0) {
throw new Error(`Skill not found at: ${skillPath} -- SKILL.md missing`);
}
// Collect all skill files
const allFiles = Glob(`${skillPath}/**/*.md`);
targetSkills.push({
name: skillPath.split('/').pop(),
path: skillPath,
files: allFiles.map(f => f.replace(skillPath + '/', '')),
primary_file: 'SKILL.md'
});
}
```
### Step 1.3: Create Workspace
```javascript
const ts = Date.now();
const workDir = `.workflow/.scratchpad/skill-iter-tune-${ts}`;
Bash(`mkdir -p "${workDir}/backups" "${workDir}/iterations"`);
```
### Step 1.4: Backup Original Skills
```javascript
for (const skill of targetSkills) {
Bash(`cp -r "${skill.path}" "${workDir}/backups/${skill.name}"`);
}
```
### Step 1.5: Initialize State
Write `iteration-state.json` with initial state:
```javascript
const initialState = {
status: 'running',
started_at: new Date().toISOString(),
updated_at: new Date().toISOString(),
target_skills: targetSkills,
test_scenario: {
description: scenarioText,
// Parse --requirements and --input-args from $ARGUMENTS if provided
// e.g., --requirements "clear output,no errors" --input-args "my-skill --scenario test"
requirements: parseListArg(args, '--requirements') || [],
input_args: parseStringArg(args, '--input-args') || '',
success_criteria: parseStringArg(args, '--success-criteria') || 'Produces correct, high-quality output'
},
execution_mode: workflowPreferences.executionMode || 'single',
chain_order: workflowPreferences.executionMode === 'chain'
? targetSkills.map(s => s.name)
: [],
current_iteration: 0,
max_iterations: workflowPreferences.maxIterations,
quality_threshold: workflowPreferences.qualityThreshold,
latest_score: 0,
score_trend: [],
converged: false,
iterations: [],
errors: [],
error_count: 0,
max_errors: 3,
work_dir: workDir,
backup_dir: `${workDir}/backups`
};
Write(`${workDir}/iteration-state.json`, JSON.stringify(initialState, null, 2));
// Chain mode: create per-skill tracking tasks
if (initialState.execution_mode === 'chain') {
for (const skill of targetSkills) {
TaskCreate({
subject: `Chain: ${skill.name}`,
activeForm: `Tracking ${skill.name}`,
description: `Skill chain member: ${skill.path} | Position: ${targetSkills.indexOf(skill) + 1}/${targetSkills.length}`
});
}
}
```
## Output
- **Variables**: `workDir`, `targetSkills[]`, `testScenario`, `chainOrder` (chain mode)
- **Files**: `iteration-state.json`, `backups/` directory with skill copies
- **TodoWrite**: Mark Phase 1 completed, start Iteration Loop. Chain mode: per-skill tracking tasks created

View File

@@ -0,0 +1,292 @@
# Phase 2: Execute Skill
> **COMPACT SENTINEL [Phase 2: Execute]**
> This phase contains 4 execution steps (Step 2.1 -- 2.4).
> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
> Recovery: `Read("phases/02-execute.md")`
Execute the target skill against the test scenario using `ccw cli --tool claude --mode write`. Claude receives the full skill definition and simulates producing its expected output artifacts.
## Objective
- Snapshot current skill version before execution
- Construct execution prompt with full skill content + test scenario
- Execute via ccw cli Claude
- Collect output artifacts
## Execution
### Step 2.1: Snapshot Current Skill
```javascript
const N = state.current_iteration;
const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
Bash(`mkdir -p "${iterDir}/skill-snapshot" "${iterDir}/artifacts"`);
// Chain mode: create per-skill artifact directories
if (state.execution_mode === 'chain') {
for (const skillName of state.chain_order) {
Bash(`mkdir -p "${iterDir}/artifacts/${skillName}"`);
}
}
// Snapshot current skill state (so we can compare/rollback)
for (const skill of state.target_skills) {
Bash(`cp -r "${skill.path}" "${iterDir}/skill-snapshot/${skill.name}"`);
}
```
### Step 2.2: Construct Execution Prompt (Single Mode)
Read the execute-prompt template and substitute variables.
> Skip to Step 2.2b if `state.execution_mode === 'chain'`.
```javascript
// Ref: templates/execute-prompt.md
// Build skillContent by reading only executable skill files (SKILL.md, phases/, specs/)
// Exclude README.md, docs/, and other non-executable files to save tokens
const skillContent = state.target_skills.map(skill => {
const skillMd = Read(`${skill.path}/SKILL.md`);
const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
return `### File: SKILL.md\n${skillMd}\n\n` +
phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
(specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
}).join('\n\n---\n\n');
// Construct full prompt using template
const executePrompt = `PURPOSE: Simulate executing the following workflow skill against a test scenario. Produce all expected output artifacts as if the skill were invoked with the given input.
SKILL CONTENT:
${skillContent}
TEST SCENARIO:
Description: ${state.test_scenario.description}
Input Arguments: ${state.test_scenario.input_args}
Requirements: ${state.test_scenario.requirements.join('; ')}
Success Criteria: ${state.test_scenario.success_criteria}
TASK:
1. Study the complete skill structure (SKILL.md + all phase files)
2. Follow the skill execution flow sequentially
3. For each phase, produce the artifacts that phase would generate
4. Write all output artifacts to the current working directory
5. Create a manifest.json listing all produced artifacts
MODE: write
CONTEXT: @**/*
EXPECTED: All artifacts written to disk + manifest.json
CONSTRAINTS: Follow skill flow exactly, produce realistic output, not placeholders`;
```
### Step 2.3: Execute via ccw cli
> **CHECKPOINT**: Before executing CLI, verify:
> 1. This phase is TodoWrite `in_progress`
> 2. `iterDir/artifacts/` directory exists
> 3. Prompt is properly escaped
```javascript
function escapeForShell(str) {
return str.replace(/"/g, '\\"').replace(/\$/g, '\\$').replace(/`/g, '\\`');
}
const cliCommand = `ccw cli -p "${escapeForShell(executePrompt)}" --tool claude --mode write --cd "${iterDir}/artifacts"`;
// Execute in background, wait for hook callback
Bash({
command: cliCommand,
run_in_background: true,
timeout: 600000 // 10 minutes max
});
// STOP HERE -- wait for hook callback to resume
// After callback, verify artifacts were produced
```
### Step 2.2b: Chain Execution Path
> Skip this step if `state.execution_mode === 'single'`.
In chain mode, execute each skill sequentially. Each skill receives the previous skill's artifacts as input context.
```javascript
// Chain execution: iterate through chain_order
let previousArtifacts = ''; // Accumulates upstream output
for (let i = 0; i < state.chain_order.length; i++) {
const skillName = state.chain_order[i];
const skill = state.target_skills.find(s => s.name === skillName);
const skillArtifactDir = `${iterDir}/artifacts/${skillName}`;
// Build this skill's content
const skillMd = Read(`${skill.path}/SKILL.md`);
const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
const singleSkillContent = `### File: SKILL.md\n${skillMd}\n\n` +
phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
(specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
// Build chain context from previous skill's artifacts
const chainInputContext = previousArtifacts
? `\nPREVIOUS CHAIN OUTPUT (from upstream skill "${state.chain_order[i - 1]}"):\n${previousArtifacts}\n\nIMPORTANT: Use the above output as input context for this skill's execution.\n`
: '';
// Construct per-skill execution prompt
// Ref: templates/execute-prompt.md
const chainPrompt = `PURPOSE: Simulate executing the following workflow skill against a test scenario. Produce all expected output artifacts.
SKILL CONTENT (${skillName} — chain position ${i + 1}/${state.chain_order.length}):
${singleSkillContent}
${chainInputContext}
TEST SCENARIO:
Description: ${state.test_scenario.description}
Input Arguments: ${state.test_scenario.input_args}
Requirements: ${state.test_scenario.requirements.join('; ')}
Success Criteria: ${state.test_scenario.success_criteria}
TASK:
1. Study the complete skill structure
2. Follow the skill execution flow sequentially
3. Produce all expected artifacts
4. Write output to the current working directory
5. Create manifest.json listing all produced artifacts
MODE: write
CONTEXT: @**/*
CONSTRAINTS: Follow skill flow exactly, produce realistic output`;
function escapeForShell(str) {
return str.replace(/"/g, '\\"').replace(/\$/g, '\\$').replace(/`/g, '\\`');
}
const cliCommand = `ccw cli -p "${escapeForShell(chainPrompt)}" --tool claude --mode write --cd "${skillArtifactDir}"`;
// Execute in background
Bash({
command: cliCommand,
run_in_background: true,
timeout: 600000
});
// STOP -- wait for hook callback
// After callback: collect artifacts for next skill in chain
const artifacts = Glob(`${skillArtifactDir}/**/*`);
const skillSuccess = artifacts.length > 0;
if (skillSuccess) {
previousArtifacts = artifacts.slice(0, 10).map(f => {
const relPath = f.replace(skillArtifactDir + '/', '');
const content = Read(f, { limit: 100 });
return `--- ${relPath} ---\n${content}`;
}).join('\n\n');
} else {
// Mid-chain failure: keep previous artifacts for downstream skills
// Log warning but continue chain — downstream skills receive last successful output
state.errors.push({
phase: 'execute',
message: `Chain skill "${skillName}" (position ${i + 1}) produced no artifacts. Downstream skills will receive upstream output from "${state.chain_order[i - 1] || 'none'}" instead.`,
timestamp: new Date().toISOString()
});
state.error_count++;
// previousArtifacts remains from last successful skill (or empty if first)
}
// Update per-skill TodoWrite
// TaskUpdate chain skill task with execution status
// Record per-skill execution
if (!state.iterations[N - 1].execution.chain_executions) {
state.iterations[N - 1].execution.chain_executions = [];
}
state.iterations[N - 1].execution.chain_executions.push({
skill_name: skillName,
cli_command: cliCommand,
artifacts_dir: skillArtifactDir,
success: skillSuccess
});
// Check error budget: abort chain if too many consecutive failures
if (state.error_count >= 3) {
state.errors.push({
phase: 'execute',
message: `Chain execution aborted at skill "${skillName}" — error limit reached (${state.error_count} errors).`,
timestamp: new Date().toISOString()
});
break;
}
}
```
### Step 2.4: Collect Artifacts
After CLI completes (hook callback received):
```javascript
// List produced artifacts
const artifactFiles = Glob(`${iterDir}/artifacts/**/*`);
// Chain mode: check per-skill artifacts
if (state.execution_mode === 'chain') {
const chainSuccess = state.iterations[N - 1].execution.chain_executions?.every(e => e.success) ?? false;
state.iterations[N - 1].execution.success = chainSuccess;
state.iterations[N - 1].execution.artifacts_dir = `${iterDir}/artifacts`;
} else {
if (artifactFiles.length === 0) {
// Execution produced nothing -- record error
state.iterations[N - 1].execution = {
cli_command: cliCommand,
started_at: new Date().toISOString(),
completed_at: new Date().toISOString(),
artifacts_dir: `${iterDir}/artifacts`,
success: false
};
state.error_count++;
// Continue to Phase 3 anyway -- Gemini can evaluate the skill even without artifacts
} else {
state.iterations[N - 1].execution = {
cli_command: cliCommand,
started_at: new Date().toISOString(),
completed_at: new Date().toISOString(),
artifacts_dir: `${iterDir}/artifacts`,
success: true
};
}
} // end single mode branch
// Update state
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
```
## Error Handling
| Error | Recovery |
|-------|----------|
| CLI timeout (10min) | Record failure, continue to Phase 3 without artifacts |
| CLI crash | Retry once with simplified prompt (SKILL.md only, no phase files) |
| No artifacts produced | Continue to Phase 3, evaluation focuses on skill definition quality |
## Output
- **Files**: `iteration-{N}/skill-snapshot/`, `iteration-{N}/artifacts/`
- **State**: `iterations[N-1].execution` updated
- **Next**: Phase 3 (Evaluate)

View File

@@ -0,0 +1,312 @@
# Phase 3: Evaluate Quality
> **COMPACT SENTINEL [Phase 3: Evaluate]**
> This phase contains 5 execution steps (Step 3.1 -- 3.5).
> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
> Recovery: `Read("phases/03-evaluate.md")`
Evaluate skill quality using `ccw cli --tool gemini --mode analysis`. Gemini scores the skill across 5 dimensions and provides improvement suggestions.
## Objective
- Construct evaluation prompt with skill + artifacts + criteria
- Execute via ccw cli Gemini
- Parse multi-dimensional score
- Write iteration-{N}-eval.md
- Check termination conditions
## Execution
### Step 3.1: Prepare Evaluation Context
```javascript
const N = state.current_iteration;
const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
// Read evaluation criteria
// Ref: specs/evaluation-criteria.md
const evaluationCriteria = Read('.claude/skills/skill-iter-tune/specs/evaluation-criteria.md');
// Build skillContent (same pattern as Phase 02 — only executable files)
const skillContent = state.target_skills.map(skill => {
const skillMd = Read(`${skill.path}/SKILL.md`);
const phaseFiles = Glob(`${skill.path}/phases/*.md`).sort().map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
relativePath: f.replace(skill.path + '/', ''),
content: Read(f)
}));
return `### File: SKILL.md\n${skillMd}\n\n` +
phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') +
(specFiles.length > 0 ? '\n\n' + specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : '');
}).join('\n\n---\n\n');
// Build artifacts summary
let artifactsSummary = 'No artifacts produced (execution may have failed)';
if (state.execution_mode === 'chain') {
// Chain mode: group artifacts by skill
const chainSummaries = state.chain_order.map(skillName => {
const skillArtifactDir = `${iterDir}/artifacts/${skillName}`;
const files = Glob(`${skillArtifactDir}/**/*`);
if (files.length === 0) return `### ${skillName} (no artifacts)`;
const filesSummary = files.map(f => {
const relPath = f.replace(`${skillArtifactDir}/`, '');
const content = Read(f, { limit: 200 });
return `--- ${relPath} ---\n${content}`;
}).join('\n\n');
return `### ${skillName} (chain position ${state.chain_order.indexOf(skillName) + 1})\n${filesSummary}`;
});
artifactsSummary = chainSummaries.join('\n\n---\n\n');
} else {
// Single mode (existing)
const artifactFiles = Glob(`${iterDir}/artifacts/**/*`);
if (artifactFiles.length > 0) {
artifactsSummary = artifactFiles.map(f => {
const relPath = f.replace(`${iterDir}/artifacts/`, '');
const content = Read(f, { limit: 200 });
return `--- ${relPath} ---\n${content}`;
}).join('\n\n');
}
}
// Build previous evaluation context
const previousEvalContext = state.iterations.filter(i => i.evaluation).length > 0
? `PREVIOUS ITERATIONS:\n` + state.iterations.filter(i => i.evaluation).map(iter =>
`Iteration ${iter.round}: Score ${iter.evaluation.score}\n` +
` Applied: ${iter.improvement?.changes_applied?.map(c => c.summary).join('; ') || 'none'}\n` +
` Weaknesses: ${iter.evaluation.weaknesses?.slice(0, 3).join('; ') || 'none'}`
).join('\n') + '\nIMPORTANT: Focus on NEW issues not yet addressed.'
: '';
```
### Step 3.2: Construct Evaluation Prompt
```javascript
// Ref: templates/eval-prompt.md
const evalPrompt = `PURPOSE: Evaluate the quality of a workflow skill by examining its definition and produced artifacts.
SKILL DEFINITION:
${skillContent}
TEST SCENARIO:
${state.test_scenario.description}
Requirements: ${state.test_scenario.requirements.join('; ')}
Success Criteria: ${state.test_scenario.success_criteria}
ARTIFACTS PRODUCED:
${artifactsSummary}
EVALUATION CRITERIA:
${evaluationCriteria}
${previousEvalContext}
${state.execution_mode === 'chain' ? `
CHAIN CONTEXT:
This skill chain contains ${state.chain_order.length} skills executed in order:
${state.chain_order.map((s, i) => `${i+1}. ${s}`).join('\n')}
Current evaluation covers the entire chain output.
Please provide per-skill quality scores in an additional "chain_scores" field: { "${state.chain_order[0]}": <score>, ... }
` : ''}
TASK:
1. Score each dimension (Clarity 0.20, Completeness 0.25, Correctness 0.25, Effectiveness 0.20, Efficiency 0.10) on 0-100
2. Calculate weighted composite score
3. List top 3 strengths
4. List top 3-5 weaknesses with file:section references
5. Provide 3-5 prioritized improvement suggestions with concrete changes
EXPECTED OUTPUT (strict JSON, no markdown):
{
"composite_score": <0-100>,
"dimensions": [
{"name":"Clarity","id":"clarity","score":<0-100>,"weight":0.20,"feedback":"..."},
{"name":"Completeness","id":"completeness","score":<0-100>,"weight":0.25,"feedback":"..."},
{"name":"Correctness","id":"correctness","score":<0-100>,"weight":0.25,"feedback":"..."},
{"name":"Effectiveness","id":"effectiveness","score":<0-100>,"weight":0.20,"feedback":"..."},
{"name":"Efficiency","id":"efficiency","score":<0-100>,"weight":0.10,"feedback":"..."}
],
"strengths": ["...", "...", "..."],
"weaknesses": ["...with file:section ref...", "..."],
"suggestions": [
{"priority":"high|medium|low","target_file":"...","description":"...","rationale":"...","code_snippet":"..."}
]
}
CONSTRAINTS: Be rigorous, reference exact files, focus on highest-impact changes, output ONLY JSON`;
```
### Step 3.3: Execute via ccw cli Gemini
> **CHECKPOINT**: Verify evaluation prompt is properly constructed before CLI execution.
```javascript
// Shell escape utility (same as Phase 02)
function escapeForShell(str) {
return str.replace(/"/g, '\\"').replace(/\$/g, '\\$').replace(/`/g, '\\`');
}
const skillPath = state.target_skills[0].path; // Primary skill for --cd
const cliCommand = `ccw cli -p "${escapeForShell(evalPrompt)}" --tool gemini --mode analysis --cd "${skillPath}"`;
// Execute in background
Bash({
command: cliCommand,
run_in_background: true,
timeout: 300000 // 5 minutes
});
// STOP -- wait for hook callback
```
### Step 3.4: Parse Score and Write Eval File
After CLI completes:
```javascript
// Parse JSON from Gemini output
// The output may contain markdown wrapping -- extract JSON
const rawOutput = /* CLI output from callback */;
const jsonMatch = rawOutput.match(/\{[\s\S]*\}/);
let evaluation;
if (jsonMatch) {
try {
evaluation = JSON.parse(jsonMatch[0]);
// Extract chain_scores if present
if (state.execution_mode === 'chain' && evaluation.chain_scores) {
state.iterations[N - 1].evaluation.chain_scores = evaluation.chain_scores;
}
} catch (e) {
// Fallback: try to extract score heuristically
const scoreMatch = rawOutput.match(/"composite_score"\s*:\s*(\d+)/);
evaluation = {
composite_score: scoreMatch ? parseInt(scoreMatch[1]) : 50,
dimensions: [],
strengths: [],
weaknesses: ['Evaluation output parsing failed -- raw output saved'],
suggestions: []
};
}
} else {
evaluation = {
composite_score: 50,
dimensions: [],
strengths: [],
weaknesses: ['No structured evaluation output -- defaulting to 50'],
suggestions: []
};
}
// Write iteration-N-eval.md
const evalReport = `# Iteration ${N} Evaluation
**Composite Score**: ${evaluation.composite_score}/100
**Date**: ${new Date().toISOString()}
## Dimension Scores
| Dimension | Score | Weight | Feedback |
|-----------|-------|--------|----------|
${(evaluation.dimensions || []).map(d =>
`| ${d.name} | ${d.score} | ${d.weight} | ${d.feedback} |`
).join('\n')}
${(state.execution_mode === 'chain' && evaluation.chain_scores) ? `
## Chain Scores
| Skill | Score | Chain Position |
|-------|-------|----------------|
${state.chain_order.map((s, i) => `| ${s} | ${evaluation.chain_scores[s] || '-'} | ${i + 1} |`).join('\n')}
` : ''}
## Strengths
${(evaluation.strengths || []).map(s => `- ${s}`).join('\n')}
## Weaknesses
${(evaluation.weaknesses || []).map(w => `- ${w}`).join('\n')}
## Improvement Suggestions
${(evaluation.suggestions || []).map((s, i) =>
`### ${i + 1}. [${s.priority}] ${s.description}\n- **Target**: ${s.target_file}\n- **Rationale**: ${s.rationale}\n${s.code_snippet ? `- **Suggested**:\n\`\`\`\n${s.code_snippet}\n\`\`\`` : ''}`
).join('\n\n')}
`;
Write(`${iterDir}/iteration-${N}-eval.md`, evalReport);
// Update state
state.iterations[N - 1].evaluation = {
score: evaluation.composite_score,
dimensions: evaluation.dimensions || [],
strengths: evaluation.strengths || [],
weaknesses: evaluation.weaknesses || [],
suggestions: evaluation.suggestions || [],
chain_scores: evaluation.chain_scores || null,
eval_file: `${iterDir}/iteration-${N}-eval.md`
};
state.latest_score = evaluation.composite_score;
state.score_trend.push(evaluation.composite_score);
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
```
### Step 3.5: Check Termination
```javascript
function shouldTerminate(state) {
// 1. Quality threshold met
if (state.latest_score >= state.quality_threshold) {
return { terminate: true, reason: 'quality_threshold_met' };
}
// 2. Max iterations reached
if (state.current_iteration >= state.max_iterations) {
return { terminate: true, reason: 'max_iterations_reached' };
}
// 3. Convergence: no improvement in last 2 iterations
if (state.score_trend.length >= 3) {
const last3 = state.score_trend.slice(-3);
const improvement = last3[2] - last3[0];
if (improvement <= 2) {
state.converged = true;
return { terminate: true, reason: 'convergence_detected' };
}
}
// 4. Error limit
if (state.error_count >= state.max_errors) {
return { terminate: true, reason: 'error_limit_reached' };
}
return { terminate: false };
}
const termination = shouldTerminate(state);
if (termination.terminate) {
state.termination_reason = termination.reason;
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
// Skip Phase 4, go directly to Phase 5 (Report)
} else {
// Continue to Phase 4 (Improve)
}
```
## Error Handling
| Error | Recovery |
|-------|----------|
| CLI timeout | Retry once, if still fails use score 50 with warning |
| JSON parse failure | Extract score heuristically, save raw output |
| No output | Default score 50, note in weaknesses |
## Output
- **Files**: `iteration-{N}-eval.md`
- **State**: `iterations[N-1].evaluation`, `latest_score`, `score_trend` updated
- **Decision**: terminate -> Phase 5, continue -> Phase 4
- **TodoWrite**: Update current iteration score display

View File

@@ -0,0 +1,186 @@
# Phase 4: Apply Improvements
> **COMPACT SENTINEL [Phase 4: Improve]**
> This phase contains 4 execution steps (Step 4.1 -- 4.4).
> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
> Recovery: `Read("phases/04-improve.md")`
Apply targeted improvements to skill files based on evaluation suggestions. Uses a general-purpose Agent to make changes, ensuring only suggested modifications are applied.
## Objective
- Read evaluation suggestions from current iteration
- Launch Agent to apply improvements in priority order
- Document all changes made
- Update iteration state
## Execution
### Step 4.1: Prepare Improvement Context
```javascript
const N = state.current_iteration;
const iterDir = `${state.work_dir}/iterations/iteration-${N}`;
const evaluation = state.iterations[N - 1].evaluation;
// Verify we have suggestions to apply
if (!evaluation.suggestions || evaluation.suggestions.length === 0) {
// No suggestions -- skip improvement, mark iteration complete
state.iterations[N - 1].improvement = {
changes_applied: [],
changes_file: null,
improvement_rationale: 'No suggestions provided by evaluation'
};
state.iterations[N - 1].status = 'completed';
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
// -> Return to orchestrator for next iteration
return;
}
// Build file inventory for agent context
const skillFileInventory = state.target_skills.map(skill => {
return `Skill: ${skill.name} (${skill.path})\nFiles:\n` +
skill.files.map(f => ` - ${f}`).join('\n');
}).join('\n\n');
// Chain mode: add chain relationship context
const chainContext = state.execution_mode === 'chain'
? `\nChain Order: ${state.chain_order.join(' -> ')}\n` +
`Chain Scores: ${state.chain_order.map(s =>
`${s}: ${state.iterations[N-1].evaluation?.chain_scores?.[s] || 'N/A'}`
).join(', ')}\n` +
`Weakest Link: ${state.chain_order.reduce((min, s) => {
const score = state.iterations[N-1].evaluation?.chain_scores?.[s] || 100;
return score < (state.iterations[N-1].evaluation?.chain_scores?.[min] || 100) ? s : min;
}, state.chain_order[0])}`
: '';
```
### Step 4.2: Launch Improvement Agent
> **CHECKPOINT**: Before launching agent, verify:
> 1. evaluation.suggestions is non-empty
> 2. All target_file paths in suggestions are valid
```javascript
const suggestionsText = evaluation.suggestions.map((s, i) =>
`${i + 1}. [${s.priority.toUpperCase()}] ${s.description}\n` +
` Target: ${s.target_file}\n` +
` Rationale: ${s.rationale}\n` +
(s.code_snippet ? ` Suggested change:\n ${s.code_snippet}\n` : '')
).join('\n');
Agent({
subagent_type: 'general-purpose',
run_in_background: false,
description: `Apply skill improvements iteration ${N}`,
prompt: `## Task: Apply Targeted Improvements to Skill Files
You are improving a workflow skill based on evaluation feedback. Apply ONLY the suggested changes -- do not refactor, add features, or "improve" beyond what is explicitly suggested.
## Current Score: ${evaluation.score}/100
Dimension breakdown:
${evaluation.dimensions.map(d => `- ${d.name}: ${d.score}/100`).join('\n')}
## Skill File Inventory
${skillFileInventory}
${chainContext ? `## Chain Context\n${chainContext}\n\nPrioritize improvements on the weakest skill in the chain. Also consider interface compatibility between adjacent skills in the chain.\n` : ''}
## Improvement Suggestions (apply in priority order)
${suggestionsText}
## Rules
1. Read each target file BEFORE modifying it
2. Apply ONLY the suggested changes -- no unsolicited modifications
3. If a suggestion's target_file doesn't exist, skip it and note in summary
4. If a suggestion conflicts with existing patterns, adapt it to fit (note adaptation)
5. Preserve existing code style, naming conventions, and structure
6. After all changes, write a change summary to: ${iterDir}/iteration-${N}-changes.md
## Changes Summary Format (write to ${iterDir}/iteration-${N}-changes.md)
# Iteration ${N} Changes
## Applied Suggestions
- [high] description: what was changed in which file
- [medium] description: what was changed in which file
## Files Modified
- path/to/file.md: brief description of changes
## Skipped Suggestions (if any)
- description: reason for skipping
## Notes
- Any adaptations or considerations
## Success Criteria
- All high-priority suggestions applied
- Medium-priority suggestions applied if feasible
- Low-priority suggestions applied if trivial
- Changes summary written to ${iterDir}/iteration-${N}-changes.md
`
});
```
### Step 4.3: Verify Changes
After agent completes:
```javascript
// Verify changes summary was written
const changesFile = `${iterDir}/iteration-${N}-changes.md`;
const changesExist = Glob(changesFile).length > 0;
if (!changesExist) {
// Agent didn't write summary -- create a minimal one
Write(changesFile, `# Iteration ${N} Changes\n\n## Notes\nAgent completed but did not produce changes summary.\n`);
}
// Read changes summary to extract applied changes
const changesContent = Read(changesFile);
// Parse applied changes (heuristic: count lines starting with "- [")
const appliedMatches = changesContent.match(/^- \[.+?\]/gm) || [];
const changes_applied = appliedMatches.map(m => ({
summary: m.replace(/^- /, ''),
file: '' // Extracted from context
}));
```
### Step 4.4: Update State
```javascript
state.iterations[N - 1].improvement = {
changes_applied: changes_applied,
changes_file: changesFile,
improvement_rationale: `Applied ${changes_applied.length} improvements based on evaluation score ${evaluation.score}`
};
state.iterations[N - 1].status = 'completed';
state.updated_at = new Date().toISOString();
// Also update the skill files list in case new files were created
for (const skill of state.target_skills) {
skill.files = Glob(`${skill.path}/**/*.md`).map(f => f.replace(skill.path + '/', ''));
}
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
// -> Return to orchestrator for next iteration (Phase 2) or termination check
```
## Error Handling
| Error | Recovery |
|-------|----------|
| Agent fails to complete | Rollback from skill-snapshot: `cp -r "${iterDir}/skill-snapshot/${skill.name}/*" "${skill.path}/"` |
| Agent corrupts files | Same rollback from snapshot |
| Changes summary missing | Create minimal summary, continue |
| target_file not found | Agent skips suggestion, notes in summary |
## Output
- **Files**: `iteration-{N}-changes.md`, modified skill files
- **State**: `iterations[N-1].improvement` and `.status` updated
- **Next**: Return to orchestrator, begin next iteration (Phase 2) or terminate

View File

@@ -0,0 +1,166 @@
# Phase 5: Final Report
> **COMPACT SENTINEL [Phase 5: Report]**
> This phase contains 4 execution steps (Step 5.1 -- 5.4).
> If you can read this sentinel but cannot find the full Step protocol below, context has been compressed.
> Recovery: `Read("phases/05-report.md")`
Generate comprehensive iteration history report and display results to user.
## Objective
- Read complete iteration state
- Generate formatted final report with score progression
- Write final-report.md
- Display summary to user
## Execution
### Step 5.1: Read Complete State
```javascript
const state = JSON.parse(Read(`${state.work_dir}/iteration-state.json`));
state.status = 'completed';
state.updated_at = new Date().toISOString();
```
### Step 5.2: Generate Report
```javascript
// Determine outcome
const outcomeMap = {
quality_threshold_met: 'PASSED -- Quality threshold reached',
max_iterations_reached: 'MAX ITERATIONS -- Threshold not reached',
convergence_detected: 'CONVERGED -- Score stopped improving',
error_limit_reached: 'FAILED -- Too many errors'
};
const outcome = outcomeMap[state.termination_reason] || 'COMPLETED';
// Build score progression table
const scoreTable = state.iterations
.filter(i => i.evaluation)
.map(i => {
const dims = i.evaluation.dimensions || [];
const dimScores = ['clarity', 'completeness', 'correctness', 'effectiveness', 'efficiency']
.map(id => {
const dim = dims.find(d => d.id === id);
return dim ? dim.score : '-';
});
return `| ${i.round} | ${i.evaluation.score} | ${dimScores.join(' | ')} |`;
}).join('\n');
// Build iteration details
const iterationDetails = state.iterations.map(iter => {
const evalSection = iter.evaluation
? `**Score**: ${iter.evaluation.score}/100\n` +
`**Strengths**: ${iter.evaluation.strengths?.join(', ') || 'N/A'}\n` +
`**Weaknesses**: ${iter.evaluation.weaknesses?.slice(0, 3).join(', ') || 'N/A'}`
: '**Evaluation**: Skipped or failed';
const changesSection = iter.improvement
? `**Changes Applied**: ${iter.improvement.changes_applied?.length || 0}\n` +
(iter.improvement.changes_applied?.map(c => ` - ${c.summary}`).join('\n') || ' None')
: '**Improvements**: None';
return `### Iteration ${iter.round}\n${evalSection}\n${changesSection}`;
}).join('\n\n');
const report = `# Skill Iter Tune -- Final Report
## Summary
| Field | Value |
|-------|-------|
| **Target Skills** | ${state.target_skills.map(s => s.name).join(', ')} |
| **Execution Mode** | ${state.execution_mode} |
${state.execution_mode === 'chain' ? `| **Chain Order** | ${state.chain_order.join(' -> ')} |` : ''}
| **Test Scenario** | ${state.test_scenario.description} |
| **Iterations** | ${state.iterations.length} |
| **Initial Score** | ${state.score_trend[0] || 'N/A'} |
| **Final Score** | ${state.latest_score}/100 |
| **Quality Threshold** | ${state.quality_threshold} |
| **Outcome** | ${outcome} |
| **Started** | ${state.started_at} |
| **Completed** | ${state.updated_at} |
## Score Progression
| Iter | Composite | Clarity | Completeness | Correctness | Effectiveness | Efficiency |
|------|-----------|---------|--------------|-------------|---------------|------------|
${scoreTable}
**Trend**: ${state.score_trend.join(' -> ')}
${state.execution_mode === 'chain' ? `
## Chain Score Progression
| Iter | ${state.chain_order.join(' | ')} |
|------|${state.chain_order.map(() => '------').join('|')}|
${state.iterations.filter(i => i.evaluation?.chain_scores).map(i => {
const scores = state.chain_order.map(s => i.evaluation.chain_scores[s] || '-');
return `| ${i.round} | ${scores.join(' | ')} |`;
}).join('\n')}
` : ''}
## Iteration Details
${iterationDetails}
## Remaining Weaknesses
${state.iterations.length > 0 && state.iterations[state.iterations.length - 1].evaluation
? state.iterations[state.iterations.length - 1].evaluation.weaknesses?.map(w => `- ${w}`).join('\n') || 'None identified'
: 'No evaluation data available'}
## Artifact Locations
| Path | Description |
|------|-------------|
| \`${state.work_dir}/iteration-state.json\` | Complete state history |
| \`${state.work_dir}/iterations/iteration-{N}/iteration-{N}-eval.md\` | Per-iteration evaluations |
| \`${state.work_dir}/iterations/iteration-{N}/iteration-{N}-changes.md\` | Per-iteration change logs |
| \`${state.work_dir}/final-report.md\` | This report |
| \`${state.backup_dir}/\` | Original skill backups |
## Restore Original
To revert all changes and restore the original skill files:
\`\`\`bash
${state.target_skills.map(s => `cp -r "${state.backup_dir}/${s.name}"/* "${s.path}/"`).join('\n')}
\`\`\`
`;
```
### Step 5.3: Write Report and Update State
```javascript
Write(`${state.work_dir}/final-report.md`, report);
state.status = 'completed';
Write(`${state.work_dir}/iteration-state.json`, JSON.stringify(state, null, 2));
```
### Step 5.4: Display Summary to User
Output to user:
```
Skill Iter Tune Complete!
Target: {skill names}
Iterations: {count}
Score: {initial} -> {final} ({outcome})
Threshold: {threshold}
Score trend: {score1} -> {score2} -> ... -> {scoreN}
Full report: {workDir}/final-report.md
Backups: {backupDir}/
```
## Output
- **Files**: `final-report.md`
- **State**: `status = completed`
- **Next**: Workflow complete. Return control to user.