Implement phases for skill iteration tuning: Evaluation, Improvement, and Reporting

- Added Phase 3: Evaluate Quality with steps for preparing context, constructing evaluation prompts, executing evaluation via CLI, parsing scores, and checking termination conditions.
- Introduced Phase 4: Apply Improvements to implement targeted changes based on evaluation suggestions, including agent execution and change documentation.
- Created Phase 5: Final Report to generate a comprehensive report of the iteration process, including score progression and remaining weaknesses.
- Established evaluation criteria in a new document to guide the evaluation process.
- Developed templates for evaluation and execution prompts to standardize input for the evaluation and execution phases.
catlog22
2026-03-10 21:42:58 +08:00
parent b4ad8c7b80
commit 9fb13ed6b0
10 changed files with 1777 additions and 0 deletions

@@ -0,0 +1,134 @@
# Evaluation Prompt Template
Phase 03 uses this template to construct the ccw cli prompt so that Gemini evaluates skill quality across multiple dimensions.
## Template
```
PURPOSE: Evaluate the quality of a workflow skill by examining both its definition files and the artifacts it produced when executed against a test scenario. Provide a structured multi-dimensional score with actionable improvement suggestions.
SKILL DEFINITION:
${skillContent}
TEST SCENARIO:
${testScenario.description}
Requirements: ${testScenario.requirements}
Success Criteria: ${testScenario.success_criteria}
ARTIFACTS PRODUCED:
${artifactsSummary}
EVALUATION CRITERIA:
${evaluationCriteria}
${previousEvalContext}
${chainContext}
TASK:
1. Read all skill definition files and produced artifacts carefully
2. Score each dimension on 0-100 based on the evaluation criteria:
- Clarity (weight 0.20): Instructions unambiguous, well-structured, easy to follow
- Completeness (weight 0.25): All phases, edge cases, error handling covered
- Correctness (weight 0.25): Logic sound, data flow consistent, no contradictions
- Effectiveness (weight 0.20): Produces high-quality output for the test scenario
- Efficiency (weight 0.10): Minimal redundancy, appropriate context usage
3. Calculate weighted composite score
4. List top 3 strengths
5. List top 3-5 weaknesses with specific file:section references
6. Provide 3-5 prioritized improvement suggestions with concrete changes
MODE: analysis
EXPECTED OUTPUT FORMAT (strict JSON, no markdown wrapping):
{
"composite_score": <number 0-100>,
"dimensions": [
{ "name": "Clarity", "id": "clarity", "score": <0-100>, "weight": 0.20, "feedback": "<specific feedback>" },
{ "name": "Completeness", "id": "completeness", "score": <0-100>, "weight": 0.25, "feedback": "<specific feedback>" },
{ "name": "Correctness", "id": "correctness", "score": <0-100>, "weight": 0.25, "feedback": "<specific feedback>" },
{ "name": "Effectiveness", "id": "effectiveness", "score": <0-100>, "weight": 0.20, "feedback": "<specific feedback>" },
{ "name": "Efficiency", "id": "efficiency", "score": <0-100>, "weight": 0.10, "feedback": "<specific feedback>" }
],
"strengths": ["<strength 1>", "<strength 2>", "<strength 3>"],
"weaknesses": ["<weakness 1 with file:section reference>", "..."],
"suggestions": [
{
"priority": "high|medium|low",
"target_file": "<relative path to skill file>",
"description": "<what to change>",
"rationale": "<why this improves quality>",
"code_snippet": "<optional: suggested replacement content>"
}
],
"chain_scores": {
"<skill_name>": "<number 0-100, per-skill score — only present in chain mode>"
}
}
CONSTRAINTS:
- Be rigorous and specific — reference exact file paths and sections
- Each suggestion MUST include a target_file that maps to a skill file
- Focus suggestions on highest-impact changes first
- Do NOT suggest changes already addressed in previous iterations
- Output ONLY the JSON object, no surrounding text or markdown
```
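The weighted composite requested in TASK step 3 can be recomputed locally to cross-check the value the model reports. The following is a minimal sketch, assuming the parsed response follows the EXPECTED OUTPUT FORMAT above; `computeCompositeScore`, the tolerance, and the sample scores are hypothetical and not part of the skill.

```javascript
// Hypothetical helper (not part of the skill): recompute the weighted
// composite from the "dimensions" array of a parsed evaluation response.
function computeCompositeScore(dimensions) {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  // The template's weights sum to 1.00; reject a response where they do not.
  if (Math.abs(totalWeight - 1) > 1e-6) {
    throw new Error(`Dimension weights sum to ${totalWeight}, expected 1`);
  }
  const weighted = dimensions.reduce((sum, d) => sum + d.score * d.weight, 0);
  // Round to two decimals to hide floating-point noise.
  return Math.round(weighted * 100) / 100;
}

// Illustrative scores only; the weights are the ones listed in the template.
const sample = [
  { id: 'clarity', score: 80, weight: 0.20 },
  { id: 'completeness', score: 70, weight: 0.25 },
  { id: 'correctness', score: 90, weight: 0.25 },
  { id: 'effectiveness', score: 60, weight: 0.20 },
  { id: 'efficiency', score: 50, weight: 0.10 },
];
const composite = computeCompositeScore(sample); // 16 + 17.5 + 22.5 + 12 + 5 = 73
```

Comparing this value with the reported `composite_score` is one way to detect an arithmetic slip in the model output before the score is used for termination checks.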
## Variable Substitution
| Variable | Source | Description |
|----------|--------|-------------|
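| `${skillContent}` | Same as execute-prompt.md | Full content of the skill files |
| `${testScenario.*}` | iteration-state.json | Test scenario information |
| `${artifactsSummary}` | Phase 03 reads artifacts/ dir | List of produced artifact files plus content summaries |
| `${evaluationCriteria}` | specs/evaluation-criteria.md | Full text of the scoring criteria |
| `${previousEvalContext}` | Prior iteration records | Summary of earlier evaluation rounds (to avoid repeated suggestions) |
| `${chainContext}` | Phase 03 constructs | Chain context information in chain mode |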
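If the template is stored as plain text rather than as a JavaScript template literal, the placeholders in the table above have to be resolved explicitly. The sketch below shows one way to do that, including dotted paths such as `testScenario.description`. `fillTemplate` and the sample values are hypothetical; the source does not say how substitution is actually performed.

```javascript
// Hypothetical helper (an assumption, not the skill's own mechanism):
// replace ${a.b} placeholders in a plain-text template with values from `vars`.
function fillTemplate(template, vars) {
  return template.replace(/\$\{([\w.]+)\}/g, (match, path) => {
    // Walk the dotted path, stopping early if an intermediate value is missing.
    const value = path.split('.').reduce(
      (obj, key) => (obj == null ? undefined : obj[key]),
      vars
    );
    // Empty strings are allowed (optional contexts); undefined is an error.
    if (value === undefined) {
      throw new Error(`No value for placeholder ${match}`);
    }
    return Array.isArray(value) ? value.join('; ') : String(value);
  });
}

// Single-quoted string, so ${...} stays literal until fillTemplate runs.
const filled = fillTemplate(
  'TEST SCENARIO:\n${testScenario.description}\nRequirements: ${testScenario.requirements}\n${previousEvalContext}',
  {
    testScenario: {
      description: 'Plan a small refactor',
      requirements: ['clear steps', 'no data loss'],
    },
    previousEvalContext: '',
  }
);
```

Throwing on an unresolved placeholder, instead of silently leaving it in the prompt, keeps a missing variable from reaching the evaluator unnoticed.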
## previousEvalContext Construction
```javascript
// Build context from prior iterations to avoid repeating suggestions
const previousEvalContext = state.iterations.length > 0
? `PREVIOUS ITERATIONS (context for avoiding duplicate suggestions):
${state.iterations.map(iter => `
Iteration ${iter.round}: Score ${iter.evaluation?.score ?? 'N/A'}
Applied changes: ${iter.improvement?.changes_applied?.map(c => c.summary).join('; ') || 'none'}
Remaining weaknesses: ${iter.evaluation?.weaknesses?.slice(0, 3).join('; ') || 'none'}
`).join('')}
IMPORTANT: Focus on NEW issues or issues NOT adequately addressed in previous improvements.`
: '';
```
## chainContext Construction
```javascript
// Build chain context for evaluation (chain mode only)
const chainContext = state.execution_mode === 'chain'
? `CHAIN CONTEXT:
This skill chain contains ${state.chain_order.length} skills executed in order:
${state.chain_order.map((s, i) => `${i+1}. ${s}`).join('\n')}
Current evaluation covers the entire chain output.
Please provide per-skill quality scores in an additional "chain_scores" field.`
: '';
```
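Because the chain context only asks the evaluator for per-skill scores, the response may still omit some of them. A minimal sketch of a follow-up check is shown here, assuming the parsed response carries the `chain_scores` object from the output format; `findMissingChainScores` and the skill names are hypothetical.

```javascript
// Hypothetical check (not part of the skill): list the skills in chain_order
// that lack a usable 0-100 score in the evaluator's chain_scores object.
function findMissingChainScores(chainOrder, evaluation) {
  const scores = evaluation.chain_scores ?? {};
  return chainOrder.filter(name => {
    const raw = scores[name];
    if (raw == null || raw === '') return true; // absent or empty
    const score = Number(raw); // tolerate numbers returned as strings
    return !Number.isFinite(score) || score < 0 || score > 100;
  });
}

// Illustrative data only; the skill names are made up.
const missingScores = findMissingChainScores(
  ['skill-a', 'skill-b', 'skill-c'],
  { chain_scores: { 'skill-a': 82, 'skill-b': 'n/a' } }
);
```

A non-empty result could be logged or used to trigger a retry; which of these is appropriate depends on how Phase 03 handles partial evaluations.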
## artifactsSummary Construction
```javascript
// Read manifest.json if available, otherwise list files
const manifestPath = `${iterDir}/artifacts/manifest.json`;
let artifactsSummary;
if (fileExists(manifestPath)) {
  const manifest = JSON.parse(Read(manifestPath));
  artifactsSummary = manifest.artifacts.map(a =>
    `- ${a.path}: ${a.description} (Phase ${a.phase})`
  ).join('\n');
} else {
  // Fallback: list all files with first 200 lines each
  const files = Glob(`${iterDir}/artifacts/**/*`);
  artifactsSummary = files.map(f => {
    const content = Read(f, { limit: 200 });
    return `--- ${f.replace(iterDir + '/artifacts/', '')} ---\n${content}`;
  }).join('\n\n');
}
```

@@ -0,0 +1,97 @@
# Execute Prompt Template
Phase 02 uses this template to construct the ccw cli prompt so that Claude simulates executing the skill and produces all expected artifacts.
## Template
```
PURPOSE: Simulate executing the following workflow skill against a test scenario. Produce all expected output artifacts as if the skill were invoked with the given input. This is for evaluating skill quality.
SKILL CONTENT:
${skillContent}
TEST SCENARIO:
Description: ${testScenario.description}
Input Arguments: ${testScenario.input_args}
Requirements: ${testScenario.requirements}
Success Criteria: ${testScenario.success_criteria}
TASK:
1. Study the complete skill structure (SKILL.md + all phase files)
2. Follow the skill's execution flow sequentially (Phase 1 → Phase N)
3. For each phase, produce the artifacts that phase would generate
4. Write all output artifacts to the current working directory
5. Create a manifest.json listing all produced artifacts with descriptions
MODE: write
CONTEXT: @**/*
EXPECTED:
- All artifacts the skill would produce for this test scenario
- Each artifact in its correct relative path
- A manifest.json at root: { "artifacts": [{ "path": "...", "description": "...", "phase": N }] }
CONSTRAINTS:
- Follow the skill execution flow exactly — do not skip or reorder phases
- Produce realistic, high-quality output (not placeholder content)
- If the skill requires user interaction (AskUserQuestion), use reasonable defaults
- If the skill invokes external tools/CLIs, document what would be called but produce expected output directly
```
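The manifest shape named under EXPECTED can be checked before the evaluation phase relies on it. The sketch below is one possible validator, assuming the manifest has already been parsed from JSON; `validateManifest` and the sample entries are hypothetical.

```javascript
// Hypothetical validator (not part of the skill) for the manifest shape
// { "artifacts": [{ "path": "...", "description": "...", "phase": N }] }.
function validateManifest(manifest) {
  if (!manifest || !Array.isArray(manifest.artifacts)) {
    return ['manifest.artifacts must be an array'];
  }
  const errors = [];
  manifest.artifacts.forEach((a, i) => {
    if (a === null || typeof a !== 'object') {
      errors.push(`artifacts[${i}] must be an object`);
      return;
    }
    if (typeof a.path !== 'string' || a.path.length === 0) {
      errors.push(`artifacts[${i}].path must be a non-empty string`);
    }
    if (typeof a.description !== 'string') {
      errors.push(`artifacts[${i}].description must be a string`);
    }
    if (!Number.isInteger(a.phase)) {
      errors.push(`artifacts[${i}].phase must be an integer`);
    }
  });
  return errors;
}

// Illustrative manifests only.
const goodManifestErrors = validateManifest({
  artifacts: [{ path: 'plan.md', description: 'Generated plan', phase: 1 }],
});
const badManifestErrors = validateManifest({
  artifacts: [{ path: '', description: 'Missing path and phase' }],
});
```

An empty array means the manifest is usable; otherwise the caller could fall back to listing the artifact files directly.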
## Variable Substitution
| Variable | Source | Description |
|----------|--------|-------------|
| `${skillContent}` | Phase 02 reads all skill files | Full SKILL.md plus phase file contents, separated by markdown headers |
| `${testScenario.description}` | iteration-state.json | Test scenario as described by the user |
| `${testScenario.input_args}` | iteration-state.json | Arguments passed to the skill in the simulation |
| `${testScenario.requirements}` | iteration-state.json | List of quality requirements |
| `${testScenario.success_criteria}` | iteration-state.json | Definition of the success criteria |
## Chain Mode Extension
When running in chain mode, the template is invoked once per skill in `chain_order`. Each invocation includes:
### Additional Variable
| Variable | Source | Description |
|----------|--------|-------------|
| `${previousChainOutput}` | Phase 02 chain loop | Artifacts summary from the preceding skill (chain mode, skills after the first only) |
### Chain Prompt Modification
When `execution_mode === 'chain'`, the prompt includes:
```
PREVIOUS CHAIN OUTPUT (from upstream skill "${previousSkillName}"):
${previousChainOutput}
IMPORTANT: Use the above output as input context for this skill's execution.
```
This section is only added for skills at position 2+ in the chain. The first skill in the chain receives no upstream context.
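The position rule above can be sketched as a small helper that returns the upstream section for skills at position 2+ and an empty string for the first skill. `buildChainSection`, its parameters, and the skill names are hypothetical; the actual Phase 02 chain loop may assemble the prompt differently.

```javascript
// Hypothetical helper (not part of the skill): build the upstream-context
// section for the skill at `index` (0-based) in chainOrder.
function buildChainSection(chainOrder, index, previousChainOutput) {
  // The first skill in the chain receives no upstream context.
  if (index === 0) return '';
  const previousSkillName = chainOrder[index - 1];
  return [
    `PREVIOUS CHAIN OUTPUT (from upstream skill "${previousSkillName}"):`,
    previousChainOutput,
    `IMPORTANT: Use the above output as input context for this skill's execution.`,
  ].join('\n');
}

// Illustrative chain only; the skill names are made up.
const chain = ['skill-a', 'skill-b'];
const firstSection = buildChainSection(chain, 0, '');
const secondSection = buildChainSection(chain, 1, '- plan.md: Generated plan (Phase 1)');
```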
## skillContent Construction
```javascript
// Read only executable skill files and format with consistent headers
const skillMd = Read(`${skill.path}/SKILL.md`);
const phaseFiles = Glob(`${skill.path}/phases/*.md`).map(f => ({
  relativePath: f.replace(skill.path + '/', ''),
  content: Read(f)
}));
const specFiles = Glob(`${skill.path}/specs/*.md`).map(f => ({
  relativePath: f.replace(skill.path + '/', ''),
  content: Read(f)
}));
const skillContent = `
### File: SKILL.md
${skillMd}
${phaseFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n')}
${specFiles.length > 0 ? specFiles.map(f => `### File: ${f.relativePath}\n${f.content}`).join('\n\n') : ''}
`.trim();
```