Implement phases for skill iteration tuning: Evaluation, Improvement, and Reporting

- Added Phase 3: Evaluate Quality, with steps for preparing context, constructing evaluation prompts, executing evaluation via CLI, parsing scores, and checking termination conditions.
- Introduced Phase 4: Apply Improvements, which implements targeted changes based on evaluation suggestions, including agent execution and change documentation.
- Created Phase 5: Final Report, which generates a comprehensive report of the iteration process, including score progression and remaining weaknesses.
- Established evaluation criteria in a new document to guide the evaluation process.
- Developed templates for evaluation and execution prompts to standardize input for the evaluation and execution phases.
2026-03-11 17:21:03 +08:00 · 2026-03-10 21:42:58 +08:00
parent b4ad8c7b80
commit 9fb13ed6b0
10 changed files with 1777 additions and 0 deletions
--- a/.claude/skills/skill-iter-tune/specs/evaluation-criteria.md
+++ b/.claude/skills/skill-iter-tune/specs/evaluation-criteria.md
@@ -0,0 +1,63 @@
+# Evaluation Criteria
+
+Skill quality evaluation criteria, referenced by Phase 03 (Evaluate). Gemini uses these criteria to score skill artifacts across multiple dimensions.
+
+## Dimensions
+
+| Dimension | Weight | ID | Description |
+|-----------|--------|----|-------------|
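+| Clarity | 0.20 | clarity | Instructions are clear, unambiguous, well structured, and easy to follow. Phase files have explicit Step divisions and input/output descriptions |
+| Completeness | 0.25 | completeness | Covers all necessary phases, edge cases, and error handling. No critical execution path is missing |
+| Correctness | 0.25 | correctness | Logic is correct, data flow is consistent, and phases do not contradict each other. The state schema matches actual usage |
+| Effectiveness | 0.20 | effectiveness | Produces high-quality output for the given test scenario. Artifacts meet user needs and success criteria |
+| Efficiency | 0.10 | efficiency | No redundant content; context usage is reasonable and does not waste tokens. Phase responsibilities are clear and do not overlap |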
+
+## Scoring Guide
+
+| Range | Level | Description |
+|-------|-------|-------------|
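+| 90-100 | Excellent | Production grade, with almost no room for improvement |
+| 80-89 | Good | Ready for use; needs only minor tuning |
+| 70-79 | Adequate | Functionally usable, with obvious areas for improvement |
+| 60-69 | Needs Work | Significant issues that affect output quality |
+| 0-59 | Poor | Fundamental problems in structure or logic |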
+
+## Composite Score Calculation
+
+```
+composite = sum(dimension.score * dimension.weight)
+```
+
+## Output JSON Schema
+
+```json
+{
+  "composite_score": 75,
+  "dimensions": [
+    { "name": "Clarity", "id": "clarity", "score": 80, "weight": 0.20, "feedback": "..." },
+    { "name": "Completeness", "id": "completeness", "score": 70, "weight": 0.25, "feedback": "..." },
+    { "name": "Correctness", "id": "correctness", "score": 78, "weight": 0.25, "feedback": "..." },
+    { "name": "Effectiveness", "id": "effectiveness", "score": 72, "weight": 0.20, "feedback": "..." },
+    { "name": "Efficiency", "id": "efficiency", "score": 85, "weight": 0.10, "feedback": "..." }
+  ],
+  "strengths": ["...", "...", "..."],
+  "weaknesses": ["...", "...", "..."],
+  "suggestions": [
+    {
+      "priority": "high",
+      "target_file": "phases/02-execute.md",
+      "description": "Add explicit error handling for CLI timeout",
+      "rationale": "Current phase has no recovery path when CLI execution exceeds timeout",
+      "code_snippet": "optional suggested replacement code"
+    }
+  ]
+}
+```
+
+## Evaluation Focus by Iteration
+
+| Iteration | Primary Focus |
+|-----------|--------------|
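+| 1 | Comprehensive evaluation; establish the baseline |
+| 2-3 | Focus on whether the previous round's weaknesses have improved; avoid repeating issues already resolved |
+| 4+ | Fine-grained refinement; focus on Effectiveness and Efficiency |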
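
Review note (not part of the commit): the composite formula and the Scoring Guide in evaluation-criteria.md can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation. The names `DimensionScore`, `composite_score`, and `level_for` are hypothetical, and the check that weights sum to 1.0 is an added assumption. Fractional scores are assigned to the band whose lower bound they meet, which the source does not specify. With the scores from the Output JSON Schema example, the weighted sum comes to 75.9, while the example shows `"composite_score": 75`; the source does not state a rounding convention, so the sketch returns the unrounded value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One entry of the `dimensions` array (hypothetical helper type)."""
    id: str
    score: float   # 0-100
    weight: float  # fraction of the composite


# (lower bound, level) pairs taken from the Scoring Guide, highest first.
LEVELS = [(90, "Excellent"), (80, "Good"), (70, "Adequate"),
          (60, "Needs Work"), (0, "Poor")]


def composite_score(dimensions):
    """composite = sum(dimension.score * dimension.weight)."""
    total_weight = sum(d.weight for d in dimensions)
    # Assumption: weights should sum to 1.0, as they do in the Dimensions table.
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(d.score * d.weight for d in dimensions)


def level_for(score):
    """Map a 0-100 score to its Scoring Guide level."""
    for lower_bound, level in LEVELS:
        if score >= lower_bound:
            return level
    raise ValueError(f"score {score} is below 0")


# Scores copied from the Output JSON Schema example.
example = [
    DimensionScore("clarity", 80, 0.20),
    DimensionScore("completeness", 70, 0.25),
    DimensionScore("correctness", 78, 0.25),
    DimensionScore("effectiveness", 72, 0.20),
    DimensionScore("efficiency", 85, 0.10),
]

if __name__ == "__main__":
    total = composite_score(example)
    print(round(total, 1), level_for(total))
```

Keeping the level boundaries in a single ordered list means that a change to the Scoring Guide only needs to be mirrored in one place.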