Claude-Code-Workflow/.claude/skills/skill-iter-tune/specs/evaluation-criteria.md
catlog22 9fb13ed6b0 Implement phases for skill iteration tuning: Evaluation, Improvement, and Reporting
- Added Phase 3: Evaluate Quality with steps for preparing context, constructing evaluation prompts, executing evaluation via CLI, parsing scores, and checking termination conditions.
- Introduced Phase 4: Apply Improvements to implement targeted changes based on evaluation suggestions, including agent execution and change documentation.
- Created Phase 5: Final Report to generate a comprehensive report of the iteration process, including score progression and remaining weaknesses.
- Established evaluation criteria in a new document to guide the evaluation process.
- Developed templates for evaluation and execution prompts to standardize input for the evaluation and execution phases.
2026-03-10 21:42:58 +08:00

Evaluation Criteria

Quality evaluation criteria for skills, referenced by Phase 03 (Evaluate). Gemini uses these criteria to score a skill's output artifacts across multiple dimensions.

Dimensions

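| Dimension | Weight | ID | Description |
|-----------|--------|----|-------------|
| Clarity | 0.20 | clarity | Instructions are clear, unambiguous, well structured, and easy to follow. Phase files have explicit Step divisions and input/output descriptions |
| Completeness | 0.25 | completeness | Covers all necessary phases, edge cases, and error handling. No critical execution path is missing |
| Correctness | 0.25 | correctness | Logic is correct, data flow is consistent, and phases do not contradict each other. The state schema matches actual usage |
| Effectiveness | 0.20 | effectiveness | Produces high-quality output for the given test scenarios. Artifacts meet user requirements and success criteria |
| Efficiency | 0.10 | efficiency | No redundant content, reasonable context usage, no wasted tokens. Phase responsibilities are clear and do not overlap |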

Scoring Guide

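| Range | Level | Description |
|-------|-------|-------------|
| 90-100 | Excellent | Production grade, almost no room for improvement |
| 80-89 | Good | Ready for use, needs only minor tuning |
| 70-79 | Adequate | Functional, with clear areas for improvement |
| 60-69 | Needs Work | Significant issues that affect output quality |
| 0-59 | Poor | Fundamental structural or logical problems |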
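The ranges in the Scoring Guide can be turned into a small lookup. The following is a minimal sketch, not part of the skill itself: the names `SCORING_GUIDE` and `level_for` are hypothetical, and it assumes that fractional scores fall into the band of their lower bound (so 89.5 counts as Good), which the guide does not specify.

```python
# Hypothetical helper: maps a 0-100 score to a level name from the Scoring Guide.
# Each entry is (inclusive lower bound, level), ordered from highest to lowest.
SCORING_GUIDE = [
    (90, "Excellent"),
    (80, "Good"),
    (70, "Adequate"),
    (60, "Needs Work"),
    (0, "Poor"),
]


def level_for(score: float) -> str:
    """Return the Scoring Guide level for a score in [0, 100]."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    for lower_bound, level in SCORING_GUIDE:
        if score >= lower_bound:
            return level
    # Unreachable: the last band starts at 0 and the range was checked above.
    raise AssertionError("no matching band")
```

Keeping the bands in a single ordered list means a change to the guide only needs to be made in one place.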

Composite Score Calculation

composite = sum(dimension.score * dimension.weight)
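The weighted sum above could be computed as in the sketch below. The weights are copied from the Dimensions table; the names `WEIGHTS` and `composite_score` are illustrative assumptions, and no rounding is applied because the formula does not define any.

```python
from typing import Mapping

# Weights taken from the Dimensions table, keyed by dimension ID.
WEIGHTS = {
    "clarity": 0.20,
    "completeness": 0.25,
    "correctness": 0.25,
    "effectiveness": 0.20,
    "efficiency": 0.10,
}


def composite_score(scores: Mapping[str, float]) -> float:
    """Weighted sum of per-dimension scores (hypothetical helper)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    # Extra keys in `scores` are ignored; only the known dimensions count.
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())


if __name__ == "__main__":
    example = {
        "clarity": 80,
        "completeness": 70,
        "correctness": 78,
        "effectiveness": 72,
        "efficiency": 85,
    }
    print(composite_score(example))
```

Because the weights sum to 1.0, the composite stays on the same 0-100 scale as the individual dimension scores.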

Output JSON Schema

{
  "composite_score": 75.9,
  "dimensions": [
    { "name": "Clarity", "id": "clarity", "score": 80, "weight": 0.20, "feedback": "..." },
    { "name": "Completeness", "id": "completeness", "score": 70, "weight": 0.25, "feedback": "..." },
    { "name": "Correctness", "id": "correctness", "score": 78, "weight": 0.25, "feedback": "..." },
    { "name": "Effectiveness", "id": "effectiveness", "score": 72, "weight": 0.20, "feedback": "..." },
    { "name": "Efficiency", "id": "efficiency", "score": 85, "weight": 0.10, "feedback": "..." }
  ],
  "strengths": ["...", "...", "..."],
  "weaknesses": ["...", "...", "..."],
  "suggestions": [
    {
      "priority": "high",
      "target_file": "phases/02-execute.md",
      "description": "Add explicit error handling for CLI timeout",
      "rationale": "Current phase has no recovery path when CLI execution exceeds timeout",
      "code_snippet": "optional suggested replacement code"
    }
  ]
}
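A consumer of this output might validate it before using the scores. The sketch below is one possible approach, assuming every top-level key in the example is required and that each of the five dimension IDs appears exactly once; the schema above does not say which fields are mandatory, and `parse_evaluation` and `high_priority_suggestions` are hypothetical names.

```python
import json

# Assumption: every top-level key shown in the example is required.
REQUIRED_KEYS = ("composite_score", "dimensions", "strengths", "weaknesses", "suggestions")
EXPECTED_IDS = {"clarity", "completeness", "correctness", "effectiveness", "efficiency"}


def parse_evaluation(raw: str) -> dict:
    """Parse the evaluator's JSON output and run basic structural checks."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    ids = [dim["id"] for dim in data["dimensions"]]
    if len(ids) != len(EXPECTED_IDS) or set(ids) != EXPECTED_IDS:
        raise ValueError(f"unexpected dimension ids: {ids}")
    for dim in data["dimensions"]:
        if not 0 <= dim["score"] <= 100:
            raise ValueError(f"score out of range for {dim['id']}: {dim['score']}")
    return data


def high_priority_suggestions(data: dict) -> list:
    """Return the suggestions whose priority is "high", in their original order."""
    return [s for s in data["suggestions"] if s.get("priority") == "high"]
```

Only the `"high"` priority value appears in the example, so the filter does not assume any other priority names.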

Evaluation Focus by Iteration

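| Iteration | Primary Focus |
|-----------|---------------|
| 1 | Full evaluation to establish a baseline |
| 2-3 | Check whether the weaknesses from the previous round have improved; avoid re-raising issues that are already resolved |
| 4+ | Fine-grained refinement, focusing on Effectiveness and Efficiency |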
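The per-iteration focus could be selected programmatically when building the evaluation prompt. This is a sketch only: the function name `focus_for_iteration` and the returned labels are hypothetical and are not identifiers used by the skill.

```python
def focus_for_iteration(iteration: int) -> str:
    """Map a 1-based iteration number to a focus label (labels are illustrative)."""
    if iteration < 1:
        raise ValueError(f"iteration must be >= 1, got {iteration}")
    if iteration == 1:
        # First round: evaluate everything to establish a baseline.
        return "baseline"
    if iteration <= 3:
        # Rounds 2-3: check whether the previous weaknesses were addressed.
        return "verify_previous_weaknesses"
    # Round 4 onward: refine Effectiveness and Efficiency.
    return "refine_effectiveness_efficiency"
```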