Claude-Code-Workflow/.claude/skills/skill-iter-tune/specs/evaluation-criteria.md

# Evaluation Criteria

Quality evaluation criteria for skills, referenced by Phase 03 (Evaluate). Gemini uses these criteria to score skill artifacts across multiple dimensions.

## Dimensions

| Dimension | Weight | ID | Description |
|-----------|--------|----|-------------|
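| Clarity | 0.20 | clarity | Instructions are clear and unambiguous, well structured, and easy to follow. Phase files have explicit Step divisions and documented inputs and outputs |
| Completeness | 0.25 | completeness | Covers all required phases, edge cases, and error handling. No critical execution path is missing |
| Correctness | 0.25 | correctness | Logic is correct, data flow is consistent, and phases do not contradict each other. The State schema matches its actual usage |
| Effectiveness | 0.20 | effectiveness | Produces high-quality output for the given test scenarios. Artifacts meet user requirements and success criteria |
| Efficiency | 0.10 | efficiency | No redundant content; context is used sensibly and tokens are not wasted. Phase responsibilities are clear and do not overlap |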

## Scoring Guide

| Range | Level | Description |
|-------|-------|-------------|
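| 90-100 | Excellent | Production-grade; almost no room for improvement |
| 80-89 | Good | Ready for use; needs only minor tuning |
| 70-79 | Adequate | Functional, with clear areas for improvement |
| 60-69 | Needs Work | Significant issues that affect output quality |
| 0-59 | Poor | Fundamental structural or logical problems |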
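
The bands above can be turned into a small lookup. The following is a minimal sketch, not part of the skill itself: the function name `score_level` is hypothetical, and it assumes that fractional scores (for example 79.5) belong to the band whose lower bound they have reached, which the table does not specify.

```python
# Lower bound of each band, highest first, taken from the Scoring Guide table.
LEVELS = [
    (90, "Excellent"),
    (80, "Good"),
    (70, "Adequate"),
    (60, "Needs Work"),
    (0, "Poor"),
]


def score_level(score: float) -> str:
    """Return the level name for a score in the 0-100 range.

    Assumption: a fractional score such as 79.5 stays in the lower band
    ("Adequate") until it reaches the next lower bound.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, level in LEVELS:
        if score >= lower_bound:
            return level
    raise AssertionError("unreachable: the 0 bound matches every valid score")


print(score_level(85))  # Good
```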

## Composite Score Calculation

```
composite = sum(dimension.score * dimension.weight)
```
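
As a worked illustration of the formula, here is a minimal Python sketch. The weights come from the Dimensions table; the function name `composite_score` and the sample scores are illustrative assumptions, not part of the skill.

```python
from math import isclose

# Weights copied from the Dimensions table, keyed by dimension ID.
WEIGHTS = {
    "clarity": 0.20,
    "completeness": 0.25,
    "correctness": 0.25,
    "effectiveness": 0.20,
    "efficiency": 0.10,
}

# The weighted sum is only a 0-100 score if the weights sum to 1.0.
assert isclose(sum(WEIGHTS.values()), 1.0)


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each expected in 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())


# Hypothetical sample scores for one evaluation round.
sample = {
    "clarity": 80,
    "completeness": 70,
    "correctness": 78,
    "effectiveness": 72,
    "efficiency": 85,
}
print(round(composite_score(sample), 1))  # 75.9
```

Whether the composite is then rounded to an integer is not stated here, so the sketch returns the raw weighted sum and leaves rounding to the caller.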

## Output JSON Schema

```json
{
  "composite_score": 75.9,
  "dimensions": [
    { "name": "Clarity", "id": "clarity", "score": 80, "weight": 0.20, "feedback": "..." },
    { "name": "Completeness", "id": "completeness", "score": 70, "weight": 0.25, "feedback": "..." },
    { "name": "Correctness", "id": "correctness", "score": 78, "weight": 0.25, "feedback": "..." },
    { "name": "Effectiveness", "id": "effectiveness", "score": 72, "weight": 0.20, "feedback": "..." },
    { "name": "Efficiency", "id": "efficiency", "score": 85, "weight": 0.10, "feedback": "..." }
  ],
  "strengths": ["...", "...", "..."],
  "weaknesses": ["...", "...", "..."],
  "suggestions": [
    {
      "priority": "high",
      "target_file": "phases/02-execute.md",
      "description": "Add explicit error handling for CLI timeout",
      "rationale": "Current phase has no recovery path when CLI execution exceeds timeout",
      "code_snippet": "optional suggested replacement code"
    }
  ]
}
```
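
A consumer of this output may want to sanity-check it before acting on the suggestions. The sketch below is one possible check under stated assumptions: `validate_evaluation` is a hypothetical helper, the set of required suggestion fields treats `code_snippet` as optional (as its placeholder text suggests), and the `tolerance` of 1.0 point on the composite is an arbitrary choice that allows for rounding.

```python
import json

EXPECTED_IDS = {"clarity", "completeness", "correctness", "effectiveness", "efficiency"}
REQUIRED_SUGGESTION_KEYS = {"priority", "target_file", "description", "rationale"}


def validate_evaluation(raw: str, tolerance: float = 1.0) -> dict:
    """Parse an evaluation report and check its internal consistency.

    Assumption: `tolerance` (in score points) absorbs rounding of the composite.
    """
    data = json.loads(raw)
    dims = data["dimensions"]

    ids = {d["id"] for d in dims}
    if ids != EXPECTED_IDS:
        raise ValueError(f"unexpected dimension ids: {sorted(ids ^ EXPECTED_IDS)}")

    for d in dims:
        if not 0 <= d["score"] <= 100:
            raise ValueError(f"score out of range for {d['id']}: {d['score']}")

    # Recompute the weighted sum and compare it with the reported composite.
    expected = sum(d["score"] * d["weight"] for d in dims)
    if abs(data["composite_score"] - expected) > tolerance:
        raise ValueError(
            f"composite_score {data['composite_score']} does not match "
            f"weighted sum {expected:.1f}"
        )

    for s in data.get("suggestions", []):
        missing = REQUIRED_SUGGESTION_KEYS.difference(s)
        if missing:
            raise ValueError(f"suggestion missing keys: {sorted(missing)}")

    return data
```

Raising on the first problem keeps the sketch short; a real pipeline might instead collect all problems and feed them back to the evaluator.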

## Evaluation Focus by Iteration

| Iteration | Primary Focus |
|-----------|--------------|
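| 1 | Full evaluation; establish the baseline |
| 2-3 | Focus on whether the previous round's weaknesses have improved; avoid re-raising issues already resolved |
| 4+ | Fine-grained refinement, focusing on Effectiveness and Efficiency |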
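
If the evaluation prompt is assembled programmatically, the table above can be expressed as a small selector. This is a sketch only: the function name `evaluation_focus` and the returned wording are hypothetical, and it assumes iterations are numbered from 1.

```python
def evaluation_focus(iteration: int) -> str:
    """Return the primary focus for a 1-based iteration number."""
    if iteration < 1:
        raise ValueError(f"iteration must be >= 1, got {iteration}")
    if iteration == 1:
        return "Full evaluation; establish the baseline"
    if iteration <= 3:
        return "Check whether the previous round's weaknesses have improved"
    return "Fine-grained refinement of Effectiveness and Efficiency"


for i in (1, 2, 4):
    print(i, evaluation_focus(i))
```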