feat: add lite-fix command design for bug diagnosis and emergency fixes

Introduces /workflow:lite-fix - a lightweight bug fixing workflow optimized for rapid diagnosis, targeted fixes, and streamlined verification. Command Design: - Three severity modes: Regular (2-4h), Critical (30-60min), Hotfix (15-30min) - Six-phase execution: Diagnosis → Impact → Planning → Verification → Confirmation → Execution - Intelligent code search: cli-explore-agent (regular) → direct search (critical) → minimal (hotfix) - Risk-aware verification: Full test suite → Focused tests → Smoke tests Key Features: - Structured root cause analysis (file:line, reproduction steps, blame info) - Quantitative impact assessment (risk score 0-10, user/business impact) - Multi-strategy fix planning (immediate patch vs comprehensive refactor) - Adaptive branch strategy (feature branch vs hotfix branch from production tag) - Automatic follow-up task generation for hotfixes (tech debt management) - Real-time deployment monitoring with auto-rollback triggers Integration: - Complements /workflow:lite-plan (fix vs feature development) - Reuses /workflow:lite-execute for execution layer - Integrates with /cli:mode:bug-diagnosis for preliminary analysis - Escalation path to /workflow:plan for complex refactors Design Documents: - .claude/commands/workflow/lite-fix.md - Complete command specification - LITE_FIX_DESIGN.md - Architecture design and decision records Addresses: PLANNING_GAP_ANALYSIS.md Scenario #8 (Emergency Fix) Expected Impact: - Reduce bug fix time by 50-70% - Improve diagnosis accuracy to 85%+ - Reduce production hotfix risks - Systematize technical debt from quick fixes
2026-02-05 01:50:27 +08:00 · 2025-11-20 09:21:26 +00:00
parent 2fb1015038
commit 38f2355573
2 changed files with 1710 additions and 0 deletions
--- a/.claude/commands/workflow/lite-fix.md
+++ b/.claude/commands/workflow/lite-fix.md
--- a/LITE_FIX_DESIGN.md
+++ b/LITE_FIX_DESIGN.md
@@ -0,0 +1,549 @@
+# Lite-Fix Command Design Document
+
+**Date**: 2025-11-20
+**Version**: 1.0.0
+**Status**: Design Proposal
+**Related**: PLANNING_GAP_ANALYSIS.md (Scenario #8: 紧急修复场景)
+
+---
+
+## 设计概述
+
+`/workflow:lite-fix` 是一个轻量级的bug诊断和修复工作流命令，填补了当前planning系统在紧急修复场景的空白。设计参考了 `/workflow:lite-plan` 的成功模式，针对bug修复场景进行优化。
+
+### 核心设计理念
+
+1. **快速响应** - 支持15分钟到4小时的修复周期
+2. **风险感知** - 根据严重程度调整流程复杂度
+3. **渐进式验证** - 从smoke test到全量测试的灵活策略
+4. **自动化跟进** - Hotfix模式自动生成完善任务
+
+---
+
+## 设计对比：lite-fix vs lite-plan
+
+| 维度 | lite-plan | lite-fix | 设计理由 |
+|------|-----------|----------|----------|
+| **目标场景** | 新功能开发 | Bug修复 | 不同的开发意图 |
+| **时间预算** | 1-6小时 | 15分钟-4小时 | Bug修复更紧迫 |
+| **探索阶段** | 可选（`-e` flag） | 必需但可简化 | Bug需要诊断 |
+| **输出类型** | 实现计划 | 诊断+修复计划 | Bug需要根因分析 |
+| **验证策略** | 完整测试套件 | 分级验证（Smoke/Focused/Full） | 风险vs速度权衡 |
+| **分支策略** | Feature分支 | Feature/Hotfix分支 | 生产环境需要特殊处理 |
+| **跟进机制** | 无 | Hotfix自动生成跟进任务 | 技术债务管理 |
+
+---
+
+## 三种严重度模式设计
+
+### Mode 1: Regular (默认)
+
+**适用场景**：
+- 非阻塞性bug
+- 影响<20%用户
+- 有充足时间（2-4小时）
+
+**流程特点**：
+```
+完整诊断 → 多策略评估 → 全量测试 → 标准分支
+```
+
+**示例用例**：
+```bash
+/workflow:lite-fix "用户头像上传失败，返回413错误"
+```
+
+### Mode 2: Critical (`--critical`)
+
+**适用场景**：
+- 影响核心功能
+- 影响20-50%用户
+- 需要1小时内修复
+
+**流程简化**：
+```
+聚焦诊断 → 单一最佳策略 → 关键场景测试 → 快速分支
+```
+
+**示例用例**：
+```bash
+/workflow:lite-fix --critical "购物车结算时随机丢失商品"
+```
+
+### Mode 3: Hotfix (`--hotfix`)
+
+**适用场景**：
+- 生产完全故障
+- 影响100%用户或业务中断
+- 需要15-30分钟修复
+
+**流程最简**：
+```
+最小诊断 → 外科手术式修复 → Smoke测试 → Hotfix分支 → 自动跟进任务
+```
+
+**示例用例**：
+```bash
+/workflow:lite-fix --hotfix --incident INC-2024-1015 "支付网关5xx错误"
+```
+
+---
+
+## 六阶段执行流程设计
+
+### Phase 1: Diagnosis & Root Cause (诊断)
+
+**设计亮点**：
+- **智能搜索策略**：Regular使用cli-explore-agent，Critical使用直接搜索，Hotfix使用已知信息
+- **结构化输出**：root_cause对象包含file、line_range、issue、introduced_by
+- **可复现性验证**：输出reproduction_steps供验证
+
+**技术实现**：
+```javascript
+if (mode === "regular") {
+  // 使用cli-explore-agent深度探索
+  Task(subagent_type="cli-explore-agent", ...)
+} else if (mode === "critical") {
+  // 直接使用grep + git blame
+  Bash("grep -r '${error}' src/ | head -10")
+} else {
+  // 假设已知问题，跳过探索
+  Read(suspected_file)
+}
+```
+
+### Phase 2: Impact Assessment (影响评估)
+
+**设计亮点**：
+- **量化风险评分**：`risk_score = user_impact×0.4 + system_risk×0.3 + business_impact×0.3`
+- **自动严重度建议**：根据评分建议使用`--critical`或`--hotfix`
+- **业务影响分析**：包括revenue、reputation、SLA breach
+
+**输出示例**：
+```markdown
+## Impact Assessment
+**Risk Level**: HIGH (7.1/10)
+**Affected Users**: ~5000 (100%)
+**Business Impact**: Revenue (Medium), Reputation (High), SLA Breached
+**Recommended Severity**: --critical flag suggested
+```
+
+### Phase 3: Fix Planning (修复规划)
+
+**设计亮点**：
+- **多策略生成**（Regular）：immediate_patch vs comprehensive_refactor
+- **单一最佳策略**（Critical/Hotfix）：surgical_fix
+- **复杂度自适应**：低复杂度用Claude直接规划，中等复杂度用cli-lite-planning-agent
+
+**升级路径**：
+```javascript
+if (complexity > threshold) {
+  suggest("/workflow:plan --mode bugfix")
+}
+```
+
+### Phase 4: Verification Strategy (验证策略)
+
+**三级验证设计**：
+
+| Level | Test Scope | Duration | Pass Criteria |
+|-------|------------|----------|---------------|
+| Smoke | 核心路径 | 2-5分钟 | 无核心功能回归 |
+| Focused | 受影响模块 | 5-10分钟 | 关键场景通过 |
+| Comprehensive | 完整套件 | 10-20分钟 | 全部测试通过 |
+
+**分支策略差异**：
+```javascript
+if (mode === "hotfix") {
+  branch = {
+    type: "hotfix_branch",
+    base: "production_tag_v2.3.1",  // ⚠️ 从生产tag创建
+    merge_target: ["main", "production"]  // 双向合并
+  }
+}
+```
+
+### Phase 5: User Confirmation (用户确认)
+
+**多维度确认设计**：
+
+**Regular**: 4维度
+1. Fix strategy (Proceed/Modify/Escalate)
+2. Execution method (Agent/CLI/Manual)
+3. Verification level (Full/Focused/Smoke)
+4. Code review (Gemini/Skip)
+
+**Critical**: 3维度（跳过code review）
+
+**Hotfix**: 2维度（最小确认）
+1. Deploy confirmation (Deploy/Stage First/Abort)
+2. Post-deployment monitoring (Real-time/Passive)
+
+### Phase 6: Execution & Follow-up (执行和跟进)
+
+**设计亮点**：
+- **统一执行接口**：通过`/workflow:lite-execute --in-memory --mode bugfix`执行
+- **自动跟进任务**（Hotfix专属）：
+  ```json
+  [
+    {"id": "FOLLOWUP-comprehensive", "title": "完善修复", "due": "3天"},
+    {"id": "FOLLOWUP-postmortem", "title": "事后分析", "due": "1周"}
+  ]
+  ```
+- **实时监控**（可选）：15分钟监控窗口，自动回滚触发器
+
+---
+
+## 与现有系统集成
+
+### 1. 命令集成路径
+
+```mermaid
+graph TD
+    A[Bug发现] --> B{严重程度?}
+    B -->|不确定| C[/cli:mode:bug-diagnosis]
+    C --> D[/workflow:lite-fix]
+
+    B -->|一般| E[/workflow:lite-fix]
+    B -->|紧急| F[/workflow:lite-fix --critical]
+    B -->|生产故障| G[/workflow:lite-fix --hotfix]
+
+    E --> H{复杂度?}
+    F --> H
+    G --> I[lite-execute]
+
+    H -->|简单| I
+    H -->|复杂| J[建议升级到 /workflow:plan]
+
+    I --> K[修复完成]
+    K --> L[/workflow:review --type quality]
+```
+
+### 2. 数据流设计
+
+**输入**：
+```javascript
+{
+  bug_description: string,
+  severity_flags: "--critical" | "--hotfix" | null,
+  incident_id: string | null
+}
+```
+
+**内部数据结构**：
+- `diagnosisContext`: 根因分析结果
+- `impactContext`: 影响评估结果
+- `fixPlan`: 修复计划对象
+- `executionContext`: 传递给lite-execute的上下文
+
+**输出**：
+```javascript
+{
+  // In-memory (传递给lite-execute)
+  executionContext: {...},
+
+  // Optional: Persistent JSON
+  `.workflow/lite-fixes/BUGFIX-${timestamp}.json`,
+
+  // Hotfix专属：Follow-up tasks
+  `.workflow/lite-fixes/BUGFIX-${timestamp}-followup.json`
+}
+```
+
+### 3. 与lite-execute配合
+
+**lite-fix职责**：
+- ✅ 诊断和规划
+- ✅ 影响评估
+- ✅ 策略选择
+- ✅ 用户确认
+
+**lite-execute职责**：
+- ✅ 代码实现
+- ✅ 测试执行
+- ✅ 分支操作
+- ✅ 部署监控
+
+**交接机制**：
+```javascript
+// lite-fix准备executionContext
+executionContext = {
+  mode: "bugfix",
+  severity: "hotfix",
+  planObject: {...},
+  diagnosisContext: {...},
+  verificationStrategy: {...},
+  branchStrategy: {...}
+}
+
+// 调用lite-execute
+SlashCommand("/workflow:lite-execute --in-memory --mode bugfix")
+```
+
+---
+
+## 架构扩展点
+
+### 1. Enhanced Task JSON Schema扩展
+
+**新增scenario类型**：
+```json
+{
+  "scenario_type": "bugfix",
+  "scenario_config": {
+    "severity": "regular|critical|hotfix",
+    "root_cause": {
+      "file": "...",
+      "line_range": "...",
+      "issue": "..."
+    },
+    "impact_assessment": {
+      "risk_score": 7.1,
+      "affected_users_count": 5000
+    }
+  }
+}
+```
+
+### 2. workflow-session.json扩展（如果使用session模式）
+
+```json
+{
+  "session_id": "WFS-bugfix-payment",
+  "type": "bugfix",
+  "severity": "critical",
+  "incident_id": "INC-2024-1015",
+  "bugfixes": [
+    {
+      "id": "BUGFIX-001",
+      "status": "completed",
+      "follow_up_tasks": ["FOLLOWUP-001", "FOLLOWUP-002"]
+    }
+  ]
+}
+```
+
+### 3. 诊断缓存机制（高级特性）
+
+**设计目的**：加速相似bug的诊断
+
+```javascript
+cache_key = hash(bug_keywords + recent_changes_hash)
+cache_path = `.workflow/lite-fixes/diagnosis-cache/${cache_key}.json`
+
+if (cache_exists && cache_age < 1_week) {
+  diagnosis = load_from_cache()
+  console.log("Using cached diagnosis (similar issue found)")
+}
+```
+
+---
+
+## 质量保证机制
+
+### 1. 质量门禁
+
+**执行前检查**：
+- [ ] 根因识别置信度>80%
+- [ ] 影响范围明确定义
+- [ ] 修复策略已审查批准
+- [ ] 验证计划与风险匹配
+- [ ] 分支策略符合严重度
+
+**Hotfix专属门禁**：
+- [ ] 事故工单已创建并关联
+- [ ] 事故指挥官批准
+- [ ] 回滚计划已文档化
+- [ ] 跟进任务已生成
+- [ ] 部署后监控已配置
+
+### 2. 错误处理策略
+
+| 错误 | 原因 | 解决方案 |
+|------|------|---------|
+| 根因未找到 | 搜索范围不足 | 扩大搜索或升级到/workflow:plan |
+| 复现失败 | 过期bug或环境问题 | 验证环境，请求更新复现步骤 |
+| 多个潜在原因 | 复杂交互 | 使用/cli:discuss-plan多模型分析 |
+| 修复过于复杂 | 需要重构 | 建议/workflow:plan --mode refactor |
+| Hotfix验证失败 | Smoke测试不足 | 添加关键测试或降级到critical模式 |
+
+---
+
+## 实现优先级和路线图
+
+### Phase 1: 核心功能实现（Sprint 1）
+
+**必需功能**：
+- [x] 命令文档完成（本文档）
+- [ ] 六阶段流程实现
+- [ ] 三种严重度模式支持
+- [ ] 基础诊断逻辑
+- [ ] 与lite-execute集成
+
+**预计工作量**：5-8天
+
+### Phase 2: 高级特性（Sprint 2）
+
+**增强功能**：
+- [ ] 诊断缓存机制
+- [ ] 自动严重度检测
+- [ ] Hotfix分支管理
+- [ ] 实时监控集成
+- [ ] 跟进任务自动生成
+
+**预计工作量**：3-5天
+
+### Phase 3: 优化和完善（Sprint 3）
+
+**优化项**：
+- [ ] 性能优化（诊断速度）
+- [ ] 错误处理完善
+- [ ] 文档和示例补充
+- [ ] 用户反馈迭代
+
+**预计工作量**：2-3天
+
+---
+
+## 成功度量指标
+
+### 1. 使用指标
+
+- **采用率**：lite-fix使用次数 / 总bug修复次数
+- **目标**：>60%的常规bug使用lite-fix
+
+### 2. 效率指标
+
+- **Regular模式**：平均修复时间<3小时（vs 手动4-6小时）
+- **Critical模式**：平均修复时间<1小时（vs 手动2-3小时）
+- **Hotfix模式**：平均修复时间<30分钟（vs 手动1-2小时）
+
+### 3. 质量指标
+
+- **诊断准确率**：根因识别准确率>85%
+- **修复成功率**：首次修复成功率>90%
+- **回归率**：引入新bug的比例<5%
+
+### 4. 跟进完成率（Hotfix）
+
+- **跟进任务完成率**：>80%的跟进任务在deadline前完成
+- **技术债务偿还**：Hotfix的完善修复在3天内完成率>70%
+
+---
+
+## 风险和缓解措施
+
+### 风险1: Hotfix误用导致生产问题
+
+**缓解措施**：
+1. 严格的质量门禁（需要事故指挥官批准）
+2. 强制回滚计划文档化
+3. 实时监控自动回滚触发器
+4. 强制生成跟进任务
+
+### 风险2: 诊断准确率不足
+
+**缓解措施**：
+1. 诊断置信度评分（<80%建议升级到/workflow:plan）
+2. 提供多假设诊断（当不确定时）
+3. 集成/cli:discuss-plan用于复杂case
+
+### 风险3: 用户跳过必要验证
+
+**缓解措施**：
+1. 根据严重度强制最低验证级别
+2. 显示风险警告（跳过测试的后果）
+3. Hotfix模式禁止跳过smoke测试
+
+---
+
+## 与PLANNING_GAP_ANALYSIS的对应关系
+
+本设计是对 `PLANNING_GAP_ANALYSIS.md` 中 **场景8: 紧急修复场景** 的完整实现方案。
+
+**Gap覆盖情况**：
+
+| Gap项 | 覆盖程度 | 实现方式 |
+|-------|---------|---------|
+| 流程简化 | ✅ 100% | 三种严重度模式 |
+| 快速验证 | ✅ 100% | 分级验证策略 |
+| Hotfix分支管理 | ✅ 100% | branchStrategy配置 |
+| 事后补充完整修复 | ✅ 100% | 自动跟进任务生成 |
+
+**额外增强**：
+- ✅ 诊断缓存机制（未在原gap中提及）
+- ✅ 实时监控集成（超出原需求）
+- ✅ 自动严重度建议（智能化增强）
+
+---
+
+## 设计决策记录
+
+### ADR-001: 为什么不复用/workflow:plan?
+
+**决策**：创建独立的lite-fix命令
+
+**理由**：
+1. Bug修复的核心是"诊断"，而plan的核心是"设计"
+2. Bug修复需要风险分级（Regular/Critical/Hotfix），plan不需要
+3. Bug修复需要Hotfix分支管理，plan使用标准feature分支
+4. Bug修复需要跟进任务机制，plan假设一次性完成
+
+**替代方案被拒绝**：`/workflow:plan --mode bugfix`
+- 问题：plan流程太重，不适合15分钟hotfix场景
+
+### ADR-002: 为什么分三种严重度而不是连续评分?
+
+**决策**：使用离散的Regular/Critical/Hotfix模式
+
+**理由**：
+1. 清晰的决策点（用户容易选择）
+2. 每个模式有明确的流程差异
+3. 避免"评分8.5应该用什么流程"的模糊性
+
+**替代方案被拒绝**：0-10连续评分自动选择流程
+- 问题：评分边界模糊，用户难以理解为什么某个分数用某个流程
+
+### ADR-003: 为什么诊断是必需而非可选?
+
+**决策**：Phase 1诊断在所有模式下都是必需的（但复杂度可调整）
+
+**理由**：
+1. 即使是已知bug，也需要验证复现路径
+2. 诊断输出对后续修复质量至关重要
+3. 跳过诊断容易导致修错问题（修了A却没修B）
+
+**替代方案被拒绝**：Hotfix模式完全跳过诊断
+- 问题：增加误修风险，事后难以生成postmortem
+
+---
+
+## 总结
+
+`/workflow:lite-fix` 是对当前planning系统的重要补充，专注于bug修复这一特定场景。设计充分借鉴了lite-plan的成功经验，同时针对bug修复的特殊需求进行了优化：
+
+**核心优势**：
+1. ⚡ **快速响应**：15分钟到4小时的修复周期
+2. 🎯 **风险感知**：三种严重度模式适配不同紧急程度
+3. 🔍 **智能诊断**：结构化根因分析
+4. 🛡️ **质量保证**：分级验证+强制门禁
+5. 📋 **技术债务管理**：Hotfix自动生成完善任务
+
+**与现有系统协同**：
+- 与 `/workflow:lite-plan` 形成"修复-开发"双子命令
+- 复用 `/workflow:lite-execute` 执行层
+- 集成 `/cli:mode:bug-diagnosis` 诊断能力
+- 支持升级到 `/workflow:plan` 处理复杂场景
+
+**预期影响**：
+- 减少bug修复时间50-70%
+- 提升诊断准确率到85%+
+- 减少生产hotfix风险
+- 系统化管理技术债务
+
+---
+
+**文档版本**: 1.0.0
+**作者**: Claude (Sonnet 4.5)
+**审阅状态**: 待审阅
+**实现状态**: 设计阶段（待开发）