Files
Claude-Code-Workflow/.codex/skills/investigate/phases/03-hypothesis-testing.md
catlog22 67ff3fe339 feat: add investigate, security-audit, ship skills (Claude + Codex)
- Add 3 new Claude skills: investigate (Iron Law debugging), security-audit
  (OWASP Top 10 + STRIDE), ship (gated release pipeline)
- Port all 3 skills to Codex v4 format under .codex/skills/ using
  Deep Interaction pattern (spawn_agent + assign_task phase transitions)
- Update README/README_CN acknowledgments: credit gstack
  (https://github.com/garrytan/gstack) as inspiration source

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 10:31:13 +08:00

7.0 KiB

Phase 3: Hypothesis Testing

COMPACT PROTECTION: This is a core execution phase. If context compression has occurred and this file is only a summary, MUST Read this file again before executing any Step. Do not execute from memory.

Form hypotheses from evidence and test each one. Enforce the 3-strike escalation rule.

Objective

  • Form a maximum of 3 hypotheses from Phase 1-2 evidence
  • Test each hypothesis with minimal, read-only probes
  • Confirm or reject each hypothesis with concrete evidence
  • Enforce 3-strike rule: STOP and escalate after 3 consecutive unproductive test failures

Input

Source Required Description
investigation-report (phases 1-2) Yes Evidence, affected files, pattern analysis, initial suspects
assign_task message Yes Phase 3 instruction

Execution Steps

Step 1: Form Hypotheses

Using evidence from Phase 1 (investigation report) and Phase 2 (pattern analysis), form up to 3 ranked hypotheses:

Hypothesis formation rules:

  • Each hypothesis must cite at least one piece of evidence from Phase 1-2
  • Each hypothesis must have a testable prediction
  • Rank by confidence (high first)
  • Maximum 3 hypotheses per investigation

Assemble hypotheses in memory:

hypotheses = [
  {
    id: "H1",
    description: "The root cause is <X> because evidence <Y>",
    evidence_supporting: ["<evidence item 1>", "<evidence item 2>"],
    predicted_behavior: "If H1 is correct, then we should observe <Z>",
    test_method: "How to verify: read file <X> line <Y>, check value <Z>",
    confidence: "high|medium|low"
  }
]

Initialize strike counter: 0


Step 2: Test Hypotheses Sequentially

Test each hypothesis starting from highest confidence (H1 first). Use read-only probes only during testing.

Allowed test methods:

Method Usage
Read a specific file Check a specific value, condition, or code pattern
Grep for a pattern Confirm or deny the presence of a condition
Bash targeted test Run a specific test that reveals the condition
Temporary log statement Add a log to observe runtime behavior; MUST revert after

Prohibited during hypothesis testing:

  • Modifying production code (save for Phase 4)
  • Changing multiple things at once
  • Running the full test suite (targeted checks only)

Step 3: Record Test Results

For each hypothesis test, record:

hypothesis_test = {
  id: "H1",
  test_performed: "<what was checked, e.g.: Read src/caller.ts:42 — checked null handling>",
  result: "confirmed|rejected|inconclusive",
  evidence: "<specific observation that confirms or rejects>",
  files_checked: ["<src/caller.ts:42-55>"]
}

Step 4: 3-Strike Escalation Rule

Track consecutive unproductive test failures. After each hypothesis test, evaluate:

Strike evaluation:

Test result New insight gained Strike action
confirmed CONFIRM root cause, end testing
rejected Yes — narrows search or reveals new cause No strike (productive rejection)
rejected No — no actionable insight +1 strike
inconclusive Yes — identifies new area No strike (productive)
inconclusive No — no narrowing +1 strike

Strike counter tracking:

Strike count Action
1 Continue to next hypothesis
2 Continue to next hypothesis
3 STOP — output escalation block immediately

On 3rd Strike — output this escalation block verbatim and halt:

## ESCALATION: 3-Strike Limit Reached

### Failed Step
- Phase: 3 — Hypothesis Testing
- Step: Hypothesis test #<N>

### Error History
1. Attempt 1: <H1 description>
   Test: <what was checked>
   Result: <rejected/inconclusive> — <why>
2. Attempt 2: <H2 description>
   Test: <what was checked>
   Result: <rejected/inconclusive> — <why>
3. Attempt 3: <H3 description>
   Test: <what was checked>
   Result: <rejected/inconclusive> — <why>

### Current State
- Evidence collected: <summary from Phase 1-2>
- Hypotheses tested: <list>
- Files examined: <list>

### Diagnosis
- Likely root cause area: <best guess based on all evidence>
- Suggested human action: <specific recommendation — e.g., "Add logging to X", "Check runtime config Y", "Reproduce in debugger at Z">

### Diagnostic Dump
<Full investigation-report content from all phases>

STATUS: BLOCKED

After outputting escalation: set status BLOCKED. Do not proceed to Phase 4.


Step 5: Confirm Root Cause

If a hypothesis is confirmed, document the confirmed root cause:

confirmed_root_cause = {
  hypothesis_id: "H1",
  description: "<Root cause description with full technical detail>",
  evidence_chain: [
    "Phase 1: <Error message X observed in Y>",
    "Phase 2: <Same pattern found in N other files>",
    "Phase 3: H1 confirmed — <specific condition at file.ts:42>"
  ],
  affected_code: {
    file: "<path/to/file.ts>",
    line_range: "<42-55>",
    function: "<functionName>"
  }
}

Add hypothesis_tests and confirmed_root_cause to investigation-report in memory.

Output Phase 3 results and await assign_task for Phase 4.


Output

Artifact Format Description
investigation-report (phase 3) In-memory JSON Phases 1-2 fields + hypothesis_tests + confirmed_root_cause
Phase 3 summary or escalation block Structured text output Either confirmed root cause or BLOCKED escalation

Success Criteria

Criterion Validation Method
Maximum 3 hypotheses formed Count of hypotheses array
Each hypothesis cites evidence evidence_supporting non-empty for each
Each hypothesis tested with documented probe test_performed field populated for each
Strike counter maintained correctly Count of unproductive consecutive failures
Root cause confirmed with evidence chain OR escalation triggered confirmed_root_cause present OR BLOCKED output

Error Handling

Scenario Resolution
Evidence insufficient to form 3 hypotheses Form as many as evidence supports (minimum 1), proceed
Partial insight from rejected hypothesis Do not count as strike; re-form or refine remaining hypotheses with new insight
All 3 hypotheses confirmed simultaneously Use highest-confidence confirmed one as root cause
Hypothesis test requires production change Prohibited — use static analysis or targeted read-only probe instead

Gate for Phase 4

Phase 4 can ONLY proceed if confirmed_root_cause is present. This is the Iron Law gate.

Outcome Next Step
Root cause confirmed -> Phase 4: Implementation
3-strike escalation triggered STOP — output diagnostic dump — STATUS: BLOCKED
Partial insight, re-forming hypotheses Stay in Phase 3, re-test with refined hypotheses

Next Phase

-> Phase 4: Implementation ONLY with confirmed root cause.