Claude-Code-Workflow/.claude/skills/investigate/phases/03-hypothesis-testing.md
catlog22 67ff3fe339 feat: add investigate, security-audit, ship skills (Claude + Codex)
- Add 3 new Claude skills: investigate (Iron Law debugging), security-audit
  (OWASP Top 10 + STRIDE), ship (gated release pipeline)
- Port all 3 skills to Codex v4 format under .codex/skills/ using
  Deep Interaction pattern (spawn_agent + assign_task phase transitions)
- Update README/README_CN acknowledgments: credit gstack
  (https://github.com/garrytan/gstack) as inspiration source

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 10:31:13 +08:00

Phase 3: Hypothesis Testing

Form hypotheses from evidence and test each one. Enforce the 3-strike escalation rule.

Objective

  • Form a maximum of 3 hypotheses from Phase 1-2 evidence
  • Test each hypothesis with minimal, read-only probes
  • Confirm or reject each hypothesis with concrete evidence
  • Enforce 3-strike rule: STOP and escalate after 3 consecutive test failures

Execution Steps

Step 1: Form Hypotheses

Using evidence from Phase 1 (investigation report) and Phase 2 (pattern analysis), form up to 3 ranked hypotheses:

{
  "hypotheses": [
    {
      "id": "H1",
      "description": "The root cause is X because evidence Y",
      "evidence_supporting": ["evidence item 1", "evidence item 2"],
      "predicted_behavior": "If H1 is correct, then we should observe Z",
      "test_method": "How to verify: read file X line Y, check value Z",
      "confidence": "high|medium|low"
    }
  ]
}

Hypothesis Formation Rules:

  • Each hypothesis must cite at least one piece of evidence from Phase 1-2
  • Each hypothesis must have a testable prediction
  • Rank by confidence (high first)
  • Maximum 3 hypotheses per investigation
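
The formation rules above can be checked mechanically before any testing starts. The following is a minimal sketch, assuming the hypothesis objects follow the JSON shape shown in Step 1; the names `Hypothesis`, `validateHypotheses`, and `rankByConfidence` are hypothetical and not part of the skill itself.

```typescript
// Hypothetical types mirroring the Step 1 JSON shape; names are illustrative.
type Confidence = "high" | "medium" | "low";

interface Hypothesis {
  id: string;
  description: string;
  evidence_supporting: string[];
  predicted_behavior: string;
  test_method: string;
  confidence: Confidence;
}

const MAX_HYPOTHESES = 3;
const CONFIDENCE_RANK: Record<Confidence, number> = { high: 0, medium: 1, low: 2 };

// Returns a list of rule violations; an empty list means the set is valid.
function validateHypotheses(hypotheses: Hypothesis[]): string[] {
  const problems: string[] = [];
  if (hypotheses.length > MAX_HYPOTHESES) {
    problems.push(`too many hypotheses: ${hypotheses.length} > ${MAX_HYPOTHESES}`);
  }
  for (const h of hypotheses) {
    if (h.evidence_supporting.length === 0) {
      problems.push(`${h.id}: cites no Phase 1-2 evidence`);
    }
    if (h.predicted_behavior.trim() === "") {
      problems.push(`${h.id}: has no testable prediction`);
    }
  }
  return problems;
}

// Orders hypotheses high, then medium, then low, without mutating the input.
function rankByConfidence(hypotheses: Hypothesis[]): Hypothesis[] {
  return [...hypotheses].sort(
    (a, b) => CONFIDENCE_RANK[a.confidence] - CONFIDENCE_RANK[b.confidence]
  );
}
```

Returning a list of violations rather than throwing on the first one lets a single pass report every broken rule.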

Step 2: Test Hypotheses Sequentially

Test each hypothesis starting from highest confidence. Use read-only probes:

Allowed test methods:

  • Read a specific file and check a specific value or condition
  • Grep for a pattern that would confirm or deny the hypothesis
  • Bash to run a specific test or command that reveals the condition
  • Temporarily add a log statement to observe runtime behavior (the one permitted edit; revert it immediately after the observation)

Prohibited during testing:

  • Modifying production code beyond a temporary log statement (save fixes for Phase 4)
  • Changing multiple things at once
  • Running the full test suite (targeted checks only)

// Example hypothesis test (pseudocode: Read is the agent's file-read tool, not a library call)
// H1: "Function X receives null because caller Y doesn't check return value"
const evidence = Read({ file_path: "src/caller.ts" })
// Check: Does caller Y use the return value without a null check?
// Result: confirmed / rejected / inconclusive, with specific evidence
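
Once the file text has been read, the check itself can be a pure function over that text, which keeps the probe read-only by construction. This is a rough sketch only: `probeLines` and the sample `callerSource` are hypothetical, and the regular expressions are illustrative rather than a reliable null-check detector.

```typescript
interface ProbeResult {
  matched: boolean;
  // 1-based line numbers inside the inspected range where the pattern matched
  lines: number[];
}

// Checks lines [start, end] (1-based, inclusive) of already-read file text
// against a pattern. It only inspects the string, so nothing is modified.
function probeLines(source: string, start: number, end: number, pattern: RegExp): ProbeResult {
  const all = source.split("\n");
  const lines: number[] = [];
  for (let n = Math.max(1, start); n <= Math.min(end, all.length); n++) {
    const text = all[n - 1] ?? "";
    // String.prototype.search always scans from index 0, so a global flag
    // on the pattern cannot carry state between lines.
    if (text.search(pattern) >= 0) lines.push(n);
  }
  return { matched: lines.length > 0, lines };
}

// Hypothetical stand-in for the text the Read tool would return.
const callerSource = ["const user = findUser(id);", "return user.name;"].join("\n");

// H1 predicts a dereference of `user` with no guard in lines 1-2.
const deref = probeLines(callerSource, 1, 2, /user\.name/);
const guard = probeLines(callerSource, 1, 2, /if\s*\(\s*!?\s*user\b/);
const h1Supported = deref.matched && !guard.matched;
```

Here `h1Supported` is true because line 2 dereferences `user` and no guard appears in the range; the matched line numbers can be copied into `files_checked` as evidence.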

Step 3: Record Test Results

For each hypothesis test:

{
  "hypothesis_tests": [
    {
      "id": "H1",
      "test_performed": "Read src/caller.ts:42 - checked null handling",
      "result": "confirmed|rejected|inconclusive",
      "evidence": "specific observation that confirms or rejects",
      "files_checked": ["src/caller.ts:42-55"]
    }
  ]
}

Step 4: 3-Strike Escalation Rule

Track consecutive test failures. A test counts as a "failure" only when its result is inconclusive or rejected and it yields no actionable insight.

Strike Counter:
  [H1 rejected, no insight] → Strike 1
  [H2 rejected, no insight] → Strike 2
  [H3 rejected, no insight] → Strike 3 → STOP

Important: A rejected hypothesis that provides useful insight (narrows the search) does NOT count as a strike. Only truly unproductive tests count.
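
The strike rule can be sketched as a small counter over the recorded tests. The `gainedInsight` flag and the function names below are hypothetical. The sketch also assumes that a productive test resets the run, which the text implies by "consecutive" but does not state outright.

```typescript
type TestOutcome = "confirmed" | "rejected" | "inconclusive";

interface TestRecord {
  id: string;
  result: TestOutcome;
  // Assumption: the tester records whether the test narrowed the search.
  gainedInsight: boolean;
}

const STRIKE_LIMIT = 3;

// Counts consecutive unproductive tests, stopping as soon as the limit is hit.
// Assumption: any productive test (confirmed, or insight gained) resets the run.
function countStrikes(records: TestRecord[]): number {
  let strikes = 0;
  for (const r of records) {
    const unproductive = r.result !== "confirmed" && !r.gainedInsight;
    strikes = unproductive ? strikes + 1 : 0;
    if (strikes >= STRIKE_LIMIT) break;
  }
  return strikes;
}

function mustEscalate(records: TestRecord[]): boolean {
  return countStrikes(records) >= STRIKE_LIMIT;
}
```

Under these assumptions, three rejected tests with no insight trigger escalation, while a rejected test that narrows the search in between does not.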

On 3rd Strike — STOP and Escalate:

## ESCALATION: 3-Strike Limit Reached

### Failed Step
- Phase: 3 — Hypothesis Testing
- Step: Hypothesis test #{N}

### Error History
1. Attempt 1: H1 — {description}
   Test: {what was checked}
   Result: {rejected/inconclusive} — {why}
2. Attempt 2: H2 — {description}
   Test: {what was checked}
   Result: {rejected/inconclusive} — {why}
3. Attempt 3: H3 — {description}
   Test: {what was checked}
   Result: {rejected/inconclusive} — {why}

### Current State
- Evidence collected: {summary from Phase 1-2}
- Hypotheses tested: {list}
- Files examined: {list}

### Diagnosis
- Likely root cause area: {best guess based on all evidence}
- Suggested human action: {specific recommendation — e.g., "Add logging to X", "Check runtime config Y", "Reproduce in debugger at Z"}

### Diagnostic Dump
{Full investigation-report.json content}

After escalation, set status to BLOCKED per Completion Status Protocol.

Step 5: Confirm Root Cause

If a hypothesis is confirmed, document the confirmed root cause:

{
  "phase": 3,
  "confirmed_root_cause": {
    "hypothesis_id": "H1",
    "description": "Root cause description with full evidence chain",
    "evidence_chain": [
      "Phase 1: Error message X observed in Y",
      "Phase 2: Same pattern found in 3 other files",
      "Phase 3: H1 confirmed — null check missing at file.ts:42"
    ],
    "affected_code": {
      "file": "path/to/file.ts",
      "line_range": "42-55",
      "function": "functionName"
    }
  }
}

Output

  • Data: hypothesis_tests and confirmed_root_cause added to investigation report (in-memory)
  • Format: JSON structure as defined above

Gate for Phase 4

Phase 4 can ONLY proceed if confirmed_root_cause is present. This is the Iron Law gate.

| Outcome | Next Step |
|---------|-----------|
| Root cause confirmed | Proceed to Phase 4: Implementation |
| 3-strike escalation | STOP, output diagnostic dump, status = BLOCKED |
| Partial insight | Re-form hypotheses with new evidence (stays in Phase 3) |
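
The gate can be read as a three-way decision on the in-memory report. A minimal sketch, assuming a hypothetical `strikes` field that holds the current count of consecutive unproductive tests; the `NextStep` labels are illustrative names for the three outcomes, not identifiers defined by the skill.

```typescript
interface ConfirmedRootCause {
  hypothesis_id: string;
  description: string;
  evidence_chain: string[];
}

interface Phase3Report {
  confirmed_root_cause?: ConfirmedRootCause;
  // Hypothetical field: consecutive unproductive tests so far.
  strikes: number;
}

type NextStep = "PROCEED_TO_PHASE_4" | "BLOCKED" | "REFORM_HYPOTHESES";

// A confirmed root cause opens the gate; three strikes block the
// investigation; anything else stays in Phase 3 with new hypotheses.
function nextStep(report: Phase3Report): NextStep {
  if (report.confirmed_root_cause !== undefined) return "PROCEED_TO_PHASE_4";
  if (report.strikes >= 3) return "BLOCKED";
  return "REFORM_HYPOTHESES";
}
```

Checking for the confirmed root cause first means a confirmation on the third test still proceeds to Phase 4 rather than escalating.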

Quality Checks

  • Maximum 3 hypotheses formed, each with cited evidence
  • Each hypothesis tested with a specific, documented probe
  • Test results recorded with concrete evidence
  • 3-strike counter maintained correctly
  • Root cause confirmed with full evidence chain OR escalation triggered

Next Phase

Proceed to Phase 4: Implementation ONLY with confirmed root cause.