diff --git a/.ccw/workflows/context-tools.md b/.ccw/workflows/context-tools.md deleted file mode 100644 index 7db3e70f..00000000 --- a/.ccw/workflows/context-tools.md +++ /dev/null @@ -1,76 +0,0 @@ -## Context Acquisition (MCP Tools Priority) - -**For task context gathering and analysis, ALWAYS prefer MCP tools**: - -1. **mcp__ace-tool__search_context** - HIGHEST PRIORITY for code discovery - - Semantic search with real-time codebase index - - Use for: finding implementations, understanding architecture, locating patterns - - Example: `mcp__ace-tool__search_context(project_root_path="/path", query="authentication logic")` - -2. **smart_search** - Fallback for structured search - - Use `smart_search(query="...")` for keyword/regex search - - Use `smart_search(action="find_files", pattern="*.ts")` for file discovery - - Supports modes: `auto`, `hybrid`, `exact`, `ripgrep` - -3. **read_file** - Batch file reading - - Read multiple files in parallel: `read_file(path="file1.ts")`, `read_file(path="file2.ts")` - - Supports glob patterns: `read_file(path="src/**/*.config.ts")` - -**Priority Order**: -``` -ACE search_context (semantic) → smart_search (structured) → read_file (batch read) → shell commands (fallback) -``` - -**NEVER** use shell commands (`cat`, `find`, `grep`) when MCP tools are available. 
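The priority order above is essentially a fallback chain: try the highest-priority tier and escalate only on failure. A minimal sketch of that dispatch logic follows — note that the tiers passed in are hypothetical async wrappers standing in for the real MCP calls (ACE `search_context`, `smart_search`, `read_file`, shell), whose exact signatures are not reproduced here:

```javascript
// Illustrative sketch of the priority chain: each tier is a hypothetical
// async wrapper around one level (semantic search -> structured search ->
// batch read -> shell fallback). Tier names/signatures are placeholders.
async function gatherContext(query, tiers) {
  for (const tier of tiers) {
    try {
      const result = await tier(query);
      if (result != null) return result; // First tier that answers wins.
    } catch (err) {
      // Tier unavailable or failed — fall through to the next priority level.
    }
  }
  throw new Error(`All context tiers failed for query: ${query}`);
}
```

Callers would pass tiers in the order listed — semantic search first, shell commands last — so lower-priority tools run only when everything above them fails or returns nothing.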
-### read_file - Read File Contents - -**When**: Read files found by smart_search - -**How**: -```javascript -read_file(path="/path/to/file.ts") // Single file -read_file(path="/src/**/*.config.ts") // Pattern matching -``` - ---- - -### edit_file - Modify Files - -**When**: Built-in Edit tool fails or need advanced features - -**How**: -```javascript -edit_file(path="/file.ts", old_string="...", new_string="...", mode="update") -edit_file(path="/file.ts", line=10, content="...", mode="insert_after") -``` - -**Modes**: `update` (replace text), `insert_after`, `insert_before`, `delete_line` - ---- - -### write_file - Create/Overwrite Files - -**When**: Create new files or completely replace content - -**How**: -```javascript -write_file(path="/new-file.ts", content="...") -``` - ---- - -### Exa - External Search - -**When**: Find documentation/examples outside codebase - -**How**: -```javascript -mcp__exa__search(query="React hooks 2025 documentation") -mcp__exa__search(query="FastAPI auth example", numResults=10) -mcp__exa__search(query="latest API docs", livecrawl="always") -``` - -**Parameters**: -- `query` (required): Search query string -- `numResults` (optional): Number of results to return (default: 5) -- `livecrawl` (optional): `"always"` or `"fallback"` for live crawling diff --git a/.ccw/workflows/file-modification.md b/.ccw/workflows/file-modification.md deleted file mode 100644 index 23bd6bf3..00000000 --- a/.ccw/workflows/file-modification.md +++ /dev/null @@ -1,64 +0,0 @@ -# File Modification - -Before modifying files, always: -- Try built-in Edit tool first -- Escalate to MCP tools when built-ins fail -- Use write_file only as last resort - -## MCP Tools Usage - -### edit_file - Modify Files - -**When**: Built-in Edit fails, need dry-run preview, or need line-based operations - -**How**: -```javascript -edit_file(path="/file.ts", oldText="old", newText="new") // Replace text -edit_file(path="/file.ts", oldText="old", newText="new", dryRun=true) // 
Preview diff -edit_file(path="/file.ts", oldText="old", newText="new", replaceAll=true) // Replace all -edit_file(path="/file.ts", mode="line", operation="insert_after", line=10, text="new line") -edit_file(path="/file.ts", mode="line", operation="delete", line=5, end_line=8) -``` - -**Modes**: `update` (replace text, default), `line` (line-based operations) - -**Operations** (line mode): `insert_before`, `insert_after`, `replace`, `delete` - ---- - -### write_file - Create/Overwrite Files - -**When**: Create new files, completely replace content, or edit_file still fails - -**How**: -```javascript -write_file(path="/new-file.ts", content="file content here") -write_file(path="/existing.ts", content="...", backup=true) // Create backup first -``` - ---- - -## Priority Logic - -> **Note**: Search priority is defined in `context-tools.md` - smart_search has HIGHEST PRIORITY for all discovery tasks. - -**Search & Discovery** (defer to context-tools.md): -1. **smart_search FIRST** for any code/file discovery -2. Built-in Grep only for single-file exact line search (location already confirmed) -3. Exa for external/public knowledge - -**File Reading**: -1. Unknown location → **smart_search first**, then Read -2. Known confirmed file → Built-in Read directly -3. Pattern matching → smart_search (action="find_files") - -**File Editing**: -1. Always try built-in Edit first -2. Fails 1+ times → edit_file (MCP) -3. 
Still fails → write_file (MCP) - -## Decision Triggers - -**Search tasks** → Always start with smart_search (per context-tools.md) -**Known file edits** → Start with built-in Edit, escalate to MCP if fails -**External knowledge** → Use Exa diff --git a/.ccw/workflows/review-directory-specification.md b/.ccw/workflows/review-directory-specification.md deleted file mode 100644 index e62ed692..00000000 --- a/.ccw/workflows/review-directory-specification.md +++ /dev/null @@ -1,336 +0,0 @@ -# Review Directory Specification - -## Overview - -Unified directory structure for all review commands (session-based and module-based) within workflow sessions. - -## Core Principles - -1. **Session-Based**: All reviews run within a workflow session context -2. **Unified Structure**: Same directory layout for all review types -3. **Type Differentiation**: Review type indicated by metadata, not directory structure -4. **Progressive Creation**: Directories created on-demand during review execution -5. **Archive Support**: Reviews archived with their parent session - -## Directory Structure - -### Base Location -``` -.workflow/active/WFS-{session-id}/.review/ -``` - -### Complete Structure -``` -.workflow/active/WFS-{session-id}/.review/ -├── review-state.json # Review orchestrator state machine -├── review-progress.json # Real-time progress for dashboard polling -├── review-metadata.json # Review configuration and scope -├── dimensions/ # Per-dimension analysis results -│ ├── security.json -│ ├── architecture.json -│ ├── quality.json -│ ├── action-items.json -│ ├── performance.json -│ ├── maintainability.json -│ └── best-practices.json -├── iterations/ # Deep-dive iteration results -│ ├── iteration-1-finding-{uuid}.json -│ ├── iteration-2-finding-{uuid}.json -│ └── ... -├── reports/ # Human-readable reports -│ ├── security-analysis.md -│ ├── security-cli-output.txt -│ ├── architecture-analysis.md -│ ├── architecture-cli-output.txt -│ ├── ... 
-│ ├── deep-dive-1-{uuid}.md -│ └── deep-dive-2-{uuid}.md -├── REVIEW-SUMMARY.md # Final consolidated summary -└── dashboard.html # Interactive review dashboard -``` - -## Review Metadata Schema - -**File**: `review-metadata.json` - -```json -{ - "review_id": "review-20250125-143022", - "review_type": "module|session", - "session_id": "WFS-auth-system", - "created_at": "2025-01-25T14:30:22Z", - "scope": { - "type": "module|session", - "module_scope": { - "target_pattern": "src/auth/**", - "resolved_files": [ - "src/auth/service.ts", - "src/auth/validator.ts" - ], - "file_count": 2 - }, - "session_scope": { - "commit_range": "abc123..def456", - "changed_files": [ - "src/auth/service.ts", - "src/payment/processor.ts" - ], - "file_count": 2 - } - }, - "dimensions": ["security", "architecture", "quality", "action-items", "performance", "maintainability", "best-practices"], - "max_iterations": 3, - "cli_tools": { - "primary": "gemini", - "fallback": ["qwen", "codex"] - } -} -``` - -## Review State Schema - -**File**: `review-state.json` - -```json -{ - "review_id": "review-20250125-143022", - "phase": "init|parallel|aggregate|iterate|complete", - "current_iteration": 1, - "dimensions_status": { - "security": "pending|in_progress|completed|failed", - "architecture": "completed", - "quality": "in_progress", - "action-items": "pending", - "performance": "pending", - "maintainability": "pending", - "best-practices": "pending" - }, - "severity_distribution": { - "critical": 2, - "high": 5, - "medium": 12, - "low": 8 - }, - "critical_files": [ - "src/auth/service.ts", - "src/payment/processor.ts" - ], - "iterations": [ - { - "iteration": 1, - "findings_selected": ["uuid-1", "uuid-2", "uuid-3"], - "completed_at": "2025-01-25T15:30:00Z" - } - ], - "completion_criteria": { - "critical_count": 0, - "high_count_threshold": 5, - "max_iterations": 3 - }, - "next_action": "execute_parallel_reviews|aggregate_findings|execute_deep_dive|generate_final_report|complete" -} -``` - -## 
Session Integration - -### Session Discovery - -**review-session-cycle** (auto-discover): -```bash -# Auto-detect active session -/workflow:review-session-cycle - -# Or specify session explicitly -/workflow:review-session-cycle WFS-auth-system -``` - -**review-module-cycle** (require session): -```bash -# Must have active session or specify one -/workflow:review-module-cycle src/auth/** --session WFS-auth-system - -# Or use active session -/workflow:review-module-cycle src/auth/** -``` - -### Session Creation Logic - -**For review-module-cycle**: - -1. **Check Active Session**: Search `.workflow/active/WFS-*` -2. **If Found**: Use active session's `.review/` directory -3. **If Not Found**: - - **Option A** (Recommended): Prompt user to create session first - - **Option B**: Auto-create review-only session: `WFS-review-{pattern-hash}` - -**Recommended Flow**: -```bash -# Step 1: Start session -/workflow:session:start --new "Review auth module" -# Creates: .workflow/active/WFS-review-auth-module/ - -# Step 2: Run review -/workflow:review-module-cycle src/auth/** -# Creates: .workflow/active/WFS-review-auth-module/.review/ -``` - -## Command Phase 1 Requirements - -### Both Commands Must: - -1. **Session Discovery**: - ```javascript - // Check for active session - const sessions = Glob('.workflow/active/WFS-*'); - if (sessions.length === 0) { - // Prompt user to create session first - error("No active session found. Please run /workflow:session:start first"); - } - const sessionId = sessions[0].match(/WFS-[^/]+/)[0]; - ``` - -2. **Create .review/ Structure**: - ```javascript - const reviewDir = `.workflow/active/${sessionId}/.review/`; - - // Create directory structure - Bash(`mkdir -p ${reviewDir}/dimensions`); - Bash(`mkdir -p ${reviewDir}/iterations`); - Bash(`mkdir -p ${reviewDir}/reports`); - ``` - -3. 
**Initialize Metadata**: - ```javascript - // Write review-metadata.json - Write(`${reviewDir}/review-metadata.json`, JSON.stringify({ - review_id: `review-${timestamp}`, - review_type: "module|session", - session_id: sessionId, - created_at: new Date().toISOString(), - scope: {...}, - dimensions: [...], - max_iterations: 3, - cli_tools: {...} - })); - - // Write review-state.json - Write(`${reviewDir}/review-state.json`, JSON.stringify({ - review_id: `review-${timestamp}`, - phase: "init", - current_iteration: 0, - dimensions_status: {}, - severity_distribution: {}, - critical_files: [], - iterations: [], - completion_criteria: {}, - next_action: "execute_parallel_reviews" - })); - ``` - -4. **Generate Dashboard**: - ```javascript - const template = Read('~/.claude/templates/review-cycle-dashboard.html'); - const dashboard = template - .replace('{{SESSION_ID}}', sessionId) - .replace('{{REVIEW_TYPE}}', reviewType) - .replace('{{REVIEW_DIR}}', reviewDir); - Write(`${reviewDir}/dashboard.html`, dashboard); - - // Output to user - console.log(`📊 Review Dashboard: file://${absolutePath(reviewDir)}/dashboard.html`); - console.log(`📂 Review Output: ${reviewDir}`); - ``` - -## Archive Strategy - -### On Session Completion - -When `/workflow:session:complete` is called: - -1. **Preserve Review Directory**: - ```javascript - // Move entire session including .review/ - Bash(`mv .workflow/active/${sessionId} .workflow/archives/${sessionId}`); - ``` - -2. **Review Archive Structure**: - ``` - .workflow/archives/WFS-auth-system/ - ├── workflow-session.json - ├── IMPL_PLAN.md - ├── TODO_LIST.md - ├── .task/ - ├── .summaries/ - └── .review/ # Review results preserved - ├── review-metadata.json - ├── REVIEW-SUMMARY.md - └── dashboard.html - ``` - -3. **Access Archived Reviews**: - ```bash - # Open archived dashboard - start .workflow/archives/WFS-auth-system/.review/dashboard.html - ``` - -## Benefits - -### 1. 
Unified Structure -- Same directory layout for all review types -- Consistent file naming and schemas -- Easier maintenance and tooling - -### 2. Session Integration -- Review history tracked with implementation -- Easy correlation between code changes and reviews -- Simplified archiving and retrieval - -### 3. Progressive Creation -- Directories created only when needed -- No upfront overhead -- Clean session initialization - -### 4. Type Flexibility -- Module-based and session-based reviews in same structure -- Type indicated by metadata, not directory layout -- Easy to add new review types - -### 5. Dashboard Consistency -- Same dashboard template for both types -- Unified progress tracking -- Consistent user experience - -## Migration Path - -### For Existing Commands - -**review-session-cycle**: -1. Change output from `.workflow/.reviews/session-{id}/` to `.workflow/active/{session-id}/.review/` -2. Update Phase 1 to use session discovery -3. Add review-metadata.json creation - -**review-module-cycle**: -1. Add session requirement (or auto-create) -2. Change output from `.workflow/.reviews/module-{hash}/` to `.workflow/active/{session-id}/.review/` -3. Update Phase 1 to use session discovery -4. 
Add review-metadata.json creation - -### Backward Compatibility - -**For existing standalone reviews** in `.workflow/.reviews/`: -- Keep for reference -- Document migration in README -- Provide migration script if needed - -## Implementation Checklist - -- [ ] Update workflow-architecture.md with .review/ structure -- [ ] Update review-session-cycle.md command specification -- [ ] Update review-module-cycle.md command specification -- [ ] Update review-cycle-dashboard.html template -- [ ] Create review-metadata.json schema validation -- [ ] Update /workflow:session:complete to preserve .review/ -- [ ] Update documentation examples -- [ ] Test both review types with new structure -- [ ] Validate dashboard compatibility -- [ ] Document migration path for existing reviews diff --git a/.ccw/workflows/task-core.md b/.ccw/workflows/task-core.md deleted file mode 100644 index e78c6663..00000000 --- a/.ccw/workflows/task-core.md +++ /dev/null @@ -1,214 +0,0 @@ -# Task System Core Reference - -## Overview -Task commands provide single-execution workflow capabilities with full context awareness, hierarchical organization, and agent orchestration. 
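As a minimal sketch of the hierarchical organization this reference relies on (uppercase `IMPL-N` main tasks, `IMPL-N.M` subtasks, two levels maximum), an ID validator might look like the following. This is a hypothetical helper for illustration, not part of the task system itself:

```javascript
// Sketch: parse and validate the IMPL-N / IMPL-N.M task-ID convention.
// Enforces the uppercase prefix and the two-level maximum depth.
function parseTaskId(id) {
  const match = /^IMPL-(\d+)(?:\.(\d+))?$/.exec(id);
  if (!match) return null; // Rejects lowercase, deeper nesting, bad prefixes.
  const isSubtask = match[2] !== undefined;
  return {
    id,
    main: Number(match[1]),
    sub: isSubtask ? Number(match[2]) : null,
    isSubtask,
    parent: isSubtask ? `IMPL-${match[1]}` : null, // Container-task ID.
  };
}
```

Anything that fails this check (e.g. `impl-1` or `IMPL-1.2.3`) would be rejected before execution rather than silently creating an invalid hierarchy.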
- -## Task JSON Schema -All task files use this simplified 6-field schema: - -```json -{ - "id": "IMPL-1.2", - "title": "Implement JWT authentication", - "status": "pending|active|completed|blocked|container", - - "meta": { - "type": "feature|bugfix|refactor|test-gen|test-fix|docs", - "agent": "@code-developer|@action-planning-agent|@test-fix-agent|@universal-executor" - }, - - "context": { - "requirements": ["JWT authentication", "OAuth2 support"], - "focus_paths": ["src/auth", "tests/auth", "config/auth.json"], - "acceptance": ["JWT validation works", "OAuth flow complete"], - "parent": "IMPL-1", - "depends_on": ["IMPL-1.1"], - "inherited": { - "from": "IMPL-1", - "context": ["Authentication system design completed"] - }, - "shared_context": { - "auth_strategy": "JWT with refresh tokens" - } - }, - - "flow_control": { - "pre_analysis": [ - { - "step": "gather_context", - "action": "Read dependency summaries", - "command": "bash(cat .workflow/*/summaries/IMPL-1.1-summary.md)", - "output_to": "auth_design_context", - "on_error": "skip_optional" - } - ], - "implementation_approach": [ - { - "step": 1, - "title": "Implement JWT authentication system", - "description": "Implement comprehensive JWT authentication system with token generation, validation, and refresh logic", - "modification_points": ["Add JWT token generation", "Implement token validation middleware", "Create refresh token logic"], - "logic_flow": ["User login request → validate credentials", "Generate JWT access and refresh tokens", "Store refresh token securely", "Return tokens to client"], - "depends_on": [], - "output": "jwt_implementation" - } - ], - "target_files": [ - "src/auth/login.ts:handleLogin:75-120", - "src/middleware/auth.ts:validateToken", - "src/auth/PasswordReset.ts" - ] - } -} -``` - -## Field Structure Details - -### focus_paths Field (within context) -**Purpose**: Specifies concrete project paths relevant to task implementation - -**Format**: -- **Array of strings**: `["folder1",
"folder2", "specific_file.ts"]` -- **Concrete paths**: Use actual directory/file names without wildcards -- **Mixed types**: Can include both directories and specific files -- **Relative paths**: From project root (e.g., `src/auth`, not `./src/auth`) - -**Examples**: -```json -// Authentication system task -"focus_paths": ["src/auth", "tests/auth", "config/auth.json", "src/middleware/auth.ts"] - -// UI component task -"focus_paths": ["src/components/Button", "src/styles", "tests/components"] -``` - -### flow_control Field Structure -**Purpose**: Universal process manager for task execution - -**Components**: -- **pre_analysis**: Array of sequential process steps -- **implementation_approach**: Task execution strategy -- **target_files**: Files to modify/create - existing files in `file:function:lines` format, new files as `file` only - -**Step Structure**: -```json -{ - "step": "gather_context", - "action": "Human-readable description", - "command": "bash(executable command with [variables])", - "output_to": "variable_name", - "on_error": "skip_optional|fail|retry_once|manual_intervention" -} -``` - -## Hierarchical System - -### Task Hierarchy Rules -- **Format**: IMPL-N (main), IMPL-N.M (subtasks) - uppercase required -- **Maximum Depth**: 2 levels only -- **10-Task Limit**: Hard limit enforced across all tasks -- **Container Tasks**: Parents with subtasks (not executable) -- **Leaf Tasks**: No subtasks (executable) -- **File Cohesion**: Related files must stay in same task - -### Task Complexity Classifications -- **Simple**: ≤5 tasks, single-level tasks, direct execution -- **Medium**: 6-10 tasks, two-level hierarchy, context coordination -- **Over-scope**: >10 tasks requires project re-scoping into iterations - -### Complexity Assessment Rules -- **Creation**: System evaluates and assigns complexity -- **10-task limit**: Hard limit enforced - exceeding requires re-scoping -- **Execution**: Can upgrade (Simple→Medium→Over-scope), triggers re-scoping -- 
**Override**: Users can manually specify complexity within 10-task limit - -### Status Rules -- **pending**: Ready for execution -- **active**: Currently being executed -- **completed**: Successfully finished -- **blocked**: Waiting for dependencies -- **container**: Has subtasks (parent only) - -## Session Integration - -### Active Session Detection -```bash -# Check for active session in sessions directory -active_session=$(find .workflow/active/ -name 'WFS-*' -type d 2>/dev/null | head -1) -``` - -### Workflow Context Inheritance -Tasks inherit from: -1. `workflow-session.json` - Session metadata -2. Parent task context (for subtasks) -3. `IMPL_PLAN.md` - Planning document - -### File Locations -- **Task JSON**: `.workflow/active/WFS-[topic]/.task/IMPL-*.json` (uppercase required) -- **Session State**: `.workflow/active/WFS-[topic]/workflow-session.json` -- **Planning Doc**: `.workflow/active/WFS-[topic]/IMPL_PLAN.md` -- **Progress**: `.workflow/active/WFS-[topic]/TODO_LIST.md` - -## Agent Mapping - -### Automatic Agent Selection -- **@code-developer**: Implementation tasks, coding, test writing -- **@action-planning-agent**: Design, architecture planning -- **@test-fix-agent**: Test execution, failure diagnosis, code fixing -- **@universal-executor**: Optional manual review (only when explicitly requested) - -### Agent Context Filtering -Each agent receives tailored context: -- **@code-developer**: Complete implementation details, test requirements -- **@action-planning-agent**: High-level requirements, risks, architecture -- **@test-fix-agent**: Test execution, failure diagnosis, code fixing -- **@universal-executor**: Quality standards, security considerations (when requested) - -## Deprecated Fields - -### Legacy paths Field -**Deprecated**: The semicolon-separated `paths` field has been replaced by `context.focus_paths` array. 
- -**Old Format** (no longer used): -```json -"paths": "src/auth;tests/auth;config/auth.json;src/middleware/auth.ts" -``` - -**New Format** (use this instead): -```json -"context": { - "focus_paths": ["src/auth", "tests/auth", "config/auth.json", "src/middleware/auth.ts"] -} -``` - -## Validation Rules - -### Pre-execution Checks -1. Task exists and is valid JSON -2. Task status allows operation -3. Dependencies are met -4. Active workflow session exists -5. All 6 core fields present (id, title, status, meta, context, flow_control) -6. Total task count ≤ 10 (hard limit) -7. File cohesion maintained in focus_paths - -### Hierarchy Validation -- Parent-child relationships valid -- Maximum depth not exceeded -- Container tasks have subtasks -- No circular dependencies - -## Error Handling Patterns - -### Common Errors -- **Task not found**: Check ID format and session -- **Invalid status**: Verify task can be operated on -- **Missing session**: Ensure active workflow exists -- **Max depth exceeded**: Restructure hierarchy -- **Missing implementation**: Complete required fields - -### Recovery Strategies -- Session validation with clear guidance -- Automatic ID correction suggestions -- Implementation field completion prompts -- Hierarchy restructuring options \ No newline at end of file diff --git a/.ccw/workflows/tool-strategy.md b/.ccw/workflows/tool-strategy.md deleted file mode 100644 index 84a13fcd..00000000 --- a/.ccw/workflows/tool-strategy.md +++ /dev/null @@ -1,216 +0,0 @@ -# Tool Strategy - When to Use What - -> **Focus**: Decision triggers and selection logic, NOT syntax (already registered with Claude) - -## Quick Decision Tree - -``` -Need context? -├─ Exa available? → Use Exa (fastest, most comprehensive) -├─ Large codebase (>500 files)? → codex_lens -├─ Known files (<5)? → Read tool -└─ Unknown files? → smart_search → Read tool - -Need to modify files? -├─ Built-in Edit fails? → mcp__ccw-tools__edit_file -└─ Still fails?
→ mcp__ccw-tools__write_file - -Need to search? -├─ Semantic/concept search? → smart_search (mode=semantic) -├─ Exact pattern match? → Grep tool -└─ Multiple search modes needed? → smart_search (mode=auto) -``` - ---- - -## 1. Context Gathering Tools - -### Exa (`mcp__exa__get_code_context_exa`) - -**Use When**: -- ✅ Researching external APIs, libraries, frameworks -- ✅ Need recent documentation (post-cutoff knowledge) -- ✅ Looking for implementation examples in public repos -- ✅ Comparing architectural patterns across projects - -**Don't Use When**: -- ❌ Searching internal codebase (use smart_search/codex_lens) -- ❌ Files already in working directory (use Read) - -**Trigger Indicators**: -- User mentions specific library/framework names -- Questions about "best practices", "how does X work" -- Need to verify current API signatures - ---- - -### read_file (`mcp__ccw-tools__read_file`) - -**Use When**: -- ✅ Reading multiple related files at once (batch reading) -- ✅ Need directory traversal with pattern matching -- ✅ Searching file content with regex (`contentPattern`) -- ✅ Want to limit depth/file count for large directories - -**Don't Use When**: -- ❌ Single file read → Use built-in Read tool (faster) -- ❌ Unknown file locations → Use smart_search first -- ❌ Need semantic search → Use smart_search or codex_lens - -**Trigger Indicators**: -- Need to read "all TypeScript files in src/" -- Need to find "files containing TODO comments" -- Want to read "up to 20 config files" - -**Advantages over Built-in Read**: -- Batch operation (multiple files in one call) -- Pattern-based filtering (glob + content regex) -- Directory traversal with depth control - ---- - -### codex_lens (`mcp__ccw-tools__codex_lens`) - -**Use When**: -- ✅ Large codebase (>500 files) requiring repeated searches -- ✅ Need semantic understanding of code relationships -- ✅ Working across multiple sessions (persistent index) -- ✅ Symbol-level navigation needed - -**Don't Use When**: -- ❌ Small project 
(<100 files) → Use smart_search (no indexing overhead) -- ❌ One-time search → Use smart_search or Grep -- ❌ Files change frequently → Indexing overhead not worth it - -**Trigger Indicators**: -- "Find all implementations of interface X" -- "What calls this function across the codebase?" -- Multi-session workflow on same codebase - -**Action Selection**: -- `init`: First time in new codebase -- `search`: Find code patterns -- `search_files`: Find files by path/name pattern -- `symbol`: Get symbols in specific file -- `status`: Check if index exists/is stale -- `clean`: Remove stale index - ---- - -### smart_search (`mcp__ccw-tools__smart_search`) - -**Use When**: -- ✅ Don't know exact file locations -- ✅ Need concept/semantic search ("authentication logic") -- ✅ Medium-sized codebase (100-500 files) -- ✅ One-time or infrequent searches - -**Don't Use When**: -- ❌ Known exact file path → Use Read directly -- ❌ Large codebase + repeated searches → Use codex_lens -- ❌ Exact pattern match → Use Grep (faster) - -**Mode Selection**: -- `auto`: Let tool decide (default, safest) -- `exact`: Know exact pattern, need fast results -- `fuzzy`: Typo-tolerant file/symbol names -- `semantic`: Concept-based ("error handling", "data validation") -- `graph`: Dependency/relationship analysis - -**Trigger Indicators**: -- "Find files related to user authentication" -- "Where is the payment processing logic?" -- "Locate database connection setup" - ---- - -## 2. 
File Modification Tools - -### edit_file (`mcp__ccw-tools__edit_file`) - -**Use When**: -- ✅ Built-in Edit tool failed 1+ times -- ✅ Need dry-run preview before applying -- ✅ Need line-based operations (insert_after, insert_before) -- ✅ Need to replace all occurrences - -**Don't Use When**: -- ❌ Built-in Edit hasn't failed yet → Try built-in first -- ❌ Need to create new file → Use write_file - -**Trigger Indicators**: -- Built-in Edit returns "old_string not found" -- Built-in Edit fails due to whitespace/formatting -- Need to verify changes before applying (dryRun=true) - -**Mode Selection**: -- `mode=update`: Replace text (similar to built-in Edit) -- `mode=line`: Line-based operations (insert_after, insert_before, delete) - ---- - -### write_file (`mcp__ccw-tools__write_file`) - -**Use When**: -- ✅ Creating brand new files -- ✅ MCP edit_file still fails (last resort) -- ✅ Need to completely replace file content -- ✅ Need backup before overwriting - -**Don't Use When**: -- ❌ File exists + small change → Use Edit tools -- ❌ Built-in Edit hasn't been tried → Try built-in Edit first - -**Trigger Indicators**: -- All Edit attempts failed -- Need to create new file with specific content -- User explicitly asks to "recreate file" - ---- - -## 3. Decision Logic - -### File Reading Priority - -``` -1. Known single file? → Built-in Read -2. Multiple files OR pattern matching? → mcp__ccw-tools__read_file -3. Unknown location? → smart_search, then Read -4. Large codebase + repeated access? → codex_lens -``` - -### File Editing Priority - -``` -1. Always try built-in Edit first -2. Fails 1+ times? → mcp__ccw-tools__edit_file -3. Still fails? → mcp__ccw-tools__write_file (last resort) -``` - -### Search Tool Priority - -``` -1. External knowledge? → Exa -2. Exact pattern in small codebase? → Built-in Grep -3. Semantic/unknown location? → smart_search -4. Large codebase + repeated searches? → codex_lens -``` - ---- - -## 4. 
Anti-Patterns - -**Don't**: -- Use codex_lens for one-time searches in small projects -- Use smart_search when file path is already known -- Use write_file before trying Edit tools -- Use Exa for internal codebase searches -- Use read_file for single file when Read tool works - -**Do**: -- Start with simplest tool (Read, Edit, Grep) -- Escalate to MCP tools when built-ins fail -- Use semantic search (smart_search) for exploratory tasks -- Use indexed search (codex_lens) for large, stable codebases -- Use Exa for external/public knowledge - diff --git a/.ccw/workflows/workflow-architecture.md b/.ccw/workflows/workflow-architecture.md deleted file mode 100644 index ac134ab8..00000000 --- a/.ccw/workflows/workflow-architecture.md +++ /dev/null @@ -1,942 +0,0 @@ -# Workflow Architecture - -## Overview - -This document defines the complete workflow system architecture using a **JSON-only data model**, **marker-based session management**, and **unified file structure** with dynamic task decomposition. - -## Core Architecture - -### JSON-Only Data Model -**JSON files (.task/IMPL-*.json) are the only authoritative source of task state. 
All markdown documents are read-only generated views.** - -- **Task State**: Stored exclusively in JSON files -- **Documents**: Generated on-demand from JSON data -- **No Synchronization**: Eliminates bidirectional sync complexity -- **Performance**: Direct JSON access without parsing overhead - -### Key Design Decisions -- **JSON files are the single source of truth** - All markdown documents are read-only generated views -- **Marker files for session tracking** - Ultra-simple active session management -- **Unified file structure definition** - Same structure template for all workflows, created on-demand -- **Dynamic task decomposition** - Subtasks created as needed during execution -- **On-demand file creation** - Directories and files created only when required -- **Agent-agnostic task definitions** - Complete context preserved for autonomous execution - -## Session Management - -### Directory-Based Session Management -**Simple Location-Based Tracking**: Sessions in `.workflow/active/` directory - -```bash -.workflow/ -├── active/ -│ ├── WFS-oauth-integration/ # Active session directory -│ ├── WFS-user-profile/ # Active session directory -│ └── WFS-bug-fix-123/ # Active session directory -└── archives/ - └── WFS-old-feature/ # Archived session (completed) -``` - - -### Session Operations - -#### Detect Active Session(s) -```bash -active_sessions=$(find .workflow/active/ -name "WFS-*" -type d 2>/dev/null) -count=$(echo "$active_sessions" | wc -l) - -if [ -z "$active_sessions" ]; then - echo "No active session" -elif [ "$count" -eq 1 ]; then - session_name=$(basename "$active_sessions") - echo "Active session: $session_name" -else - echo "Multiple sessions found:" - echo "$active_sessions" | while read session_dir; do - session=$(basename "$session_dir") - echo " - $session" - done - echo "Please specify which session to work with" -fi -``` - -#### Archive Session -```bash -mv .workflow/active/WFS-feature .workflow/archives/WFS-feature -``` - -### Session State 
Tracking -Each session directory contains `workflow-session.json`: - -```json -{ - "session_id": "WFS-[topic-slug]", - "project": "feature description", - "type": "simple|medium|complex", - "current_phase": "PLAN|IMPLEMENT|REVIEW", - "status": "active|paused|completed", - "progress": { - "completed_phases": ["PLAN"], - "current_tasks": ["IMPL-1", "IMPL-2"] - } -} -``` - -## Task System - -### Hierarchical Task Structure -**Maximum Depth**: 2 levels (IMPL-N.M format) - -``` -IMPL-1 # Main task -IMPL-1.1 # Subtask of IMPL-1 (dynamically created) -IMPL-1.2 # Another subtask of IMPL-1 -IMPL-2 # Another main task -IMPL-2.1 # Subtask of IMPL-2 (dynamically created) -``` - -**Task Status Rules**: -- **Container tasks**: Parent tasks with subtasks (cannot be directly executed) -- **Leaf tasks**: Only these can be executed directly -- **Status inheritance**: Parent status derived from subtask completion - -### Enhanced Task JSON Schema -All task files use this unified 6-field schema with optional artifacts enhancement: - -```json -{ - "id": "IMPL-1.2", - "title": "Implement JWT authentication", - "status": "pending|active|completed|blocked|container", - "context_package_path": ".workflow/WFS-session/.process/context-package.json", - - "meta": { - "type": "feature|bugfix|refactor|test-gen|test-fix|docs", - "agent": "@code-developer|@action-planning-agent|@test-fix-agent|@universal-executor" - }, - - "context": { - "requirements": ["JWT authentication", "OAuth2 support"], - "focus_paths": ["src/auth", "tests/auth", "config/auth.json"], - "acceptance": ["JWT validation works", "OAuth flow complete"], - "parent": "IMPL-1", - "depends_on": ["IMPL-1.1"], - "inherited": { - "from": "IMPL-1", - "context": ["Authentication system design completed"] - }, - "shared_context": { - "auth_strategy": "JWT with refresh tokens" - }, - "artifacts": [ - { - "type": "role_analyses", - "source": "brainstorm_clarification", - "path": ".workflow/WFS-session/.brainstorming/*/analysis*.md", - 
"priority": "highest", - "contains": "role_specific_requirements_and_design" - } - ] - }, - - "flow_control": { - "pre_analysis": [ - { - "step": "check_patterns", - "action": "Analyze existing patterns", - "command": "bash(rg 'auth' [focus_paths] | head -10)", - "output_to": "patterns" - }, - { - "step": "analyze_architecture", - "action": "Review system architecture", - "command": "gemini \"analyze patterns: [patterns]\"", - "output_to": "design" - }, - { - "step": "check_deps", - "action": "Check dependencies", - "command": "bash(echo [depends_on] | xargs cat)", - "output_to": "context" - } - ], - "implementation_approach": [ - { - "step": 1, - "title": "Set up authentication infrastructure", - "description": "Install JWT library and create auth config following [design] patterns from [parent]", - "modification_points": [ - "Add JWT library dependencies to package.json", - "Create auth configuration file using [parent] patterns" - ], - "logic_flow": [ - "Install jsonwebtoken library via npm", - "Configure JWT secret and expiration from [inherited]", - "Export auth config for use by [jwt_generator]" - ], - "depends_on": [], - "output": "auth_config" - }, - { - "step": 2, - "title": "Implement JWT generation", - "description": "Create JWT token generation logic using [auth_config] and [inherited] validation patterns", - "modification_points": [ - "Add JWT generation function in auth service", - "Implement token signing with [auth_config]" - ], - "logic_flow": [ - "User login → validate credentials with [inherited]", - "Generate JWT payload with user data", - "Sign JWT using secret from [auth_config]", - "Return signed token" - ], - "depends_on": [1], - "output": "jwt_generator" - }, - { - "step": 3, - "title": "Implement JWT validation middleware", - "description": "Create middleware to validate JWT tokens using [auth_config] and [shared] rules", - "modification_points": [ - "Create validation middleware using [jwt_generator]", - "Add token verification using 
[shared] rules", - "Implement user attachment to request object" - ], - "logic_flow": [ - "Protected route → extract JWT from Authorization header", - "Validate token signature using [auth_config]", - "Check token expiration and [shared] rules", - "Decode payload and attach user to request", - "Call next() or return 401 error" - ], - "command": "bash(npm test -- middleware.test.ts)", - "depends_on": [1, 2], - "output": "auth_middleware" - } - ], - "target_files": [ - "src/auth/login.ts:handleLogin:75-120", - "src/middleware/auth.ts:validateToken", - "src/auth/PasswordReset.ts" - ] - } -} -``` - -### Focus Paths & Context Management - -#### Context Package Path (Top-Level Field) -The **context_package_path** field provides the location of the smart context package: -- **Location**: Top-level field (not in `artifacts` array) -- **Path**: `.workflow/WFS-session/.process/context-package.json` -- **Purpose**: References the comprehensive context package containing project structure, dependencies, and brainstorming artifacts catalog -- **Usage**: Loaded in `pre_analysis` steps via `Read({{context_package_path}})` - -#### Focus Paths Format -The **focus_paths** field specifies concrete project paths for task implementation: -- **Array of strings**: `["folder1", "folder2", "specific_file.ts"]` -- **Concrete paths**: Use actual directory/file names without wildcards -- **Mixed types**: Can include both directories and specific files -- **Relative paths**: From project root (e.g., `src/auth`, not `./src/auth`) - -#### Artifacts Field ⚠️ NEW FIELD -Optional field referencing brainstorming outputs for task execution: - -```json -"artifacts": [ - { - "type": "role_analyses|topic_framework|individual_role_analysis", - "source": "brainstorm_clarification|brainstorm_framework|brainstorm_roles", - "path": ".workflow/WFS-session/.brainstorming/document.md", - "priority": "highest|high|medium|low" - } -] -``` - -**Types & Priority**: role_analyses (highest) → topic_framework (medium) 
→ individual_role_analysis (low) - -#### Flow Control Configuration -The **flow_control** field manages task execution through structured sequential steps. For complete format specifications and usage guidelines, see [Flow Control Format Guide](#flow-control-format-guide) below. - -**Quick Reference**: -- **pre_analysis**: Context gathering steps (supports multiple command types) -- **implementation_approach**: Implementation steps array with dependency management -- **target_files**: Target files for modification (file:function:lines format) -- **Variable references**: Use `[variable_name]` to reference step outputs -- **Tool integration**: Supports Gemini, Codex, Bash commands, and MCP tools - -## Flow Control Format Guide - -The `[FLOW_CONTROL]` marker indicates that a task or prompt contains flow control steps for sequential execution. There are **two distinct formats** used in different scenarios: - -### Format Comparison Matrix - -| Aspect | Inline Format | JSON Format | -|--------|--------------|-------------| -| **Used In** | Brainstorm workflows | Implementation tasks | -| **Agent** | conceptual-planning-agent | code-developer, test-fix-agent, doc-generator | -| **Location** | Task() prompt (markdown) | .task/IMPL-*.json file | -| **Persistence** | Temporary (prompt-only) | Persistent (file storage) | -| **Complexity** | Simple (3-5 steps) | Complex (10+ steps) | -| **Dependencies** | None | Full `depends_on` support | -| **Purpose** | Load brainstorming context | Implement task with preparation | - -### Inline Format (Brainstorm) - -**Marker**: `[FLOW_CONTROL]` written directly in Task() prompt - -**Structure**: Markdown list format - -**Used By**: Brainstorm commands (`auto-parallel.md`, role commands) - -**Agent**: `conceptual-planning-agent` - -**Example**: -```markdown -[FLOW_CONTROL] - -### Flow Control Steps -**AGENT RESPONSIBILITY**: Execute these pre_analysis steps sequentially with context accumulation: - -1. 
**load_topic_framework** - - Action: Load structured topic discussion framework - - Command: Read(.workflow/WFS-{session}/.brainstorming/guidance-specification.md) - - Output: topic_framework - -2. **load_role_template** - - Action: Load role-specific planning template - - Command: bash(cat ~/.ccw/workflows/cli-templates/planning-roles/{role}.md) - - Output: role_template - -3. **load_session_metadata** - - Action: Load session metadata and topic description - - Command: bash(cat .workflow/WFS-{session}/workflow-session.json 2>/dev/null || echo '{}') - - Output: session_metadata -``` - -**Characteristics**: -- 3-5 simple context loading steps -- Written directly in prompt (not persistent) -- No dependency management between steps -- Used for temporary context preparation -- Variables: `[variable_name]` for output references - -### JSON Format (Implementation) - -**Marker**: `[FLOW_CONTROL]` used in TodoWrite or documentation to indicate the task has flow control - -**Structure**: Complete JSON structure in task file - -**Used By**: Implementation tasks (IMPL-*.json) - -**Agents**: `code-developer`, `test-fix-agent`, `doc-generator` - -**Example**: -```json -"flow_control": { - "pre_analysis": [ - { - "step": "load_role_analyses", - "action": "Load role analysis documents from brainstorming", - "commands": [ - "bash(ls .workflow/WFS-{session}/.brainstorming/*/analysis*.md 2>/dev/null || echo 'not found')", - "Glob(.workflow/WFS-{session}/.brainstorming/*/analysis*.md)", - "Read(each discovered role analysis file)" - ], - "output_to": "role_analyses", - "on_error": "skip_optional" - }, - { - "step": "local_codebase_exploration", - "action": "Explore codebase using local search", - "commands": [ - "bash(rg '^(function|class|interface).*auth' --type ts -n --max-count 15)", - "bash(find .
-name '*auth*' -type f | grep -v node_modules | head -10)" - ], - "output_to": "codebase_structure" - } - ], - "implementation_approach": [ - { - "step": 1, - "title": "Setup infrastructure", - "description": "Install JWT library and create config following [role_analyses]", - "modification_points": [ - "Add JWT library dependencies to package.json", - "Create auth configuration file" - ], - "logic_flow": [ - "Install jsonwebtoken library via npm", - "Configure JWT secret from [role_analyses]", - "Export auth config for use by [jwt_generator]" - ], - "depends_on": [], - "output": "auth_config" - }, - { - "step": 2, - "title": "Implement JWT generation", - "description": "Create JWT token generation logic using [auth_config]", - "modification_points": [ - "Add JWT generation function in auth service", - "Implement token signing with [auth_config]" - ], - "logic_flow": [ - "User login → validate credentials", - "Generate JWT payload with user data", - "Sign JWT using secret from [auth_config]", - "Return signed token" - ], - "depends_on": [1], - "output": "jwt_generator" - } - ], - "target_files": [ - "src/auth/login.ts:handleLogin:75-120", - "src/middleware/auth.ts:validateToken" - ] -} -``` - -**Characteristics**: -- Persistent storage in .task/IMPL-*.json files -- Complete dependency management (`depends_on` arrays) -- Two-phase structure: `pre_analysis` + `implementation_approach` -- Error handling strategies (`on_error` field) -- Target file specifications -- Variables: `[variable_name]` for cross-step references - -### JSON Format Field Specifications - -#### pre_analysis Field -**Purpose**: Context gathering phase before implementation - -**Structure**: Array of step objects with sequential execution - -**Step Fields**: -- **step**: Step identifier (string, e.g., "load_role_analyses") -- **action**: Human-readable description of the step -- **command** or **commands**: Single command string or array of command strings -- **output_to**: Variable name for 
storing step output -- **on_error**: Error handling strategy (`skip_optional`, `fail`, `retry_once`, `manual_intervention`) - -**Command Types Supported**: -- **Bash commands**: `bash(command)` - Any shell command -- **Tool calls**: `Read(file)`, `Glob(pattern)`, `Grep(pattern)` -- **MCP tools**: `mcp__exa__get_code_context_exa()`, `mcp__exa__web_search_exa()` -- **CLI commands**: `gemini`, `qwen`, `codex --full-auto exec` - -**Example**: -```json -{ - "step": "load_context", - "action": "Load project context and patterns", - "commands": [ - "bash(ccw tool exec get_modules_by_depth '{}')", - "Read(CLAUDE.md)" - ], - "output_to": "project_structure", - "on_error": "skip_optional" -} -``` - -#### implementation_approach Field -**Purpose**: Define implementation steps with dependency management - -**Structure**: Array of step objects (NOT object format) - -**Step Fields (All Required)**: -- **step**: Unique step number (1, 2, 3, ...) - serves as step identifier -- **title**: Brief step title -- **description**: Comprehensive implementation description with context variable references -- **modification_points**: Array of specific code modification targets -- **logic_flow**: Array describing business logic execution sequence -- **depends_on**: Array of step numbers this step depends on (e.g., `[1]`, `[1, 2]`) - empty array `[]` for independent steps -- **output**: Output variable name that can be referenced by subsequent steps via `[output_name]` - -**Optional Fields**: -- **command**: Command for step execution (supports any shell command or CLI tool) - - When omitted: Agent interprets modification_points and logic_flow to execute - - When specified: Command executes the step directly - -**Execution Modes**: -- **Default (without command)**: Agent executes based on modification_points and logic_flow -- **With command**: Specified command handles execution - -**Command Field Usage**: -- **Default approach**: Omit command field - let agent execute autonomously -- **CLI 
tools (codex/gemini/qwen)**: Add ONLY when user explicitly requests CLI tool usage -- **Simple commands**: Can include bash commands, test commands, validation scripts -- **Complex workflows**: Use command for multi-step operations or tool coordination - -**Command Format Examples** (only when explicitly needed): -```json -// Simple Bash -"command": "bash(npm install package)" -"command": "bash(npm test)" - -// Validation -"command": "bash(test -f config.ts && grep -q 'JWT_SECRET' config.ts)" - -// Codex (user requested) -"command": "codex -C path --full-auto exec \"task\" --skip-git-repo-check -s danger-full-access" - -// Codex Resume (user requested, maintains context) -"command": "codex --full-auto exec \"task\" resume --last --skip-git-repo-check -s danger-full-access" - -// Gemini (user requested) -"command": "gemini \"analyze [context]\"" - -// Qwen (fallback for Gemini) -"command": "qwen \"analyze [context]\"" -``` - -**Example Step**: -```json -{ - "step": 2, - "title": "Implement JWT generation", - "description": "Create JWT token generation logic using [auth_config]", - "modification_points": [ - "Add JWT generation function in auth service", - "Implement token signing with [auth_config]" - ], - "logic_flow": [ - "User login → validate credentials", - "Generate JWT payload with user data", - "Sign JWT using secret from [auth_config]", - "Return signed token" - ], - "depends_on": [1], - "output": "jwt_generator" -} -``` - -#### target_files Field -**Purpose**: Specify files to be modified or created - -**Format**: Array of strings -- **Existing files**: `"file:function:lines"` (e.g., `"src/auth/login.ts:handleLogin:75-120"`) -- **New files**: `"path/to/NewFile.ts"` (file path only) - -### Tool Reference - -**Available Command Types**: - -**Gemini CLI**: -```bash -gemini "prompt" -gemini --approval-mode yolo "prompt" # For write mode -``` - -**Qwen CLI** (Gemini fallback): -```bash -qwen "prompt" -qwen --approval-mode yolo "prompt" # For write mode -``` - 
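Since Qwen is documented as the Gemini fallback, the two CLIs can be wrapped in a small dispatch helper. The sketch below is a minimal illustration, not part of the workflow tooling: the `run_analysis` name is hypothetical, and it assumes both CLIs accept the prompt as their first argument.

```shell
# Hypothetical helper illustrating the Gemini -> Qwen fallback order.
# Assumes both CLIs take the prompt as their first argument.
run_analysis() {
  prompt="$1"
  if command -v gemini >/dev/null 2>&1; then
    gemini "$prompt"
  else
    qwen "$prompt"
  fi
}
```

The same pattern extends to write mode by appending `--approval-mode yolo` to whichever CLI is selected.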
-**Codex CLI**: -```bash -codex -C directory --full-auto exec "task" --skip-git-repo-check -s danger-full-access -codex --full-auto exec "task" resume --last --skip-git-repo-check -s danger-full-access -``` - -**Built-in Tools**: -- `Read(file_path)` - Read file contents -- `Glob(pattern)` - Find files by pattern -- `Grep(pattern)` - Search content with regex -- `bash(command)` - Execute bash command - -**MCP Tools**: -- `mcp__exa__get_code_context_exa(query="...")` - Get code context from Exa -- `mcp__exa__web_search_exa(query="...")` - Web search via Exa - -**Bash Commands**: -```bash -bash(rg 'pattern' src/) -bash(find . -name "*.ts") -bash(npm test) -bash(git log --oneline | head -5) -``` - -### Variable System & Context Flow - -**Variable Reference Syntax**: -Both formats use `[variable_name]` syntax for referencing outputs from previous steps. - -**Variable Types**: -- **Step outputs**: `[step_output_name]` - Reference any pre_analysis step output -- **Task properties**: `[task_property]` - Reference any task context field -- **Previous results**: `[analysis_result]` - Reference accumulated context -- **Implementation outputs**: Reference outputs from previous implementation steps - -**Examples**: -```json -// Reference pre_analysis output -"description": "Install JWT library following [role_analyses]" - -// Reference previous step output -"description": "Create middleware using [auth_config] and [jwt_generator]" - -// Reference task context -"command": "bash(cd [focus_paths] && npm test)" -``` - -**Context Accumulation Process**: -1. **Structure Analysis**: `get_modules_by_depth.sh` → project hierarchy -2. **Pattern Analysis**: Tool-specific commands → existing patterns -3. **Dependency Mapping**: Previous task summaries → inheritance context -4. 
**Task Context Generation**: Combined analysis → task.context fields - -**Context Inheritance Rules**: -- **Parent → Child**: Container tasks pass context via `context.inherited` -- **Dependency → Dependent**: Previous task summaries via `context.depends_on` -- **Session → Task**: Global session context included in all tasks -- **Module → Feature**: Module patterns inform feature implementation - -### Agent Processing Rules - -**conceptual-planning-agent** (Inline Format): -- Parses markdown list from prompt -- Executes 3-5 simple loading steps -- No dependency resolution needed -- Accumulates context in variables -- Used only in brainstorm workflows - -**code-developer, test-fix-agent** (JSON Format): -- Loads complete task JSON from file -- Executes `pre_analysis` steps sequentially -- Processes `implementation_approach` with dependency resolution -- Handles complex variable substitution -- Updates task status in JSON file - -### Usage Guidelines - -**Use Inline Format When**: -- Running brainstorm workflows -- Need 3-5 simple context loading steps -- No persistence required -- No dependencies between steps -- Temporary context preparation - -**Use JSON Format When**: -- Implementing features or tasks -- Need 10+ complex execution steps -- Require dependency management -- Need persistent task definitions -- Complex variable flow between steps -- Error handling strategies needed - -### Variable Reference Syntax - -Both formats use `[variable_name]` syntax for referencing outputs: - -**Inline Format**: -```markdown -2. **analyze_context** - - Action: Analyze using [topic_framework] and [role_template] - - Output: analysis_results -``` - -**JSON Format**: -```json -{ - "step": 2, - "description": "Implement following [role_analyses] and [codebase_structure]", - "depends_on": [1], - "output": "implementation" -} -``` - -### Task Validation Rules -1. **ID Uniqueness**: All task IDs must be unique -2. 
**Hierarchical Format**: Must follow IMPL-N[.M] pattern (maximum 2 levels) -3. **Parent References**: All parent IDs must exist as JSON files -4. **Status Consistency**: Status values must come from the defined enumeration -5. **Required Fields**: All 6 core fields must be present (id, title, status, meta, context, flow_control) -6. **Focus Paths Structure**: context.focus_paths must contain concrete paths (no wildcards) -7. **Flow Control Format**: pre_analysis must be an array of steps with required fields -8. **Dependency Integrity**: All task-level depends_on references must exist as JSON files -9. **Artifacts Structure**: context.artifacts (optional) must use valid type, priority, and path format -10. **Implementation Steps Array**: implementation_approach must be an array of step objects -11. **Step Number Uniqueness**: All step numbers within a task must be unique and sequential (1, 2, 3, ...) -12. **Step Dependencies**: All step-level depends_on numbers must reference valid steps within the same task -13. **Step Sequence**: Step numbers should match array order (first item step=1, second item step=2, etc.) -14. **Step Required Fields**: Each step must have step, title, description, modification_points, logic_flow, depends_on, output -15. **Step Optional Fields**: The command field is optional - when omitted, the agent executes based on modification_points and logic_flow - -## Workflow Structure - -### Unified File Structure -All workflows use the same file structure definition regardless of complexity. **Directories and files are created on-demand as needed**, not all at once during initialization.
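The on-demand creation strategy can be sketched in a few lines of shell. This is a minimal illustration, not part of the workflow tooling; the `WFS-demo` session name is hypothetical.

```shell
# Hypothetical sketch of on-demand creation (run from the project root).
SESSION=".workflow/active/WFS-demo"

# Initial setup creates only the required files:
mkdir -p "$SESSION/.task"
echo '{"session_id":"WFS-demo","type":"simple","status":"active"}' > "$SESSION/workflow-session.json"
echo '# Implementation Plan' > "$SESSION/IMPL_PLAN.md"
echo '# Tasks' > "$SESSION/TODO_LIST.md"

# Optional directories are created the first time they are needed:
[ -d "$SESSION/.summaries" ] || mkdir -p "$SESSION/.summaries"  # when the first task completes
[ -d "$SESSION/.chat" ] || mkdir -p "$SESSION/.chat"            # when an analysis command runs
```

Everything else in the structure reference below follows the same lazy pattern: nothing is created until a workflow phase actually writes to it.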
- -#### Complete Structure Reference -``` -.workflow/ -├── [.scratchpad/] # Non-session-specific outputs (created when needed) -│ ├── analyze-*-[timestamp].md # One-off analysis results -│ ├── chat-*-[timestamp].md # Standalone chat sessions -│ ├── plan-*-[timestamp].md # Ad-hoc planning notes -│ ├── bug-index-*-[timestamp].md # Quick bug analyses -│ ├── code-analysis-*-[timestamp].md # Standalone code analysis -│ ├── execute-*-[timestamp].md # Ad-hoc implementation logs -│ └── codex-execute-*-[timestamp].md # Multi-stage execution logs -│ -├── [design-run-*/] # Standalone UI design outputs (created when needed) -│ └── (timestamped)/ # Timestamped design runs without session -│ ├── .intermediates/ # Intermediate analysis files -│ │ ├── style-analysis/ # Style analysis data -│ │ │ ├── computed-styles.json # Extracted CSS values -│ │ │ └── design-space-analysis.json # Design directions -│ │ └── layout-analysis/ # Layout analysis data -│ │ ├── dom-structure-{target}.json # DOM extraction -│ │ └── inspirations/ # Layout research -│ │ └── {target}-layout-ideas.txt -│ ├── style-extraction/ # Final design systems -│ │ ├── style-1/ # design-tokens.json, style-guide.md -│ │ └── style-N/ -│ ├── layout-extraction/ # Layout templates -│ │ └── layout-templates.json -│ ├── prototypes/ # Generated HTML/CSS prototypes -│ │ ├── {target}-style-{s}-layout-{l}.html # Final prototypes -│ │ ├── compare.html # Interactive matrix view -│ │ └── index.html # Navigation page -│ └── .run-metadata.json # Run configuration -│ -├── active/ # Active workflow sessions -│ └── WFS-[topic-slug]/ -│ ├── workflow-session.json # Session metadata and state (REQUIRED) -│ ├── [.brainstorming/] # Optional brainstorming phase (created when needed) -│ ├── [.chat/] # CLI interaction sessions (created when analysis is run) -│ │ ├── chat-*.md # Saved chat sessions -│ │ └── analysis-*.md # Analysis results -│ ├── [.process/] # Planning analysis results (created by /workflow-plan) -│ │ └── ANALYSIS_RESULTS.md # 
Analysis results and planning artifacts -│ ├── IMPL_PLAN.md # Planning document (REQUIRED) -│ ├── TODO_LIST.md # Progress tracking (REQUIRED) -│ ├── [.summaries/] # Task completion summaries (created when tasks complete) -│ │ ├── IMPL-*-summary.md # Main task summaries -│ │ └── IMPL-*.*-summary.md # Subtask summaries -│ ├── [.review/] # Code review results (created by review commands) -│ │ ├── review-metadata.json # Review configuration and scope -│ │ ├── review-state.json # Review state machine -│ │ ├── review-progress.json # Real-time progress tracking -│ │ ├── dimensions/ # Per-dimension analysis results -│ │ ├── iterations/ # Deep-dive iteration results -│ │ ├── reports/ # Human-readable reports and CLI outputs -│ │ ├── REVIEW-SUMMARY.md # Final consolidated summary -│ │ └── dashboard.html # Interactive review dashboard -│ ├── [design-*/] # UI design outputs (created by ui-design workflows) -│ │ ├── .intermediates/ # Intermediate analysis files -│ │ │ ├── style-analysis/ # Style analysis data -│ │ │ │ ├── computed-styles.json # Extracted CSS values -│ │ │ │ └── design-space-analysis.json # Design directions -│ │ │ └── layout-analysis/ # Layout analysis data -│ │ │ ├── dom-structure-{target}.json # DOM extraction -│ │ │ └── inspirations/ # Layout research -│ │ │ └── {target}-layout-ideas.txt -│ │ ├── style-extraction/ # Final design systems -│ │ │ ├── style-1/ # design-tokens.json, style-guide.md -│ │ │ └── style-N/ -│ │ ├── layout-extraction/ # Layout templates -│ │ │ └── layout-templates.json -│ │ ├── prototypes/ # Generated HTML/CSS prototypes -│ │ │ ├── {target}-style-{s}-layout-{l}.html # Final prototypes -│ │ │ ├── compare.html # Interactive matrix view -│ │ │ └── index.html # Navigation page -│ │ └── .run-metadata.json # Run configuration -│ └── .task/ # Task definitions (REQUIRED) -│ ├── IMPL-*.json # Main task definitions -│ └── IMPL-*.*.json # Subtask definitions (created dynamically) -└── archives/ # Completed workflow sessions - └── 
WFS-[completed-topic]/ # Archived session directories -``` - -#### Creation Strategy -- **Initial Setup**: Create only `workflow-session.json`, `IMPL_PLAN.md`, `TODO_LIST.md`, and `.task/` directory -- **On-Demand Creation**: Other directories created when first needed -- **Dynamic Files**: Subtask JSON files created during task decomposition -- **Scratchpad Usage**: `.scratchpad/` created when CLI commands run without active session -- **Design Usage**: `design-{timestamp}/` created by UI design workflows in `.workflow/` directly for standalone design runs -- **Review Usage**: `.review/` created by review commands (`/workflow:review-module-cycle`, `/workflow:review-session-cycle`) for comprehensive code quality analysis -- **Intermediate Files**: `.intermediates/` contains analysis data (style/layout) separate from final deliverables -- **Layout Templates**: `layout-extraction/layout-templates.json` contains structural templates for UI assembly - -#### Scratchpad Directory (.scratchpad/) -**Purpose**: Centralized location for non-session-specific CLI outputs - -**When to Use**: -1. **No Active Session**: CLI analysis/chat commands run without an active workflow session -2. **Unrelated Analysis**: Quick analysis not related to current active session -3. **Exploratory Work**: Ad-hoc investigation before creating formal workflow -4. 
**One-Off Queries**: Standalone questions or debugging without workflow context - -**Output Routing Logic**: -- **IF** active session exists in `.workflow/active/` AND command is session-relevant: - - Save to `.workflow/active/WFS-[id]/.chat/[command]-[timestamp].md` -- **ELSE** (no session OR one-off analysis): - - Save to `.workflow/.scratchpad/[command]-[description]-[timestamp].md` - -**File Naming Pattern**: `[command-type]-[brief-description]-[timestamp].md` - -**Examples**: - -*Workflow Commands (lightweight):* -- `/workflow-lite-plan "feature idea"` (exploratory) → `.scratchpad/lite-plan-feature-idea-20250105-143110.md` -- `/workflow:lite-fix "bug description"` (bug fixing) → `.scratchpad/lite-fix-bug-20250105-143130.md` - -> **Note**: Direct CLI commands (`/cli:analyze`, `/cli:execute`, etc.) have been replaced by semantic invocation and workflow commands. - -**Maintenance**: -- Periodically review and clean up old scratchpad files -- Promote useful analyses to formal workflow sessions if needed -- No automatic cleanup - manual management recommended - -### File Naming Conventions - -#### Session Identifiers -**Format**: `WFS-[topic-slug]` - -**WFS Prefix Meaning**: -- `WFS` = **W**ork**F**low **S**ession -- Identifies directories as workflow session containers -- Distinguishes workflow sessions from other project directories - -**Naming Rules**: -- Convert topic to lowercase with hyphens (e.g., "User Auth System" → `WFS-user-auth-system`) -- Add `-NNN` suffix only if conflicts exist (e.g., `WFS-payment-integration-002`) -- Maximum length: 50 characters including WFS- prefix - -#### Document Naming -- `workflow-session.json` - Session state (required) -- `IMPL_PLAN.md` - Planning document (required) -- `TODO_LIST.md` - Progress tracking (auto-generated when needed) -- Chat sessions: `chat-analysis-*.md` -- Task summaries: `IMPL-[task-id]-summary.md` - -### Document Templates - -#### TODO_LIST.md Template -```markdown -# Tasks: [Session Topic] - -## Task 
Progress -▸ **IMPL-001**: [Main Task Group] → [📋](./.task/IMPL-001.json) - - [ ] **IMPL-001.1**: [Subtask] → [📋](./.task/IMPL-001.1.json) - - [x] **IMPL-001.2**: [Subtask] → [📋](./.task/IMPL-001.2.json) | [✅](./.summaries/IMPL-001.2-summary.md) - -- [x] **IMPL-002**: [Simple Task] → [📋](./.task/IMPL-002.json) | [✅](./.summaries/IMPL-002-summary.md) - -## Status Legend -- `▸` = Container task (has subtasks) -- `- [ ]` = Pending leaf task -- `- [x]` = Completed leaf task -- Maximum 2 levels: Main tasks and subtasks only -``` - -## Operations Guide - -### Session Management -```bash -# Create minimal required structure -mkdir -p .workflow/active/WFS-topic-slug/.task -echo '{"session_id":"WFS-topic-slug",...}' > .workflow/active/WFS-topic-slug/workflow-session.json -echo '# Implementation Plan' > .workflow/active/WFS-topic-slug/IMPL_PLAN.md -echo '# Tasks' > .workflow/active/WFS-topic-slug/TODO_LIST.md -``` - -### Task Operations -```bash -# Create task -echo '{"id":"IMPL-1","title":"New task",...}' > .task/IMPL-1.json - -# Update task status -jq '.status = "active"' .task/IMPL-1.json > temp && mv temp .task/IMPL-1.json - -# Generate TODO list from JSON state -generate_todo_list_from_json .task/ -``` - -### Directory Creation (On-Demand) -```bash -mkdir -p .brainstorming # When brainstorming is initiated -mkdir -p .chat # When analysis commands are run -mkdir -p .summaries # When first task completes -``` - -### Session Consistency Checks & Recovery -```bash -# Validate session directory structure -if [ -d ".workflow/active/" ]; then - for session_dir in .workflow/active/WFS-*; do - if [ ! 
-f "$session_dir/workflow-session.json" ]; then - echo "⚠️ Missing workflow-session.json in $session_dir" - fi - done -fi -``` - -**Recovery Strategies**: -- **Missing Session File**: Recreate workflow-session.json from template -- **Corrupted Session File**: Restore from template with basic metadata -- **Broken Task Hierarchy**: Reconstruct parent-child relationships from task JSON files -- **Orphaned Sessions**: Move incomplete sessions to archives/ - -## Complexity Classification - -### Task Complexity Rules -**Complexity is determined by task count and decomposition needs:** - -| Complexity | Task Count | Hierarchy Depth | Decomposition Behavior | -|------------|------------|----------------|----------------------| -| **Simple** | <5 tasks | 1 level (IMPL-N) | Direct execution, minimal decomposition | -| **Medium** | 5-15 tasks | 2 levels (IMPL-N.M) | Moderate decomposition, context coordination | -| **Complex** | >15 tasks | 2 levels (IMPL-N.M) | Frequent decomposition, multi-agent orchestration | - -### Workflow Characteristics & Tool Guidance - -#### Simple Workflows -- **Examples**: Bug fixes, small feature additions, configuration changes -- **Task Decomposition**: Usually single-level tasks, minimal breakdown needed -- **Agent Coordination**: Direct execution without complex orchestration -- **Tool Strategy**: `bash()` commands, `grep()` for pattern matching - -#### Medium Workflows -- **Examples**: New features, API endpoints with integration, database schema changes -- **Task Decomposition**: Two-level hierarchy when decomposition is needed -- **Agent Coordination**: Context coordination between related tasks -- **Tool Strategy**: `gemini` for pattern analysis, `codex --full-auto` for implementation - -#### Complex Workflows -- **Examples**: Major features, architecture refactoring, security implementations, multi-service deployments -- **Task Decomposition**: Frequent use of two-level hierarchy with dynamic subtask creation -- **Agent Coordination**: 
Multi-agent orchestration with deep context analysis -- **Tool Strategy**: `gemini` for architecture analysis, `codex --full-auto` for complex problem solving, `bash()` commands for flexible analysis - -### Assessment & Upgrades -- **During Creation**: System evaluates requirements and assigns complexity -- **During Execution**: Can upgrade (Simple→Medium→Complex) but never downgrade -- **Override Allowed**: Users can specify higher complexity manually - -## Agent Integration - -### Agent Assignment -Based on task type and title keywords: -- **Planning tasks** → @action-planning-agent -- **Implementation** → @code-developer (code + tests) -- **Test execution/fixing** → @test-fix-agent -- **Review** → @universal-executor (optional, only when explicitly requested) - -### Execution Context -Agents receive complete task JSON plus workflow context: -```json -{ - "task": { /* complete task JSON */ }, - "workflow": { - "session": "WFS-user-auth", - "phase": "IMPLEMENT" - } -} -``` - diff --git a/.gitignore b/.gitignore index 706af4ad..dd10c194 100644 --- a/.gitignore +++ b/.gitignore @@ -143,3 +143,21 @@ ccw/.tmp-ccw-auth-home/ docs/node_modules/ docs/.vitepress/dist/ docs/.vitepress/cache/ +codex-lens/.cache/huggingface/hub/models--Xenova--ms-marco-MiniLM-L-6-v2/refs/main +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/.gitattributes +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/config.json +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/quantize_config.json +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/README.md +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/special_tokens_map.json +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/tokenizer_config.json +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/tokenizer.json +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/vocab.txt 
+codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_bnb4.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_fp16.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_int8.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_q4.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_q4f16.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_quantized.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_uint8.onnx +codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model.onnx +codex-lens/data/registry.db diff --git a/codex-lens-v2/conftest.py b/codex-lens-v2/conftest.py new file mode 100644 index 00000000..abc6f46c --- /dev/null +++ b/codex-lens-v2/conftest.py @@ -0,0 +1,5 @@ +import sys +import os + +# Ensure the local src directory takes precedence over any installed codexlens package +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "src")) diff --git a/codex-lens-v2/pyproject.toml b/codex-lens-v2/pyproject.toml new file mode 100644 index 00000000..c5834b00 --- /dev/null +++ b/codex-lens-v2/pyproject.toml @@ -0,0 +1,36 @@ +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[project] +name = "codex-lens-v2" +version = "0.1.0" +description = "Minimal code semantic search library with 2-stage pipeline" +requires-python = ">=3.10" +dependencies = [] + +[project.optional-dependencies] +semantic = [ + "hnswlib>=0.8.0", + "numpy>=1.26", + "fastembed>=0.4.0,<2.0", +] +gpu = [ + "onnxruntime-gpu>=1.16", +] +faiss-cpu = [ + "faiss-cpu>=1.7.4", +] +faiss-gpu = [ + "faiss-gpu>=1.7.4", +] +reranker-api = [ + "httpx>=0.25", +] +dev = [ + "pytest>=7.0", + "pytest-cov", +] + +[tool.hatch.build.targets.wheel] +packages = ["src/codexlens"] diff --git a/codex-lens-v2/scripts/index_and_search.py 
b/codex-lens-v2/scripts/index_and_search.py new file mode 100644 index 00000000..bad0e666 --- /dev/null +++ b/codex-lens-v2/scripts/index_and_search.py @@ -0,0 +1,128 @@ +""" +Index the D:/Claude_dms3 repository and run test searches. +Usage: python scripts/index_and_search.py +""" +import sys +import time +from pathlib import Path + +# Ensure src is importable +sys.path.insert(0, str(Path(__file__).parent.parent / "src")) + +from codexlens.config import Config +from codexlens.core.factory import create_ann_index, create_binary_index +from codexlens.embed.local import FastEmbedEmbedder +from codexlens.indexing import IndexingPipeline +from codexlens.rerank.local import FastEmbedReranker +from codexlens.search.fts import FTSEngine +from codexlens.search.pipeline import SearchPipeline + +# ─── Configuration ────────────────────────────────────────────────────────── +REPO_ROOT = Path("D:/Claude_dms3") +INDEX_DIR = Path("D:/Claude_dms3/codex-lens-v2/.index_cache") +EXTENSIONS = {".py", ".ts", ".js", ".md"} +MAX_FILE_SIZE = 50_000 # bytes +MAX_CHUNK_CHARS = 800 # maximum characters per chunk +CHUNK_OVERLAP = 100 + +# ─── File collection ─────────────────────────────────────────────────────── +SKIP_DIRS = { + ".git", "node_modules", "__pycache__", ".pytest_cache", + "dist", "build", ".venv", "venv", ".cache", ".index_cache", + "codex-lens-v2", # do not index this project itself +} + +def collect_files(root: Path) -> list[Path]: + files = [] + for p in root.rglob("*"): + if any(part in SKIP_DIRS for part in p.parts): + continue + if p.is_file() and p.suffix in EXTENSIONS: + if p.stat().st_size <= MAX_FILE_SIZE: + files.append(p) + return files + +# ─── Main flow ───────────────────────────────────────────────────────────── +def main(): + INDEX_DIR.mkdir(parents=True, exist_ok=True) + + # 1. Use the small profile to speed things up + config = Config( + embed_model="BAAI/bge-small-en-v1.5", + embed_dim=384, + embed_batch_size=32, + hnsw_ef=100, + hnsw_M=16, + binary_top_k=100, + ann_top_k=30, + reranker_top_k=10, + ) + + print("=== codex-lens-v2 indexing test ===\n") + + # 2.
Collect files + print(f"[1/4] Scanning {REPO_ROOT} ...") + files = collect_files(REPO_ROOT) + print(f" Found {len(files)} files") + + # 3. Initialize components + print(f"\n[2/4] Loading embedding model (bge-small-en-v1.5, dim=384) ...") + embedder = FastEmbedEmbedder(config) + binary_store = create_binary_index(INDEX_DIR, config.embed_dim, config) + ann_index = create_ann_index(INDEX_DIR, config.embed_dim, config) + fts = FTSEngine(":memory:") # in-memory FTS, not persisted + + # 4. Index files in parallel with IndexingPipeline (chunk -> embed -> index) + print(f"[3/4] Indexing {len(files)} files in parallel ...") + pipeline = IndexingPipeline( + embedder=embedder, + binary_store=binary_store, + ann_index=ann_index, + fts=fts, + config=config, + ) + stats = pipeline.index_files( + files, + root=REPO_ROOT, + max_chunk_chars=MAX_CHUNK_CHARS, + chunk_overlap=CHUNK_OVERLAP, + max_file_size=MAX_FILE_SIZE, + ) + print(f" Indexing done: {stats.files_processed} files, {stats.chunks_created} chunks ({stats.duration_seconds:.1f}s)") + + # 5. Search test + print(f"\n[4/4] Building SearchPipeline ...") + reranker = FastEmbedReranker(config) + search_pipeline = SearchPipeline( + embedder=embedder, + binary_store=binary_store, + ann_index=ann_index, + reranker=reranker, + fts=fts, + config=config, + ) + + queries = [ + "authentication middleware function", + "def embed_single", + "RRF fusion weights", + "fastembed TextCrossEncoder reranker", + "how to search code semantic", + ] + + print("\n" + "=" * 60) + for query in queries: + t0 = time.time() + results = search_pipeline.search(query, top_k=5) + elapsed = time.time() - t0 + print(f"\nQuery: {query!r} ({elapsed*1000:.0f}ms)") + if results: + for r in results: + print(f" [{r.score:.3f}] {r.path}") + else: + print(" (no results)") + print("=" * 60) + print("\nTest complete ✓") + +if __name__ == "__main__": + main() diff --git a/codex-lens-v2/src/codexlens/__init__.py b/codex-lens-v2/src/codexlens/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/codex-lens-v2/src/codexlens/config.py b/codex-lens-v2/src/codexlens/config.py new file mode 100644 index 00000000..6f8d7ddd
--- /dev/null +++ b/codex-lens-v2/src/codexlens/config.py @@ -0,0 +1,99 @@ +from __future__ import annotations +import logging +from dataclasses import dataclass, field + +log = logging.getLogger(__name__) + + +@dataclass +class Config: + # Embedding + embed_model: str = "jinaai/jina-embeddings-v2-base-code" + embed_dim: int = 768 + embed_batch_size: int = 64 + + # GPU / execution providers + device: str = "auto" # 'auto', 'cuda', 'cpu' + embed_providers: list[str] | None = None # explicit ONNX providers override + + # Backend selection: 'auto', 'faiss', 'hnswlib' + ann_backend: str = "auto" + binary_backend: str = "auto" + + # Indexing pipeline + index_workers: int = 2 # number of parallel indexing workers + + # HNSW index (ANNIndex) + hnsw_ef: int = 150 + hnsw_M: int = 32 + hnsw_ef_construction: int = 200 + + # Binary coarse search (BinaryStore) + binary_top_k: int = 200 + + # ANN fine search + ann_top_k: int = 50 + + # Reranker + reranker_model: str = "BAAI/bge-reranker-v2-m3" + reranker_top_k: int = 20 + reranker_batch_size: int = 32 + + # API reranker (optional) + reranker_api_url: str = "" + reranker_api_key: str = "" + reranker_api_model: str = "" + reranker_api_max_tokens_per_batch: int = 2048 + + # FTS + fts_top_k: int = 50 + + # Fusion + fusion_k: int = 60 # RRF k parameter + fusion_weights: dict = field(default_factory=lambda: { + "exact": 0.25, + "fuzzy": 0.10, + "vector": 0.50, + "graph": 0.15, + }) + + def resolve_embed_providers(self) -> list[str]: + """Return ONNX execution providers based on device config. + + Priority: explicit embed_providers > device setting > auto-detect. 
+ """ + if self.embed_providers is not None: + return list(self.embed_providers) + + if self.device == "cuda": + return ["CUDAExecutionProvider", "CPUExecutionProvider"] + + if self.device == "cpu": + return ["CPUExecutionProvider"] + + # auto-detect + try: + import onnxruntime + available = onnxruntime.get_available_providers() + if "CUDAExecutionProvider" in available: + log.info("CUDA detected via onnxruntime, using GPU for embedding") + return ["CUDAExecutionProvider", "CPUExecutionProvider"] + except ImportError: + pass + + return ["CPUExecutionProvider"] + + @classmethod + def defaults(cls) -> "Config": + return cls() + + @classmethod + def small(cls) -> "Config": + """Smaller config for testing or small corpora.""" + return cls( + hnsw_ef=50, + hnsw_M=16, + binary_top_k=50, + ann_top_k=20, + reranker_top_k=10, + ) diff --git a/codex-lens-v2/src/codexlens/core/__init__.py b/codex-lens-v2/src/codexlens/core/__init__.py new file mode 100644 index 00000000..fde43df6 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/__init__.py @@ -0,0 +1,13 @@ +from .base import BaseANNIndex, BaseBinaryIndex +from .binary import BinaryStore +from .factory import create_ann_index, create_binary_index +from .index import ANNIndex + +__all__ = [ + "BaseANNIndex", + "BaseBinaryIndex", + "ANNIndex", + "BinaryStore", + "create_ann_index", + "create_binary_index", +] diff --git a/codex-lens-v2/src/codexlens/core/base.py b/codex-lens-v2/src/codexlens/core/base.py new file mode 100644 index 00000000..20820347 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/base.py @@ -0,0 +1,83 @@ +from __future__ import annotations + +from abc import ABC, abstractmethod + +import numpy as np + + +class BaseANNIndex(ABC): + """Abstract base class for approximate nearest neighbor indexes.""" + + @abstractmethod + def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: + """Add float32 vectors with corresponding IDs. 
+ + Args: + ids: shape (N,) int64 + vectors: shape (N, dim) float32 + """ + + @abstractmethod + def fine_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search for nearest neighbors. + + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results + + Returns: + (ids, distances) as numpy arrays + """ + + @abstractmethod + def save(self) -> None: + """Persist index to disk.""" + + @abstractmethod + def load(self) -> None: + """Load index from disk.""" + + @abstractmethod + def __len__(self) -> int: + """Return the number of indexed items.""" + + +class BaseBinaryIndex(ABC): + """Abstract base class for binary vector indexes (Hamming distance).""" + + @abstractmethod + def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: + """Add float32 vectors (will be binary-quantized internally). + + Args: + ids: shape (N,) int64 + vectors: shape (N, dim) float32 + """ + + @abstractmethod + def coarse_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search by Hamming distance. 
+ + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results + + Returns: + (ids, distances) sorted ascending by distance + """ + + @abstractmethod + def save(self) -> None: + """Persist store to disk.""" + + @abstractmethod + def load(self) -> None: + """Load store from disk.""" + + @abstractmethod + def __len__(self) -> int: + """Return the number of stored items.""" diff --git a/codex-lens-v2/src/codexlens/core/binary.py b/codex-lens-v2/src/codexlens/core/binary.py new file mode 100644 index 00000000..59f5a432 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/binary.py @@ -0,0 +1,173 @@ +from __future__ import annotations + +import logging +import math +from pathlib import Path + +import numpy as np + +from codexlens.config import Config +from codexlens.core.base import BaseBinaryIndex + +logger = logging.getLogger(__name__) + + +class BinaryStore(BaseBinaryIndex): + """Persistent binary vector store using numpy memmap. + + Stores binary-quantized float32 vectors as packed uint8 arrays on disk. + Supports fast coarse search via XOR + popcount Hamming distance. 
+ """ + + def __init__(self, path: str | Path, dim: int, config: Config) -> None: + self._dir = Path(path) + self._dim = dim + self._config = config + self._packed_bytes = math.ceil(dim / 8) + + self._bin_path = self._dir / "binary_store.bin" + self._ids_path = self._dir / "binary_store_ids.npy" + + self._matrix: np.ndarray | None = None # shape (N, packed_bytes), uint8 + self._ids: np.ndarray | None = None # shape (N,), int64 + self._count: int = 0 + + if self._bin_path.exists() and self._ids_path.exists(): + self.load() + + # ------------------------------------------------------------------ + # Internal helpers + # ------------------------------------------------------------------ + + def _quantize(self, vectors: np.ndarray) -> np.ndarray: + """Convert float32 vectors (N, dim) to packed uint8 (N, packed_bytes).""" + binary = (vectors > 0).astype(np.uint8) + packed = np.packbits(binary, axis=1) + return packed + + def _quantize_single(self, vec: np.ndarray) -> np.ndarray: + """Convert a single float32 vector (dim,) to packed uint8 (packed_bytes,).""" + binary = (vec > 0).astype(np.uint8) + return np.packbits(binary) + + # ------------------------------------------------------------------ + # Public API + # ------------------------------------------------------------------ + + def _ensure_capacity(self, needed: int) -> None: + """Grow pre-allocated matrix/ids arrays to fit *needed* total items.""" + if self._matrix is not None and self._matrix.shape[0] >= needed: + return + + new_cap = max(1024, needed) + # Double until large enough + if self._matrix is not None: + cur_cap = self._matrix.shape[0] + new_cap = max(cur_cap, 1024) + while new_cap < needed: + new_cap *= 2 + + new_matrix = np.zeros((new_cap, self._packed_bytes), dtype=np.uint8) + new_ids = np.zeros(new_cap, dtype=np.int64) + + if self._matrix is not None and self._count > 0: + new_matrix[: self._count] = self._matrix[: self._count] + new_ids[: self._count] = self._ids[: self._count] + + self._matrix = 
new_matrix + self._ids = new_ids + + def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: + """Add float32 vectors and their ids. + + Does NOT call save() internally -- callers must call save() + explicitly after batch indexing. + + Args: + ids: shape (N,) int64 + vectors: shape (N, dim) float32 + """ + if len(ids) == 0: + return + + packed = self._quantize(vectors) # (N, packed_bytes) + n = len(ids) + + self._ensure_capacity(self._count + n) + self._matrix[self._count : self._count + n] = packed + self._ids[self._count : self._count + n] = ids.astype(np.int64) + self._count += n + + def coarse_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search by Hamming distance. + + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results; defaults to config.binary_top_k + + Returns: + (ids, distances) sorted ascending by Hamming distance + """ + if self._matrix is None or self._count == 0: + return np.array([], dtype=np.int64), np.array([], dtype=np.int32) + + k = top_k if top_k is not None else self._config.binary_top_k + k = min(k, self._count) + + query_bin = self._quantize_single(query_vec) # (packed_bytes,) + + # Slice to active region (matrix may be pre-allocated larger) + active_matrix = self._matrix[: self._count] + active_ids = self._ids[: self._count] + + # XOR then popcount via unpackbits + xor = np.bitwise_xor(active_matrix, query_bin[np.newaxis, :]) # (N, packed_bytes) + dists = np.unpackbits(xor, axis=1).sum(axis=1).astype(np.int32) # (N,) + + if k >= self._count: + order = np.argsort(dists) + else: + part = np.argpartition(dists, k)[:k] + order = part[np.argsort(dists[part])] + + return active_ids[order], dists[order] + + def save(self) -> None: + """Flush binary store to disk.""" + if self._matrix is None or self._count == 0: + return + self._dir.mkdir(parents=True, exist_ok=True) + # Write only the occupied portion of the pre-allocated matrix + active_matrix = 
self._matrix[: self._count] + mm = np.memmap( + str(self._bin_path), + dtype=np.uint8, + mode="w+", + shape=active_matrix.shape, + ) + mm[:] = active_matrix + mm.flush() + del mm + np.save(str(self._ids_path), self._ids[: self._count]) + + def load(self) -> None: + """Reload binary store from disk.""" + ids = np.load(str(self._ids_path)) + n = len(ids) + if n == 0: + return + mm = np.memmap( + str(self._bin_path), + dtype=np.uint8, + mode="r", + shape=(n, self._packed_bytes), + ) + self._matrix = np.array(mm) # copy into RAM for mutation support + del mm + self._ids = ids.astype(np.int64) + self._count = n + + def __len__(self) -> int: + return self._count diff --git a/codex-lens-v2/src/codexlens/core/factory.py b/codex-lens-v2/src/codexlens/core/factory.py new file mode 100644 index 00000000..c1d0f2a8 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/factory.py @@ -0,0 +1,116 @@ +from __future__ import annotations + +import logging +from pathlib import Path + +from codexlens.config import Config +from codexlens.core.base import BaseANNIndex, BaseBinaryIndex + +logger = logging.getLogger(__name__) + +try: + import faiss as _faiss # noqa: F401 + _FAISS_AVAILABLE = True +except ImportError: + _FAISS_AVAILABLE = False + +try: + import hnswlib as _hnswlib # noqa: F401 + _HNSWLIB_AVAILABLE = True +except ImportError: + _HNSWLIB_AVAILABLE = False + + +def _has_faiss_gpu() -> bool: + """Check whether faiss-gpu is available (has GPU resources).""" + if not _FAISS_AVAILABLE: + return False + try: + import faiss + res = faiss.StandardGpuResources() # noqa: F841 + return True + except (AttributeError, RuntimeError): + return False + + +def create_ann_index(path: str | Path, dim: int, config: Config) -> BaseANNIndex: + """Create an ANN index based on config.ann_backend. + + Fallback chain for 'auto': faiss-gpu -> faiss-cpu -> hnswlib. 
+ + Args: + path: directory for index persistence + dim: vector dimensionality + config: project configuration + + Returns: + A BaseANNIndex implementation + + Raises: + ImportError: if no suitable backend is available + """ + backend = config.ann_backend + + if backend == "faiss": + from codexlens.core.faiss_index import FAISSANNIndex + return FAISSANNIndex(path, dim, config) + + if backend == "hnswlib": + from codexlens.core.index import ANNIndex + return ANNIndex(path, dim, config) + + # auto: try faiss first, then hnswlib + if _FAISS_AVAILABLE: + from codexlens.core.faiss_index import FAISSANNIndex + gpu_tag = " (GPU available)" if _has_faiss_gpu() else " (CPU)" + logger.info("Auto-selected FAISS ANN backend%s", gpu_tag) + return FAISSANNIndex(path, dim, config) + + if _HNSWLIB_AVAILABLE: + from codexlens.core.index import ANNIndex + logger.info("Auto-selected hnswlib ANN backend") + return ANNIndex(path, dim, config) + + raise ImportError( + "No ANN backend available. Install faiss-cpu, faiss-gpu, or hnswlib." + ) + + +def create_binary_index( + path: str | Path, dim: int, config: Config +) -> BaseBinaryIndex: + """Create a binary index based on config.binary_backend. + + Fallback chain for 'auto': faiss -> numpy BinaryStore. 
+ + Args: + path: directory for index persistence + dim: vector dimensionality + config: project configuration + + Returns: + A BaseBinaryIndex implementation + + Raises: + ImportError: if no suitable backend is available + """ + backend = config.binary_backend + + if backend == "faiss": + from codexlens.core.faiss_index import FAISSBinaryIndex + return FAISSBinaryIndex(path, dim, config) + + if backend == "hnswlib": + from codexlens.core.binary import BinaryStore + return BinaryStore(path, dim, config) + + # auto: try faiss first, then numpy-based BinaryStore + if _FAISS_AVAILABLE: + from codexlens.core.faiss_index import FAISSBinaryIndex + logger.info("Auto-selected FAISS binary backend") + return FAISSBinaryIndex(path, dim, config) + + # numpy BinaryStore is always available (no extra deps) + from codexlens.core.binary import BinaryStore + logger.info("Auto-selected numpy BinaryStore backend") + return BinaryStore(path, dim, config) diff --git a/codex-lens-v2/src/codexlens/core/faiss_index.py b/codex-lens-v2/src/codexlens/core/faiss_index.py new file mode 100644 index 00000000..e076af50 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/faiss_index.py @@ -0,0 +1,275 @@ +from __future__ import annotations + +import logging +import math +import threading +from pathlib import Path + +import numpy as np + +from codexlens.config import Config +from codexlens.core.base import BaseANNIndex, BaseBinaryIndex + +logger = logging.getLogger(__name__) + +try: + import faiss + _FAISS_AVAILABLE = True +except ImportError: + faiss = None # type: ignore[assignment] + _FAISS_AVAILABLE = False + + +def _try_gpu_index(index: "faiss.Index") -> "faiss.Index": + """Transfer a FAISS index to GPU if faiss-gpu is available. + + Returns the GPU index on success, or the original CPU index on failure. 
+ """ + try: + res = faiss.StandardGpuResources() + gpu_index = faiss.index_cpu_to_gpu(res, 0, index) + logger.info("FAISS index transferred to GPU 0") + return gpu_index + except (AttributeError, RuntimeError) as exc: + logger.debug("GPU transfer unavailable, staying on CPU: %s", exc) + return index + + +def _to_cpu_for_save(index: "faiss.Index") -> "faiss.Index": + """Convert a GPU index back to CPU for serialization.""" + try: + return faiss.index_gpu_to_cpu(index) + except (AttributeError, RuntimeError): + return index + + +class FAISSANNIndex(BaseANNIndex): + """FAISS-based ANN index using IndexHNSWFlat with optional GPU. + + Uses Inner Product space with L2-normalized vectors for cosine similarity. + Thread-safe via RLock. + """ + + def __init__(self, path: str | Path, dim: int, config: Config) -> None: + if not _FAISS_AVAILABLE: + raise ImportError( + "faiss is required. Install with: pip install faiss-cpu " + "or pip install faiss-gpu" + ) + + self._path = Path(path) + self._index_path = self._path / "faiss_ann.index" + self._dim = dim + self._config = config + self._lock = threading.RLock() + self._index: faiss.Index | None = None + + def _ensure_loaded(self) -> None: + """Load or initialize the index (caller holds lock).""" + if self._index is not None: + return + self.load() + + def load(self) -> None: + """Load index from disk or initialize a fresh one.""" + with self._lock: + if self._index_path.exists(): + idx = faiss.read_index(str(self._index_path)) + logger.debug( + "Loaded FAISS ANN index from %s (%d items)", + self._index_path, idx.ntotal, + ) + else: + # HNSW with flat storage, M=32 by default + m = self._config.hnsw_M + idx = faiss.IndexHNSWFlat(self._dim, m, faiss.METRIC_INNER_PRODUCT) + idx.hnsw.efConstruction = self._config.hnsw_ef_construction + idx.hnsw.efSearch = self._config.hnsw_ef + logger.debug( + "Initialized fresh FAISS HNSW index (dim=%d, M=%d)", + self._dim, m, + ) + self._index = _try_gpu_index(idx) + + def add(self, ids: 
np.ndarray, vectors: np.ndarray) -> None: + """Add L2-normalized float32 vectors. + + Vectors are normalized before insertion so that Inner Product + distance equals cosine similarity. + + Args: + ids: shape (N,) int64 -- currently unused by FAISS flat index + but kept for API compatibility. FAISS uses sequential IDs. + vectors: shape (N, dim) float32 + """ + if len(ids) == 0: + return + + vecs = np.ascontiguousarray(vectors, dtype=np.float32) + # Normalize for cosine similarity via Inner Product + faiss.normalize_L2(vecs) + + with self._lock: + self._ensure_loaded() + self._index.add(vecs) + + def fine_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search for nearest neighbors. + + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results; defaults to config.ann_top_k + + Returns: + (ids, distances) as numpy arrays. For IP space, higher = more + similar, but distances are returned as-is for consumer handling. 
+ """ + k = top_k if top_k is not None else self._config.ann_top_k + + with self._lock: + self._ensure_loaded() + + count = self._index.ntotal + if count == 0: + return np.array([], dtype=np.int64), np.array([], dtype=np.float32) + + k = min(k, count) + # Set efSearch for HNSW accuracy + try: + self._index.hnsw.efSearch = max(self._config.hnsw_ef, k) + except AttributeError: + pass # GPU index may not expose hnsw attribute directly + + q = np.ascontiguousarray(query_vec, dtype=np.float32).reshape(1, -1) + faiss.normalize_L2(q) + distances, labels = self._index.search(q, k) + return labels[0].astype(np.int64), distances[0].astype(np.float32) + + def save(self) -> None: + """Save index to disk.""" + with self._lock: + if self._index is None: + return + self._path.mkdir(parents=True, exist_ok=True) + cpu_index = _to_cpu_for_save(self._index) + faiss.write_index(cpu_index, str(self._index_path)) + + def __len__(self) -> int: + with self._lock: + if self._index is None: + return 0 + return self._index.ntotal + + +class FAISSBinaryIndex(BaseBinaryIndex): + """FAISS-based binary index using IndexBinaryFlat for Hamming distance. + + Vectors are binary-quantized (sign bit) before insertion. + Thread-safe via RLock. + """ + + def __init__(self, path: str | Path, dim: int, config: Config) -> None: + if not _FAISS_AVAILABLE: + raise ImportError( + "faiss is required. 
Install with: pip install faiss-cpu " + "or pip install faiss-gpu" + ) + + self._path = Path(path) + self._index_path = self._path / "faiss_binary.index" + self._dim = dim + self._config = config + self._packed_bytes = math.ceil(dim / 8) + self._lock = threading.RLock() + self._index: faiss.IndexBinary | None = None + + def _ensure_loaded(self) -> None: + if self._index is not None: + return + self.load() + + def _quantize(self, vectors: np.ndarray) -> np.ndarray: + """Convert float32 vectors (N, dim) to packed uint8 (N, packed_bytes).""" + binary = (vectors > 0).astype(np.uint8) + return np.packbits(binary, axis=1) + + def _quantize_single(self, vec: np.ndarray) -> np.ndarray: + """Convert a single float32 vector (dim,) to packed uint8 (1, packed_bytes).""" + binary = (vec > 0).astype(np.uint8) + return np.packbits(binary).reshape(1, -1) + + def load(self) -> None: + """Load binary index from disk or initialize a fresh one.""" + with self._lock: + if self._index_path.exists(): + idx = faiss.read_index_binary(str(self._index_path)) + logger.debug( + "Loaded FAISS binary index from %s (%d items)", + self._index_path, idx.ntotal, + ) + else: + # IndexBinaryFlat takes dimension in bits + idx = faiss.IndexBinaryFlat(self._dim) + logger.debug( + "Initialized fresh FAISS binary index (dim_bits=%d)", self._dim, + ) + self._index = idx + + def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: + """Add float32 vectors (binary-quantized internally). + + Args: + ids: shape (N,) int64 -- kept for API compatibility + vectors: shape (N, dim) float32 + """ + if len(ids) == 0: + return + + packed = self._quantize(vectors) + packed = np.ascontiguousarray(packed, dtype=np.uint8) + + with self._lock: + self._ensure_loaded() + self._index.add(packed) + + def coarse_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search by Hamming distance. 
+ + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results; defaults to config.binary_top_k + + Returns: + (ids, distances) sorted ascending by Hamming distance + """ + with self._lock: + self._ensure_loaded() + + if self._index.ntotal == 0: + return np.array([], dtype=np.int64), np.array([], dtype=np.int32) + + k = top_k if top_k is not None else self._config.binary_top_k + k = min(k, self._index.ntotal) + + q = self._quantize_single(query_vec) + q = np.ascontiguousarray(q, dtype=np.uint8) + distances, labels = self._index.search(q, k) + return labels[0].astype(np.int64), distances[0].astype(np.int32) + + def save(self) -> None: + """Save binary index to disk.""" + with self._lock: + if self._index is None: + return + self._path.mkdir(parents=True, exist_ok=True) + faiss.write_index_binary(self._index, str(self._index_path)) + + def __len__(self) -> int: + with self._lock: + if self._index is None: + return 0 + return self._index.ntotal diff --git a/codex-lens-v2/src/codexlens/core/index.py b/codex-lens-v2/src/codexlens/core/index.py new file mode 100644 index 00000000..dc92d581 --- /dev/null +++ b/codex-lens-v2/src/codexlens/core/index.py @@ -0,0 +1,136 @@ +from __future__ import annotations + +import logging +import threading +from pathlib import Path + +import numpy as np + +from codexlens.config import Config +from codexlens.core.base import BaseANNIndex + +logger = logging.getLogger(__name__) + +try: + import hnswlib + _HNSWLIB_AVAILABLE = True +except ImportError: + _HNSWLIB_AVAILABLE = False + + +class ANNIndex(BaseANNIndex): + """HNSW-based approximate nearest neighbor index. + + Lazy-loads on first use, thread-safe via RLock. + """ + + def __init__(self, path: str | Path, dim: int, config: Config) -> None: + if not _HNSWLIB_AVAILABLE: + raise ImportError("hnswlib is required. 
Install with: pip install hnswlib") + + self._path = Path(path) + self._hnsw_path = self._path / "ann_index.hnsw" + self._dim = dim + self._config = config + self._lock = threading.RLock() + self._index: hnswlib.Index | None = None + + # ------------------------------------------------------------------ + # Internal helpers + # ------------------------------------------------------------------ + + def _ensure_loaded(self) -> None: + """Load or initialize the index (caller holds lock).""" + if self._index is not None: + return + self.load() + + # ------------------------------------------------------------------ + # Public API + # ------------------------------------------------------------------ + + def load(self) -> None: + """Load index from disk or initialize a fresh one.""" + with self._lock: + idx = hnswlib.Index(space="cosine", dim=self._dim) + if self._hnsw_path.exists(): + idx.load_index(str(self._hnsw_path), max_elements=0) + idx.set_ef(self._config.hnsw_ef) + logger.debug("Loaded HNSW index from %s (%d items)", self._hnsw_path, idx.get_current_count()) + else: + idx.init_index( + max_elements=1000, + ef_construction=self._config.hnsw_ef_construction, + M=self._config.hnsw_M, + ) + idx.set_ef(self._config.hnsw_ef) + logger.debug("Initialized fresh HNSW index (dim=%d)", self._dim) + self._index = idx + + def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: + """Add float32 vectors. + + Does NOT call save() internally -- callers must call save() + explicitly after batch indexing. 
+ + Args: + ids: shape (N,) int64 + vectors: shape (N, dim) float32 + """ + if len(ids) == 0: + return + + vecs = np.ascontiguousarray(vectors, dtype=np.float32) + + with self._lock: + self._ensure_loaded() + # Expand capacity if needed + current = self._index.get_current_count() + max_el = self._index.get_max_elements() + needed = current + len(ids) + if needed > max_el: + new_cap = max(max_el * 2, needed + 100) + self._index.resize_index(new_cap) + self._index.add_items(vecs, ids.astype(np.int64)) + + def fine_search( + self, query_vec: np.ndarray, top_k: int | None = None + ) -> tuple[np.ndarray, np.ndarray]: + """Search for nearest neighbors. + + Args: + query_vec: float32 vector of shape (dim,) + top_k: number of results; defaults to config.ann_top_k + + Returns: + (ids, distances) as numpy arrays + """ + k = top_k if top_k is not None else self._config.ann_top_k + + with self._lock: + self._ensure_loaded() + + count = self._index.get_current_count() + if count == 0: + return np.array([], dtype=np.int64), np.array([], dtype=np.float32) + + k = min(k, count) + self._index.set_ef(max(self._config.hnsw_ef, k)) + + q = np.ascontiguousarray(query_vec, dtype=np.float32).reshape(1, -1) + labels, distances = self._index.knn_query(q, k=k) + return labels[0].astype(np.int64), distances[0].astype(np.float32) + + def save(self) -> None: + """Save index to disk (caller may or may not hold lock).""" + with self._lock: + if self._index is None: + return + self._path.mkdir(parents=True, exist_ok=True) + self._index.save_index(str(self._hnsw_path)) + + def __len__(self) -> int: + with self._lock: + if self._index is None: + return 0 + return self._index.get_current_count() diff --git a/codex-lens-v2/src/codexlens/embed/__init__.py b/codex-lens-v2/src/codexlens/embed/__init__.py new file mode 100644 index 00000000..43df6b7f --- /dev/null +++ b/codex-lens-v2/src/codexlens/embed/__init__.py @@ -0,0 +1,4 @@ +from .base import BaseEmbedder +from .local import FastEmbedEmbedder, 
EMBED_PROFILES + +__all__ = ["BaseEmbedder", "FastEmbedEmbedder", "EMBED_PROFILES"] diff --git a/codex-lens-v2/src/codexlens/embed/base.py b/codex-lens-v2/src/codexlens/embed/base.py new file mode 100644 index 00000000..7e78e75f --- /dev/null +++ b/codex-lens-v2/src/codexlens/embed/base.py @@ -0,0 +1,13 @@ +from __future__ import annotations +from abc import ABC, abstractmethod +import numpy as np + + +class BaseEmbedder(ABC): + @abstractmethod + def embed_single(self, text: str) -> np.ndarray: + """Embed a single text, returns float32 ndarray shape (dim,).""" + + @abstractmethod + def embed_batch(self, texts: list[str]) -> list[np.ndarray]: + """Embed a list of texts, returns list of float32 ndarrays.""" diff --git a/codex-lens-v2/src/codexlens/embed/local.py b/codex-lens-v2/src/codexlens/embed/local.py new file mode 100644 index 00000000..8e314347 --- /dev/null +++ b/codex-lens-v2/src/codexlens/embed/local.py @@ -0,0 +1,53 @@ +from __future__ import annotations + +import numpy as np + +from ..config import Config +from .base import BaseEmbedder + +EMBED_PROFILES = { + "small": "BAAI/bge-small-en-v1.5", # 384d + "base": "BAAI/bge-base-en-v1.5", # 768d + "large": "BAAI/bge-large-en-v1.5", # 1024d + "code": "jinaai/jina-embeddings-v2-base-code", # 768d +} + + +class FastEmbedEmbedder(BaseEmbedder): + """Embedder backed by fastembed.TextEmbedding with lazy model loading.""" + + def __init__(self, config: Config) -> None: + self._config = config + self._model = None + + def _load(self) -> None: + """Lazy-load the fastembed TextEmbedding model on first use.""" + if self._model is not None: + return + from fastembed import TextEmbedding + providers = self._config.resolve_embed_providers() + try: + self._model = TextEmbedding( + model_name=self._config.embed_model, + providers=providers, + ) + except TypeError: + # Older fastembed versions may not accept providers kwarg + self._model = TextEmbedding(model_name=self._config.embed_model) + + def embed_single(self, text: 
str) -> np.ndarray: + """Embed a single text, returns float32 ndarray of shape (dim,).""" + self._load() + result = list(self._model.embed([text])) + return result[0].astype(np.float32) + + def embed_batch(self, texts: list[str]) -> list[np.ndarray]: + """Embed a list of texts in batches, returns list of float32 ndarrays.""" + self._load() + batch_size = self._config.embed_batch_size + results: list[np.ndarray] = [] + for start in range(0, len(texts), batch_size): + batch = texts[start : start + batch_size] + for vec in self._model.embed(batch): + results.append(vec.astype(np.float32)) + return results diff --git a/codex-lens-v2/src/codexlens/indexing/__init__.py b/codex-lens-v2/src/codexlens/indexing/__init__.py new file mode 100644 index 00000000..16a9a35b --- /dev/null +++ b/codex-lens-v2/src/codexlens/indexing/__init__.py @@ -0,0 +1,5 @@ +from __future__ import annotations + +from .pipeline import IndexingPipeline, IndexStats + +__all__ = ["IndexingPipeline", "IndexStats"] diff --git a/codex-lens-v2/src/codexlens/indexing/pipeline.py b/codex-lens-v2/src/codexlens/indexing/pipeline.py new file mode 100644 index 00000000..8e11c4c7 --- /dev/null +++ b/codex-lens-v2/src/codexlens/indexing/pipeline.py @@ -0,0 +1,277 @@ +"""Three-stage parallel indexing pipeline: chunk -> embed -> index. + +Uses threading.Thread with queue.Queue for producer-consumer handoff. +The GIL is acceptable because embedding (onnxruntime) releases it in C extensions. 
+""" +from __future__ import annotations + +import logging +import queue +import threading +import time +from dataclasses import dataclass +from pathlib import Path + +import numpy as np + +from codexlens.config import Config +from codexlens.core.binary import BinaryStore +from codexlens.core.index import ANNIndex +from codexlens.embed.base import BaseEmbedder +from codexlens.search.fts import FTSEngine + +logger = logging.getLogger(__name__) + +# Sentinel value to signal worker shutdown +_SENTINEL = None + +# Defaults for chunking (can be overridden via index_files kwargs) +_DEFAULT_MAX_CHUNK_CHARS = 800 +_DEFAULT_CHUNK_OVERLAP = 100 + + +@dataclass +class IndexStats: + """Statistics returned after indexing completes.""" + files_processed: int = 0 + chunks_created: int = 0 + duration_seconds: float = 0.0 + + +class IndexingPipeline: + """Parallel 3-stage indexing pipeline with queue-based handoff. + + Stage 1 (main thread): Read files, chunk text, push to embed_queue. + Stage 2 (embed worker): Pull text batches, call embed_batch(), push vectors to index_queue. + Stage 3 (index worker): Pull vectors+ids, call BinaryStore.add(), ANNIndex.add(), FTS.add_documents(). + + After all stages complete, save() is called on BinaryStore and ANNIndex exactly once. + """ + + def __init__( + self, + embedder: BaseEmbedder, + binary_store: BinaryStore, + ann_index: ANNIndex, + fts: FTSEngine, + config: Config, + ) -> None: + self._embedder = embedder + self._binary_store = binary_store + self._ann_index = ann_index + self._fts = fts + self._config = config + + def index_files( + self, + files: list[Path], + *, + root: Path | None = None, + max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS, + chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP, + max_file_size: int = 50_000, + ) -> IndexStats: + """Run the 3-stage pipeline on the given files. + + Args: + files: List of file paths to index. + root: Optional root for computing relative paths. 
If None, uses + each file's absolute path as its identifier. + max_chunk_chars: Maximum characters per chunk. + chunk_overlap: Character overlap between consecutive chunks. + max_file_size: Skip files larger than this (bytes). + + Returns: + IndexStats with counts and timing. + """ + if not files: + return IndexStats() + + t0 = time.monotonic() + + embed_queue: queue.Queue = queue.Queue(maxsize=4) + index_queue: queue.Queue = queue.Queue(maxsize=4) + + # Track errors from workers + worker_errors: list[Exception] = [] + error_lock = threading.Lock() + + def _record_error(exc: Exception) -> None: + with error_lock: + worker_errors.append(exc) + + # --- Start workers --- + embed_thread = threading.Thread( + target=self._embed_worker, + args=(embed_queue, index_queue, _record_error), + daemon=True, + name="indexing-embed", + ) + index_thread = threading.Thread( + target=self._index_worker, + args=(index_queue, _record_error), + daemon=True, + name="indexing-index", + ) + embed_thread.start() + index_thread.start() + + # --- Stage 1: chunk files (main thread) --- + chunk_id = 0 + files_processed = 0 + chunks_created = 0 + + for fpath in files: + try: + if fpath.stat().st_size > max_file_size: + continue + text = fpath.read_text(encoding="utf-8", errors="replace") + except Exception as exc: + logger.debug("Skipping %s: %s", fpath, exc) + continue + + rel_path = str(fpath.relative_to(root)) if root else str(fpath) + file_chunks = self._chunk_text(text, rel_path, max_chunk_chars, chunk_overlap) + + if not file_chunks: + continue + + files_processed += 1 + + # Assign sequential IDs and push batch to embed queue + batch_ids = [] + batch_texts = [] + batch_paths = [] + for chunk_text, path in file_chunks: + batch_ids.append(chunk_id) + batch_texts.append(chunk_text) + batch_paths.append(path) + chunk_id += 1 + + chunks_created += len(batch_ids) + embed_queue.put((batch_ids, batch_texts, batch_paths)) + + # Signal embed worker: no more data + embed_queue.put(_SENTINEL) + + # 
Wait for workers to finish + embed_thread.join() + index_thread.join() + + # --- Final flush --- + self._binary_store.save() + self._ann_index.save() + + duration = time.monotonic() - t0 + stats = IndexStats( + files_processed=files_processed, + chunks_created=chunks_created, + duration_seconds=round(duration, 2), + ) + + logger.info( + "Indexing complete: %d files, %d chunks in %.1fs", + stats.files_processed, + stats.chunks_created, + stats.duration_seconds, + ) + + # Raise first worker error if any occurred + if worker_errors: + raise worker_errors[0] + + return stats + + # ------------------------------------------------------------------ + # Workers + # ------------------------------------------------------------------ + + def _embed_worker( + self, + in_q: queue.Queue, + out_q: queue.Queue, + on_error: callable, + ) -> None: + """Stage 2: Pull chunk batches, embed, push (ids, vecs, docs) to index queue.""" + try: + while True: + item = in_q.get() + if item is _SENTINEL: + break + + batch_ids, batch_texts, batch_paths = item + try: + vecs = self._embedder.embed_batch(batch_texts) + vec_array = np.array(vecs, dtype=np.float32) + id_array = np.array(batch_ids, dtype=np.int64) + out_q.put((id_array, vec_array, batch_texts, batch_paths)) + except Exception as exc: + logger.error("Embed worker error: %s", exc) + on_error(exc) + finally: + # Signal index worker: no more data + out_q.put(_SENTINEL) + + def _index_worker( + self, + in_q: queue.Queue, + on_error: callable, + ) -> None: + """Stage 3: Pull (ids, vecs, texts, paths), write to stores.""" + while True: + item = in_q.get() + if item is _SENTINEL: + break + + id_array, vec_array, texts, paths = item + try: + self._binary_store.add(id_array, vec_array) + self._ann_index.add(id_array, vec_array) + + fts_docs = [ + (int(id_array[i]), paths[i], texts[i]) + for i in range(len(id_array)) + ] + self._fts.add_documents(fts_docs) + except Exception as exc: + logger.error("Index worker error: %s", exc) + on_error(exc) 
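The workers above hand data between stages through bounded `queue.Queue`s and shut down via a `None` sentinel that each stage forwards downstream (in a `finally` block) before exiting, so the main thread can simply join both workers before the final flush. A minimal standalone sketch of that pattern, with illustrative names that are not part of this diff:

```python
import queue
import threading


def run_two_stage(items: list[str]) -> list[str]:
    """Minimal sketch of the chunk -> embed -> index handoff pattern."""
    embed_q: queue.Queue = queue.Queue(maxsize=4)   # bounded, provides backpressure
    index_q: queue.Queue = queue.Queue(maxsize=4)
    out: list[str] = []

    def embed_worker() -> None:
        try:
            while True:
                item = embed_q.get()
                if item is None:          # sentinel: producer is done
                    break
                index_q.put(item.upper())  # stand-in for embed_batch()
        finally:
            index_q.put(None)             # always propagate shutdown downstream

    def index_worker() -> None:
        while True:
            item = index_q.get()
            if item is None:
                break
            out.append(item)              # stand-in for store/index writes

    t1 = threading.Thread(target=embed_worker, daemon=True)
    t2 = threading.Thread(target=index_worker, daemon=True)
    t1.start()
    t2.start()

    for it in items:                      # stage 1: producer (main thread)
        embed_q.put(it)
    embed_q.put(None)                     # signal: no more data
    t1.join()                             # then it is safe to flush/save
    t2.join()
    return out


print(run_two_stage(["a", "b", "c"]))  # → ['A', 'B', 'C']
```

With one consumer per queue, FIFO ordering is preserved end to end, which is why the pipeline can assign sequential chunk IDs in the producer.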
+ + # ------------------------------------------------------------------ + # Chunking + # ------------------------------------------------------------------ + + @staticmethod + def _chunk_text( + text: str, + path: str, + max_chars: int, + overlap: int, + ) -> list[tuple[str, str]]: + """Split file text into overlapping chunks. + + Returns list of (chunk_text, path) tuples. + """ + if not text.strip(): + return [] + + chunks: list[tuple[str, str]] = [] + lines = text.splitlines(keepends=True) + current: list[str] = [] + current_len = 0 + + for line in lines: + if current_len + len(line) > max_chars and current: + chunk = "".join(current) + chunks.append((chunk, path)) + # overlap: keep last N characters + tail = "".join(current)[-overlap:] + current = [tail] if tail else [] + current_len = len(tail) + current.append(line) + current_len += len(line) + + if current: + chunks.append(("".join(current), path)) + + return chunks diff --git a/codex-lens-v2/src/codexlens/rerank/__init__.py b/codex-lens-v2/src/codexlens/rerank/__init__.py new file mode 100644 index 00000000..2e2832fd --- /dev/null +++ b/codex-lens-v2/src/codexlens/rerank/__init__.py @@ -0,0 +1,5 @@ +from .base import BaseReranker +from .local import FastEmbedReranker +from .api import APIReranker + +__all__ = ["BaseReranker", "FastEmbedReranker", "APIReranker"] diff --git a/codex-lens-v2/src/codexlens/rerank/api.py b/codex-lens-v2/src/codexlens/rerank/api.py new file mode 100644 index 00000000..c56a221f --- /dev/null +++ b/codex-lens-v2/src/codexlens/rerank/api.py @@ -0,0 +1,103 @@ +from __future__ import annotations + +import logging +import time + +import httpx + +from codexlens.config import Config +from .base import BaseReranker + +logger = logging.getLogger(__name__) + + +class APIReranker(BaseReranker): + """Reranker backed by a remote HTTP API (SiliconFlow/Cohere/Jina format).""" + + def __init__(self, config: Config) -> None: + self._config = config + self._client = httpx.Client( + headers={ + 
"Authorization": f"Bearer {config.reranker_api_key}", + "Content-Type": "application/json", + }, + ) + + def score_pairs(self, query: str, documents: list[str]) -> list[float]: + if not documents: + return [] + max_tokens = self._config.reranker_api_max_tokens_per_batch + batches = self._split_batches(documents, max_tokens) + scores = [0.0] * len(documents) + for batch in batches: + batch_scores = self._call_api_with_retry(query, batch) + for orig_idx, score in batch_scores.items(): + scores[orig_idx] = score + return scores + + def _split_batches( + self, documents: list[str], max_tokens: int + ) -> list[list[tuple[int, str]]]: + batches: list[list[tuple[int, str]]] = [] + current_batch: list[tuple[int, str]] = [] + current_tokens = 0 + + for idx, text in enumerate(documents): + doc_tokens = len(text) // 4 + if current_tokens + doc_tokens > max_tokens and current_batch: + batches.append(current_batch) + current_batch = [] + current_tokens = 0 + current_batch.append((idx, text)) + current_tokens += doc_tokens + + if current_batch: + batches.append(current_batch) + + return batches + + def _call_api_with_retry( + self, + query: str, + docs: list[tuple[int, str]], + max_retries: int = 3, + ) -> dict[int, float]: + url = self._config.reranker_api_url.rstrip("/") + "/rerank" + payload = { + "model": self._config.reranker_api_model, + "query": query, + "documents": [t for _, t in docs], + } + + last_exc: Exception | None = None + for attempt in range(max_retries): + try: + response = self._client.post(url, json=payload) + except Exception as exc: + last_exc = exc + time.sleep((2 ** attempt) * 0.5) + continue + + if response.status_code in (429, 503): + logger.warning( + "API reranker returned HTTP %s (attempt %d/%d), retrying...", + response.status_code, + attempt + 1, + max_retries, + ) + time.sleep((2 ** attempt) * 0.5) + continue + + response.raise_for_status() + data = response.json() + results = data.get("results", []) + scores: dict[int, float] = {} + for item in 
results: + local_idx = int(item["index"]) + orig_idx = docs[local_idx][0] + scores[orig_idx] = float(item["relevance_score"]) + return scores + + raise RuntimeError( + f"API reranker failed after {max_retries} attempts. Last error: {last_exc}" + ) diff --git a/codex-lens-v2/src/codexlens/rerank/base.py b/codex-lens-v2/src/codexlens/rerank/base.py new file mode 100644 index 00000000..5edaf6db --- /dev/null +++ b/codex-lens-v2/src/codexlens/rerank/base.py @@ -0,0 +1,8 @@ +from __future__ import annotations +from abc import ABC, abstractmethod + + +class BaseReranker(ABC): + @abstractmethod + def score_pairs(self, query: str, documents: list[str]) -> list[float]: + """Score (query, doc) pairs. Returns list of floats same length as documents.""" diff --git a/codex-lens-v2/src/codexlens/rerank/local.py b/codex-lens-v2/src/codexlens/rerank/local.py new file mode 100644 index 00000000..f14ea5cd --- /dev/null +++ b/codex-lens-v2/src/codexlens/rerank/local.py @@ -0,0 +1,25 @@ +from __future__ import annotations + +from codexlens.config import Config +from .base import BaseReranker + + +class FastEmbedReranker(BaseReranker): + """Local reranker backed by fastembed TextCrossEncoder.""" + + def __init__(self, config: Config) -> None: + self._config = config + self._model = None + + def _load(self) -> None: + if self._model is None: + from fastembed.rerank.cross_encoder import TextCrossEncoder + self._model = TextCrossEncoder(model_name=self._config.reranker_model) + + def score_pairs(self, query: str, documents: list[str]) -> list[float]: + self._load() + results = list(self._model.rerank(query, documents)) + scores = [0.0] * len(documents) + for r in results: + scores[r.index] = float(r.score) + return scores diff --git a/codex-lens-v2/src/codexlens/search/__init__.py b/codex-lens-v2/src/codexlens/search/__init__.py new file mode 100644 index 00000000..749b94b9 --- /dev/null +++ b/codex-lens-v2/src/codexlens/search/__init__.py @@ -0,0 +1,8 @@ +from .fts import FTSEngine +from 
.fusion import reciprocal_rank_fusion, detect_query_intent, QueryIntent, DEFAULT_WEIGHTS +from .pipeline import SearchPipeline, SearchResult + +__all__ = [ + "FTSEngine", "reciprocal_rank_fusion", "detect_query_intent", + "QueryIntent", "DEFAULT_WEIGHTS", "SearchPipeline", "SearchResult", +] diff --git a/codex-lens-v2/src/codexlens/search/fts.py b/codex-lens-v2/src/codexlens/search/fts.py new file mode 100644 index 00000000..fdfe4a4d --- /dev/null +++ b/codex-lens-v2/src/codexlens/search/fts.py @@ -0,0 +1,69 @@ +from __future__ import annotations + +import sqlite3 +from pathlib import Path + + +class FTSEngine: + def __init__(self, db_path: str | Path) -> None: + self._conn = sqlite3.connect(str(db_path), check_same_thread=False) + self._conn.execute( + "CREATE VIRTUAL TABLE IF NOT EXISTS docs " + "USING fts5(content, tokenize='porter unicode61')" + ) + self._conn.execute( + "CREATE TABLE IF NOT EXISTS docs_meta " + "(id INTEGER PRIMARY KEY, path TEXT)" + ) + self._conn.commit() + + def add_documents(self, docs: list[tuple[int, str, str]]) -> None: + """Add documents in batch. docs: list of (id, path, content).""" + if not docs: + return + self._conn.executemany( + "INSERT OR REPLACE INTO docs_meta (id, path) VALUES (?, ?)", + [(doc_id, path) for doc_id, path, content in docs], + ) + self._conn.executemany( + "INSERT OR REPLACE INTO docs (rowid, content) VALUES (?, ?)", + [(doc_id, content) for doc_id, path, content in docs], + ) + self._conn.commit() + + def exact_search(self, query: str, top_k: int = 50) -> list[tuple[int, float]]: + """FTS5 MATCH query, return (id, bm25_score) sorted by score descending.""" + try: + rows = self._conn.execute( + "SELECT rowid, bm25(docs) AS score FROM docs " + "WHERE docs MATCH ? 
ORDER BY score LIMIT ?", + (query, top_k), + ).fetchall() + except sqlite3.OperationalError: + return [] + # bm25 in SQLite FTS5 returns negative values (lower = better match) + # Negate so higher is better + return [(int(row[0]), -float(row[1])) for row in rows] + + def fuzzy_search(self, query: str, top_k: int = 50) -> list[tuple[int, float]]: + """Prefix search: each token + '*', return (id, score) sorted descending.""" + tokens = query.strip().split() + if not tokens: + return [] + prefix_query = " ".join(t + "*" for t in tokens) + try: + rows = self._conn.execute( + "SELECT rowid, bm25(docs) AS score FROM docs " + "WHERE docs MATCH ? ORDER BY score LIMIT ?", + (prefix_query, top_k), + ).fetchall() + except sqlite3.OperationalError: + return [] + return [(int(row[0]), -float(row[1])) for row in rows] + + def get_content(self, doc_id: int) -> str: + """Retrieve content for a doc_id.""" + row = self._conn.execute( + "SELECT content FROM docs WHERE rowid = ?", (doc_id,) + ).fetchone() + return row[0] if row else "" diff --git a/codex-lens-v2/src/codexlens/search/fusion.py b/codex-lens-v2/src/codexlens/search/fusion.py new file mode 100644 index 00000000..a51d7534 --- /dev/null +++ b/codex-lens-v2/src/codexlens/search/fusion.py @@ -0,0 +1,106 @@ +from __future__ import annotations + +import re +from enum import Enum + +DEFAULT_WEIGHTS: dict[str, float] = { + "exact": 0.25, + "fuzzy": 0.10, + "vector": 0.50, + "graph": 0.15, +} + +_CODE_CAMEL_RE = re.compile(r"[a-z][A-Z]") +_CODE_SNAKE_RE = re.compile(r"\b[a-z_]+_[a-z_]+\b") +_CODE_SYMBOLS_RE = re.compile(r"[.\[\](){}]|->|::") +_CODE_KEYWORDS_RE = re.compile(r"\b(import|def|class|return|from|async|await|lambda|yield)\b") +_QUESTION_WORDS_RE = re.compile(r"\b(how|what|why|when|where|which|who|does|do|is|are|can|should)\b", re.IGNORECASE) + + +class QueryIntent(Enum): + CODE_SYMBOL = "code_symbol" + NATURAL_LANGUAGE = "natural" + MIXED = "mixed" + + +def detect_query_intent(query: str) -> QueryIntent: + """Detect 
whether query is a code symbol, natural language, or mixed.""" + words = query.strip().split() + word_count = len(words) + + code_signals = 0 + natural_signals = 0 + + if _CODE_CAMEL_RE.search(query): + code_signals += 2 + if _CODE_SNAKE_RE.search(query): + code_signals += 2 + if _CODE_SYMBOLS_RE.search(query): + code_signals += 2 + if _CODE_KEYWORDS_RE.search(query): + code_signals += 2 + if "`" in query: + code_signals += 1 + if word_count < 4: + code_signals += 1 + + if _QUESTION_WORDS_RE.search(query): + natural_signals += 2 + if word_count > 5: + natural_signals += 2 + if code_signals == 0 and word_count >= 3: + natural_signals += 1 + + if code_signals >= 2 and natural_signals == 0: + return QueryIntent.CODE_SYMBOL + if natural_signals >= 2 and code_signals == 0: + return QueryIntent.NATURAL_LANGUAGE + if natural_signals > code_signals: + return QueryIntent.NATURAL_LANGUAGE + if code_signals > natural_signals: + return QueryIntent.CODE_SYMBOL + return QueryIntent.MIXED + + +def get_adaptive_weights(intent: QueryIntent, base: dict | None = None) -> dict[str, float]: + """Return weights adapted to query intent.""" + weights = dict(base or DEFAULT_WEIGHTS) + if intent == QueryIntent.CODE_SYMBOL: + weights["exact"] = 0.45 + weights["vector"] = 0.35 + elif intent == QueryIntent.NATURAL_LANGUAGE: + weights["vector"] = 0.65 + weights["exact"] = 0.15 + # MIXED: use weights as-is + return weights + + +def reciprocal_rank_fusion( + results: dict[str, list[tuple[int, float]]], + weights: dict[str, float] | None = None, + k: int = 60, +) -> list[tuple[int, float]]: + """Fuse ranked result lists using Reciprocal Rank Fusion. + + results: {source_name: [(doc_id, score), ...]} each list sorted desc by score. + weights: weight per source (defaults to equal weight across all sources). + k: RRF constant (default 60). + Returns sorted list of (doc_id, fused_score) descending.
+ """ + if not results: + return [] + + sources = list(results.keys()) + if weights is None: + equal_w = 1.0 / len(sources) + weights = {s: equal_w for s in sources} + + scores: dict[int, float] = {} + for source, ranked_list in results.items(): + w = weights.get(source, 0.0) + for rank, (doc_id, _) in enumerate(ranked_list, start=1): + scores[doc_id] = scores.get(doc_id, 0.0) + w * (1.0 / (k + rank)) + + return sorted(scores.items(), key=lambda x: x[1], reverse=True) diff --git a/codex-lens-v2/src/codexlens/search/pipeline.py b/codex-lens-v2/src/codexlens/search/pipeline.py new file mode 100644 index 00000000..21e2810e --- /dev/null +++ b/codex-lens-v2/src/codexlens/search/pipeline.py @@ -0,0 +1,163 @@ +from __future__ import annotations + +import logging +from concurrent.futures import ThreadPoolExecutor +from dataclasses import dataclass + +import numpy as np + +from ..config import Config +from ..core import ANNIndex, BinaryStore +from ..embed import BaseEmbedder +from ..rerank import BaseReranker +from .fts import FTSEngine +from .fusion import ( + DEFAULT_WEIGHTS, + detect_query_intent, + get_adaptive_weights, + reciprocal_rank_fusion, +) + +_log = logging.getLogger(__name__) + + +@dataclass +class SearchResult: + id: int + path: str + score: float + snippet: str = "" + + +class SearchPipeline: + def __init__( + self, + embedder: BaseEmbedder, + binary_store: BinaryStore, + ann_index: ANNIndex, + reranker: BaseReranker, + fts: FTSEngine, + config: Config, + ) -> None: + self._embedder = embedder + self._binary_store = binary_store + self._ann_index = ann_index + self._reranker = reranker + self._fts = fts + self._config = config + + # -- Helper: vector search (binary coarse + ANN fine) ----------------- + + def _vector_search( + self, query_vec: np.ndarray + ) -> list[tuple[int, float]]: + """Run binary coarse search then ANN fine search and intersect.""" + cfg = self._config + + # Binary coarse search -> candidate_ids set + candidate_ids_list, _ = 
self._binary_store.coarse_search( + query_vec, top_k=cfg.binary_top_k + ) + candidate_ids = set(candidate_ids_list) + + # ANN fine search on full index, then intersect with binary candidates + ann_ids, ann_scores = self._ann_index.fine_search( + query_vec, top_k=cfg.ann_top_k + ) + # Keep only results that appear in binary candidates (2-stage funnel) + vector_results: list[tuple[int, float]] = [ + (int(doc_id), float(score)) + for doc_id, score in zip(ann_ids, ann_scores) + if int(doc_id) in candidate_ids + ] + # Fall back to full ANN results if intersection is empty + if not vector_results: + vector_results = [ + (int(doc_id), float(score)) + for doc_id, score in zip(ann_ids, ann_scores) + ] + return vector_results + + # -- Helper: FTS search (exact + fuzzy) ------------------------------ + + def _fts_search( + self, query: str + ) -> tuple[list[tuple[int, float]], list[tuple[int, float]]]: + """Run exact and fuzzy full-text search.""" + cfg = self._config + exact_results = self._fts.exact_search(query, top_k=cfg.fts_top_k) + fuzzy_results = self._fts.fuzzy_search(query, top_k=cfg.fts_top_k) + return exact_results, fuzzy_results + + # -- Main search entry point ----------------------------------------- + + def search(self, query: str, top_k: int | None = None) -> list[SearchResult]: + cfg = self._config + final_top_k = top_k if top_k is not None else cfg.reranker_top_k + + # 1. Detect intent -> adaptive weights + intent = detect_query_intent(query) + weights = get_adaptive_weights(intent, cfg.fusion_weights) + + # 2. Embed query (BaseEmbedder exposes embed_single/embed_batch, not embed) + query_vec = self._embedder.embed_single(query) + + # 3.
Parallel vector + FTS search + vector_results: list[tuple[int, float]] = [] + exact_results: list[tuple[int, float]] = [] + fuzzy_results: list[tuple[int, float]] = [] + + with ThreadPoolExecutor(max_workers=2) as pool: + vec_future = pool.submit(self._vector_search, query_vec) + fts_future = pool.submit(self._fts_search, query) + + # Collect vector results + try: + vector_results = vec_future.result() + except Exception: + _log.warning("Vector search failed, using empty results", exc_info=True) + + # Collect FTS results + try: + exact_results, fuzzy_results = fts_future.result() + except Exception: + _log.warning("FTS search failed, using empty results", exc_info=True) + + # 4. RRF fusion + fusion_input: dict[str, list[tuple[int, float]]] = {} + if vector_results: + fusion_input["vector"] = vector_results + if exact_results: + fusion_input["exact"] = exact_results + if fuzzy_results: + fusion_input["fuzzy"] = fuzzy_results + + if not fusion_input: + return [] + + fused = reciprocal_rank_fusion(fusion_input, weights=weights, k=cfg.fusion_k) + + # 5. Rerank top candidates + rerank_ids = [doc_id for doc_id, _ in fused[:50]] + contents = [self._fts.get_content(doc_id) for doc_id in rerank_ids] + rerank_scores = self._reranker.score_pairs(query, contents) + + # 6. 
Sort by rerank score, build SearchResult list + ranked = sorted( + zip(rerank_ids, rerank_scores), key=lambda x: x[1], reverse=True + ) + + results: list[SearchResult] = [] + for doc_id, score in ranked[:final_top_k]: + path = self._fts._conn.execute( + "SELECT path FROM docs_meta WHERE id = ?", (doc_id,) + ).fetchone() + results.append( + SearchResult( + id=doc_id, + path=path[0] if path else "", + score=float(score), + snippet=self._fts.get_content(doc_id)[:200], + ) + ) + return results diff --git a/codex-lens-v2/tests/__init__.py b/codex-lens-v2/tests/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/codex-lens-v2/tests/integration/__init__.py b/codex-lens-v2/tests/integration/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/codex-lens-v2/tests/integration/conftest.py b/codex-lens-v2/tests/integration/conftest.py new file mode 100644 index 00000000..d8353a1c --- /dev/null +++ b/codex-lens-v2/tests/integration/conftest.py @@ -0,0 +1,108 @@ +import pytest +import numpy as np +import tempfile +from pathlib import Path + +from codexlens.config import Config +from codexlens.core import ANNIndex, BinaryStore +from codexlens.embed.base import BaseEmbedder +from codexlens.rerank.base import BaseReranker +from codexlens.search.fts import FTSEngine +from codexlens.search.pipeline import SearchPipeline + +# Test documents: 20 code snippets with id, path, content +TEST_DOCS = [ + (0, "auth.py", "def authenticate(user, password): return check_hash(password, user.hash)"), + (1, "auth.py", "def authorize(user, permission): return permission in user.roles"), + (2, "models.py", "class User: def __init__(self, name, email): self.name = name; self.email = email"), + (3, "models.py", "class Session: token = None; expires_at = None"), + (4, "middleware.py", "def auth_middleware(request): token = request.headers.get('Authorization')"), + (5, "utils.py", "def hash_password(password): import bcrypt; return bcrypt.hashpw(password)"), + (6, 
"config.py", "DATABASE_URL = os.environ.get('DATABASE_URL', 'sqlite:///db.sqlite3')"), + (7, "search.py", "def search_users(query): return User.objects.filter(name__icontains=query)"), + (8, "api.py", "def get_user(request, user_id): user = User.objects.get(id=user_id)"), + (9, "api.py", "def create_user(request): data = request.json(); user = User(**data)"), + (10, "tests.py", "def test_authenticate(): assert authenticate('admin', 'pass') is not None"), + (11, "tests.py", "def test_search(): results = search_users('alice'); assert len(results) > 0"), + (12, "router.py", "app.route('/users', methods=['GET'])(list_users)"), + (13, "router.py", "app.route('/login', methods=['POST'])(login_handler)"), + (14, "db.py", "def get_connection(): return sqlite3.connect(DATABASE_URL)"), + (15, "cache.py", "def cache_get(key): return redis_client.get(key)"), + (16, "cache.py", "def cache_set(key, value, ttl=3600): redis_client.setex(key, ttl, value)"), + (17, "errors.py", "class AuthError(Exception): status_code = 401"), + (18, "errors.py", "class NotFoundError(Exception): status_code = 404"), + (19, "validators.py", "def validate_email(email): return '@' in email and '.' 
in email.split('@')[1]"), +] + +DIM = 32 # Use small dim for fast tests + + +def make_stable_vec(doc_id: int, dim: int = DIM) -> np.ndarray: + """Generate a deterministic float32 vector for a given doc_id.""" + rng = np.random.default_rng(seed=doc_id) + vec = rng.standard_normal(dim).astype(np.float32) + vec /= np.linalg.norm(vec) + return vec + + +class MockEmbedder(BaseEmbedder): + """Returns stable deterministic vectors based on content hash.""" + + def embed_single(self, text: str) -> np.ndarray: + seed = hash(text) % (2**31) + rng = np.random.default_rng(seed=seed) + vec = rng.standard_normal(DIM).astype(np.float32) + vec /= np.linalg.norm(vec) + return vec + + def embed_batch(self, texts: list[str]) -> list[np.ndarray]: + return [self.embed_single(t) for t in texts] + + def embed(self, texts: list[str]) -> list[np.ndarray]: + """Called by SearchPipeline as self._embedder.embed([query])[0].""" + return self.embed_batch(texts) + + +class MockReranker(BaseReranker): + """Returns score based on simple keyword overlap.""" + + def score_pairs(self, query: str, documents: list[str]) -> list[float]: + query_words = set(query.lower().split()) + scores = [] + for doc in documents: + doc_words = set(doc.lower().split()) + overlap = len(query_words & doc_words) + scores.append(float(overlap) / max(len(query_words), 1)) + return scores + + +@pytest.fixture +def config(): + return Config.small() # hnsw_ef=50, hnsw_M=16, binary_top_k=50, ann_top_k=20, rerank_top_k=10 + + +@pytest.fixture +def search_pipeline(tmp_path, config): + """Build a full SearchPipeline with 20 test docs indexed.""" + embedder = MockEmbedder() + binary_store = BinaryStore(tmp_path / "binary", dim=DIM, config=config) + ann_index = ANNIndex(tmp_path / "ann.hnsw", dim=DIM, config=config) + fts = FTSEngine(tmp_path / "fts.db") + reranker = MockReranker() + + # Index all test docs + ids = np.array([d[0] for d in TEST_DOCS], dtype=np.int64) + vectors = np.array([embedder.embed_single(d[2]) for d in 
TEST_DOCS], dtype=np.float32) + + binary_store.add(ids, vectors) + ann_index.add(ids, vectors) + fts.add_documents(TEST_DOCS) + + return SearchPipeline( + embedder=embedder, + binary_store=binary_store, + ann_index=ann_index, + reranker=reranker, + fts=fts, + config=config, + ) diff --git a/codex-lens-v2/tests/integration/test_search_pipeline.py b/codex-lens-v2/tests/integration/test_search_pipeline.py new file mode 100644 index 00000000..6f59a612 --- /dev/null +++ b/codex-lens-v2/tests/integration/test_search_pipeline.py @@ -0,0 +1,44 @@ +"""Integration tests for SearchPipeline using real components and mock embedder/reranker.""" +from __future__ import annotations + + +def test_vector_search_returns_results(search_pipeline): + results = search_pipeline.search("authentication middleware") + assert len(results) > 0 + assert all(isinstance(r.score, float) for r in results) + + +def test_exact_keyword_search(search_pipeline): + results = search_pipeline.search("authenticate") + assert len(results) > 0 + result_ids = {r.id for r in results} + # Doc 0 and 10 both contain "authenticate" + assert result_ids & {0, 10}, f"Expected doc 0 or 10 in results, got {result_ids}" + + +def test_pipeline_top_k_limit(search_pipeline): + results = search_pipeline.search("user", top_k=5) + assert len(results) <= 5 + + +def test_search_result_fields_populated(search_pipeline): + results = search_pipeline.search("password") + assert len(results) > 0 + for r in results: + assert r.id >= 0 + assert r.score >= 0 + assert isinstance(r.path, str) + + +def test_empty_query_handled(search_pipeline): + results = search_pipeline.search("") + assert isinstance(results, list) # no exception + + +def test_different_queries_give_different_results(search_pipeline): + r1 = search_pipeline.search("authenticate user") + r2 = search_pipeline.search("cache redis") + # Results should differ (different top IDs or scores), unless both are empty + ids1 = [r.id for r in r1] + ids2 = [r.id for r in r2] + assert 
ids1 != ids2 or len(r1) == 0 diff --git a/codex-lens-v2/tests/unit/__init__.py b/codex-lens-v2/tests/unit/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/codex-lens-v2/tests/unit/test_config.py b/codex-lens-v2/tests/unit/test_config.py new file mode 100644 index 00000000..1c037975 --- /dev/null +++ b/codex-lens-v2/tests/unit/test_config.py @@ -0,0 +1,31 @@ +from codexlens.config import Config + + +def test_config_instantiates_no_args(): + cfg = Config() + assert cfg is not None + + +def test_defaults_hnsw_ef(): + cfg = Config.defaults() + assert cfg.hnsw_ef == 150 + + +def test_defaults_hnsw_M(): + cfg = Config.defaults() + assert cfg.hnsw_M == 32 + + +def test_small_hnsw_ef(): + cfg = Config.small() + assert cfg.hnsw_ef == 50 + + +def test_custom_instantiation(): + cfg = Config(hnsw_ef=100) + assert cfg.hnsw_ef == 100 + + +def test_fusion_weights_keys(): + cfg = Config() + assert set(cfg.fusion_weights.keys()) == {"exact", "fuzzy", "vector", "graph"} diff --git a/codex-lens-v2/tests/unit/test_core.py b/codex-lens-v2/tests/unit/test_core.py new file mode 100644 index 00000000..7cf2a513 --- /dev/null +++ b/codex-lens-v2/tests/unit/test_core.py @@ -0,0 +1,136 @@ +"""Unit tests for BinaryStore and ANNIndex (no fastembed required).""" +from __future__ import annotations + +import concurrent.futures +import tempfile +from pathlib import Path + +import numpy as np +import pytest + +from codexlens.config import Config +from codexlens.core import ANNIndex, BinaryStore + + +DIM = 32 +RNG = np.random.default_rng(42) + + +def make_vectors(n: int, dim: int = DIM) -> np.ndarray: + return RNG.standard_normal((n, dim)).astype(np.float32) + + +def make_ids(n: int, start: int = 0) -> np.ndarray: + return np.arange(start, start + n, dtype=np.int64) + + +# --------------------------------------------------------------------------- +# BinaryStore tests +# --------------------------------------------------------------------------- + + +class TestBinaryStore: + 
def test_binary_store_add_and_search(self, tmp_path: Path) -> None: + cfg = Config.small() + store = BinaryStore(tmp_path, DIM, cfg) + vecs = make_vectors(10) + ids = make_ids(10) + store.add(ids, vecs) + + assert len(store) == 10 + + top_k = 5 + ret_ids, ret_dists = store.coarse_search(vecs[0], top_k=top_k) + assert ret_ids.shape == (top_k,) + assert ret_dists.shape == (top_k,) + # distances are non-negative integers + assert (ret_dists >= 0).all() + + def test_binary_hamming_correctness(self, tmp_path: Path) -> None: + cfg = Config.small() + store = BinaryStore(tmp_path, DIM, cfg) + vecs = make_vectors(20) + ids = make_ids(20) + store.add(ids, vecs) + + # Query with the exact stored vector; it must be the top-1 result + query = vecs[7] + ret_ids, ret_dists = store.coarse_search(query, top_k=1) + assert ret_ids[0] == 7 + assert ret_dists[0] == 0 # Hamming distance to itself is 0 + + def test_binary_store_persist(self, tmp_path: Path) -> None: + cfg = Config.small() + store = BinaryStore(tmp_path, DIM, cfg) + vecs = make_vectors(15) + ids = make_ids(15) + store.add(ids, vecs) + store.save() + + # Load into a fresh instance + store2 = BinaryStore(tmp_path, DIM, cfg) + assert len(store2) == 15 + + query = vecs[3] + ret_ids, ret_dists = store2.coarse_search(query, top_k=1) + assert ret_ids[0] == 3 + assert ret_dists[0] == 0 + + +# --------------------------------------------------------------------------- +# ANNIndex tests +# --------------------------------------------------------------------------- + + +class TestANNIndex: + def test_ann_index_add_and_search(self, tmp_path: Path) -> None: + cfg = Config.small() + idx = ANNIndex(tmp_path, DIM, cfg) + vecs = make_vectors(50) + ids = make_ids(50) + idx.add(ids, vecs) + + assert len(idx) == 50 + + ret_ids, ret_dists = idx.fine_search(vecs[0], top_k=5) + assert len(ret_ids) == 5 + assert len(ret_dists) == 5 + + def test_ann_index_thread_safety(self, tmp_path: Path) -> None: + cfg = Config.small() + idx = 
ANNIndex(tmp_path, DIM, cfg) + vecs = make_vectors(50) + ids = make_ids(50) + idx.add(ids, vecs) + + query = vecs[0] + errors: list[Exception] = [] + + def search() -> None: + try: + idx.fine_search(query, top_k=3) + except Exception as exc: + errors.append(exc) + + with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool: + futures = [pool.submit(search) for _ in range(5)] + concurrent.futures.wait(futures) + + assert errors == [], f"Thread safety errors: {errors}" + + def test_ann_index_save_load(self, tmp_path: Path) -> None: + cfg = Config.small() + idx = ANNIndex(tmp_path, DIM, cfg) + vecs = make_vectors(30) + ids = make_ids(30) + idx.add(ids, vecs) + idx.save() + + # Load into a fresh instance + idx2 = ANNIndex(tmp_path, DIM, cfg) + idx2.load() + assert len(idx2) == 30 + + ret_ids, ret_dists = idx2.fine_search(vecs[10], top_k=1) + assert len(ret_ids) == 1 + assert ret_ids[0] == 10 diff --git a/codex-lens-v2/tests/unit/test_embed.py b/codex-lens-v2/tests/unit/test_embed.py new file mode 100644 index 00000000..03e28585 --- /dev/null +++ b/codex-lens-v2/tests/unit/test_embed.py @@ -0,0 +1,80 @@ +from __future__ import annotations + +import sys +import types +import unittest +from unittest.mock import MagicMock, patch + +import numpy as np + + +def _make_fastembed_mock(): + """Build a minimal fastembed stub so imports succeed without the real package.""" + fastembed_mod = types.ModuleType("fastembed") + fastembed_mod.TextEmbedding = MagicMock() + sys.modules.setdefault("fastembed", fastembed_mod) + return fastembed_mod + + +_make_fastembed_mock() + +from codexlens.config import Config # noqa: E402 +from codexlens.embed.base import BaseEmbedder # noqa: E402 +from codexlens.embed.local import EMBED_PROFILES, FastEmbedEmbedder # noqa: E402 + + +class TestEmbedSingle(unittest.TestCase): + def test_embed_single_returns_float32_ndarray(self): + config = Config() + embedder = FastEmbedEmbedder(config) + + mock_model = MagicMock() + mock_model.embed.return_value 
= iter([np.ones(384, dtype=np.float64)]) + + # Inject mock model directly to bypass lazy load (no real fastembed needed) + embedder._model = mock_model + result = embedder.embed_single("hello world") + + self.assertIsInstance(result, np.ndarray) + self.assertEqual(result.dtype, np.float32) + self.assertEqual(result.shape, (384,)) + + +class TestEmbedBatch(unittest.TestCase): + def test_embed_batch_returns_list(self): + config = Config() + embedder = FastEmbedEmbedder(config) + + vecs = [np.ones(384, dtype=np.float64) * i for i in range(3)] + mock_model = MagicMock() + mock_model.embed.return_value = iter(vecs) + + embedder._model = mock_model + result = embedder.embed_batch(["a", "b", "c"]) + + self.assertIsInstance(result, list) + self.assertEqual(len(result), 3) + for arr in result: + self.assertIsInstance(arr, np.ndarray) + self.assertEqual(arr.dtype, np.float32) + + +class TestEmbedProfiles(unittest.TestCase): + def test_embed_profiles_all_have_valid_keys(self): + expected_keys = {"small", "base", "large", "code"} + self.assertEqual(set(EMBED_PROFILES.keys()), expected_keys) + + def test_embed_profiles_model_ids_non_empty(self): + for key, model_id in EMBED_PROFILES.items(): + self.assertIsInstance(model_id, str, msg=f"{key} model id should be str") + self.assertTrue(len(model_id) > 0, msg=f"{key} model id should be non-empty") + + +class TestBaseEmbedderAbstract(unittest.TestCase): + def test_base_embedder_is_abstract(self): + with self.assertRaises(TypeError): + BaseEmbedder() # type: ignore[abstract] + + +if __name__ == "__main__": + unittest.main() diff --git a/codex-lens-v2/tests/unit/test_rerank.py b/codex-lens-v2/tests/unit/test_rerank.py new file mode 100644 index 00000000..36a3f4eb --- /dev/null +++ b/codex-lens-v2/tests/unit/test_rerank.py @@ -0,0 +1,179 @@ +from __future__ import annotations + +import types +from unittest.mock import MagicMock, patch + +import pytest + +from codexlens.config import Config +from codexlens.rerank.base import 
BaseReranker +from codexlens.rerank.local import FastEmbedReranker +from codexlens.rerank.api import APIReranker + + +# --------------------------------------------------------------------------- +# BaseReranker +# --------------------------------------------------------------------------- + +def test_base_reranker_is_abstract(): + with pytest.raises(TypeError): + BaseReranker() # type: ignore[abstract] + + +# --------------------------------------------------------------------------- +# FastEmbedReranker +# --------------------------------------------------------------------------- + +def _make_rerank_result(index: int, score: float) -> object: + obj = types.SimpleNamespace(index=index, score=score) + return obj + + +def test_local_reranker_score_pairs_length(): + config = Config() + reranker = FastEmbedReranker(config) + + mock_results = [ + _make_rerank_result(0, 0.9), + _make_rerank_result(1, 0.5), + _make_rerank_result(2, 0.1), + ] + + mock_model = MagicMock() + mock_model.rerank.return_value = iter(mock_results) + reranker._model = mock_model + + docs = ["doc0", "doc1", "doc2"] + scores = reranker.score_pairs("query", docs) + + assert len(scores) == 3 + + +def test_local_reranker_preserves_order(): + config = Config() + reranker = FastEmbedReranker(config) + + # rerank returns results in reverse order (index 2, 1, 0) + mock_results = [ + _make_rerank_result(2, 0.1), + _make_rerank_result(1, 0.5), + _make_rerank_result(0, 0.9), + ] + + mock_model = MagicMock() + mock_model.rerank.return_value = iter(mock_results) + reranker._model = mock_model + + docs = ["doc0", "doc1", "doc2"] + scores = reranker.score_pairs("query", docs) + + assert scores[0] == pytest.approx(0.9) + assert scores[1] == pytest.approx(0.5) + assert scores[2] == pytest.approx(0.1) + + +# --------------------------------------------------------------------------- +# APIReranker +# --------------------------------------------------------------------------- + +def 
_make_config(max_tokens_per_batch: int = 512) -> Config: + return Config( + reranker_api_url="https://api.example.com", + reranker_api_key="test-key", + reranker_api_model="test-model", + reranker_api_max_tokens_per_batch=max_tokens_per_batch, + ) + + +def test_api_reranker_batch_splitting(): + config = _make_config(max_tokens_per_batch=512) + + with patch("httpx.Client"): + reranker = APIReranker(config) + + # 10 docs, each ~200 tokens (800 chars) + docs = ["x" * 800] * 10 + batches = reranker._split_batches(docs, max_tokens=512) + + # Each doc is 200 tokens; batches should have at most 2 docs (200+200=400 <= 512, 400+200=600 > 512) + assert len(batches) > 1 + for batch in batches: + total = sum(len(text) // 4 for _, text in batch) + assert total <= 512 or len(batch) == 1 + + +def test_api_reranker_retry_on_429(): + config = _make_config() + + mock_429 = MagicMock() + mock_429.status_code = 429 + + mock_200 = MagicMock() + mock_200.status_code = 200 + mock_200.json.return_value = { + "results": [ + {"index": 0, "relevance_score": 0.8}, + {"index": 1, "relevance_score": 0.3}, + ] + } + mock_200.raise_for_status = MagicMock() + + with patch("httpx.Client") as mock_client_cls: + mock_client = MagicMock() + mock_client_cls.return_value = mock_client + mock_client.post.side_effect = [mock_429, mock_429, mock_200] + + reranker = APIReranker(config) + + with patch("time.sleep"): + result = reranker._call_api_with_retry( + "query", + [(0, "doc0"), (1, "doc1")], + max_retries=3, + ) + + assert mock_client.post.call_count == 3 + assert 0 in result + assert 1 in result + + +def test_api_reranker_merge_batches(): + config = _make_config(max_tokens_per_batch=100) + + # 4 docs of 50 tokens each (200 chars); the 100-token budget fits 2 docs per batch + # (50+50=100 <= 100, but 100+50 > 100), so splitting must yield 2 batches + docs = ["x" * 200] * 4 # 50 tokens each; 50+50=100 <= 100, 100+50=150 > 100 -> 2 per batch + + batch0_response = MagicMock() + batch0_response.status_code
= 200 + batch0_response.json.return_value = { + "results": [ + {"index": 0, "relevance_score": 0.9}, + {"index": 1, "relevance_score": 0.8}, + ] + } + batch0_response.raise_for_status = MagicMock() + + batch1_response = MagicMock() + batch1_response.status_code = 200 + batch1_response.json.return_value = { + "results": [ + {"index": 0, "relevance_score": 0.7}, + {"index": 1, "relevance_score": 0.6}, + ] + } + batch1_response.raise_for_status = MagicMock() + + with patch("httpx.Client") as mock_client_cls: + mock_client = MagicMock() + mock_client_cls.return_value = mock_client + mock_client.post.side_effect = [batch0_response, batch1_response] + + reranker = APIReranker(config) + + with patch("time.sleep"): + scores = reranker.score_pairs("query", docs) + + assert len(scores) == 4 + # All original indices should have scores + assert all(s > 0 for s in scores) diff --git a/codex-lens-v2/tests/unit/test_search.py b/codex-lens-v2/tests/unit/test_search.py new file mode 100644 index 00000000..1e1093d7 --- /dev/null +++ b/codex-lens-v2/tests/unit/test_search.py @@ -0,0 +1,156 @@ +"""Unit tests for search layer: FTSEngine, fusion, and SearchPipeline.""" +from __future__ import annotations + +from unittest.mock import MagicMock + +import pytest + +from codexlens.search.fts import FTSEngine +from codexlens.search.fusion import ( + DEFAULT_WEIGHTS, + QueryIntent, + detect_query_intent, + get_adaptive_weights, + reciprocal_rank_fusion, +) +from codexlens.search.pipeline import SearchPipeline, SearchResult +from codexlens.config import Config + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def make_fts(docs: list[tuple[int, str, str]] | None = None) -> FTSEngine: + """Create an in-memory FTSEngine and optionally add documents.""" + engine = FTSEngine(":memory:") + if docs: + engine.add_documents(docs) + return engine + + +# 
--------------------------------------------------------------------------- +# FTSEngine tests +# --------------------------------------------------------------------------- + +def test_fts_add_and_exact_search(): + docs = [ + (1, "a.py", "def authenticate user password login"), + (2, "b.py", "connect to database with credentials"), + (3, "c.py", "render template html response"), + ] + engine = make_fts(docs) + results = engine.exact_search("authenticate", top_k=10) + ids = [r[0] for r in results] + assert 1 in ids, "doc 1 should match 'authenticate'" + assert 2 not in ids or results[0][0] == 1 # doc 1 must rank higher + + +def test_fts_fuzzy_search_prefix(): + docs = [ + (10, "auth.py", "authentication token refresh"), + (11, "db.py", "database connection pool"), + (12, "ui.py", "render button click handler"), + ] + engine = make_fts(docs) + # Prefix 'auth' should match 'authentication' in doc 10 + results = engine.fuzzy_search("auth", top_k=10) + ids = [r[0] for r in results] + assert 10 in ids, "prefix 'auth' should match doc 10 with 'authentication'" + + +# --------------------------------------------------------------------------- +# RRF fusion tests +# --------------------------------------------------------------------------- + +def test_rrf_fusion_ordering(): + """When two sources agree on top-1, it should rank first in fused result.""" + source_a = [(1, 0.9), (2, 0.5), (3, 0.2)] + source_b = [(1, 0.8), (3, 0.6), (2, 0.1)] + fused = reciprocal_rank_fusion({"a": source_a, "b": source_b}) + assert fused[0][0] == 1, "doc 1 agreed top by both sources must rank first" + + +def test_rrf_equal_weight_default(): + """Calling with None weights should use DEFAULT_WEIGHTS shape (not crash).""" + source_exact = [(5, 1.0), (6, 0.8)] + source_vector = [(6, 0.9), (5, 0.7)] + # Should not raise and should return results + fused = reciprocal_rank_fusion( + {"exact": source_exact, "vector": source_vector}, + weights=None, + ) + assert len(fused) == 2 + ids = [r[0] for r in 
fused] + assert 5 in ids and 6 in ids + + +# --------------------------------------------------------------------------- +# detect_query_intent tests +# --------------------------------------------------------------------------- + +def test_detect_intent_code_symbol(): + assert detect_query_intent("def authenticate()") == QueryIntent.CODE_SYMBOL + + +def test_detect_intent_natural(): + assert detect_query_intent("how do I authenticate users") == QueryIntent.NATURAL_LANGUAGE + + +# --------------------------------------------------------------------------- +# SearchPipeline tests +# --------------------------------------------------------------------------- + +def _make_pipeline(fts: FTSEngine, top_k: int = 5) -> SearchPipeline: + """Build a SearchPipeline with mocked heavy components.""" + cfg = Config.small() + cfg.reranker_top_k = top_k + + embedder = MagicMock() + embedder.embed.return_value = [[0.1] * cfg.embed_dim] + + binary_store = MagicMock() + binary_store.coarse_search.return_value = ([1, 2, 3], None) + + ann_index = MagicMock() + ann_index.fine_search.return_value = ([1, 2, 3], [0.9, 0.8, 0.7]) + + reranker = MagicMock() + # Return a score for each content string passed + reranker.score_pairs.side_effect = lambda q, contents: [0.9 - i * 0.1 for i in range(len(contents))] + + return SearchPipeline( + embedder=embedder, + binary_store=binary_store, + ann_index=ann_index, + reranker=reranker, + fts=fts, + config=cfg, + ) + + +def test_pipeline_search_returns_results(): + docs = [ + (1, "a.py", "test content alpha"), + (2, "b.py", "test content beta"), + (3, "c.py", "test content gamma"), + ] + fts = make_fts(docs) + pipeline = _make_pipeline(fts) + results = pipeline.search("test") + assert len(results) > 0 + assert all(isinstance(r, SearchResult) for r in results) + + +def test_pipeline_top_k_limit(): + docs = [ + (1, "a.py", "hello world one"), + (2, "b.py", "hello world two"), + (3, "c.py", "hello world three"), + (4, "d.py", "hello world four"), + (5, 
"e.py", "hello world five"), + ] + fts = make_fts(docs) + pipeline = _make_pipeline(fts, top_k=2) + results = pipeline.search("hello", top_k=2) + assert len(results) <= 2, "pipeline must respect top_k limit"
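The RRF fusion tests above assume the standard reciprocal-rank-fusion behavior (a document ranked first by every source tops the fused list). A minimal standalone sketch of that formula, assuming the conventional `1 / (k + rank)` scoring with `k = 60` — the actual `codexlens.search.fusion` weighting may differ:

```python
# Standalone reciprocal-rank-fusion sketch (not the codexlens implementation).
# Each source maps to a ranked list of (doc_id, score); a doc's fused score is
# the weighted sum of 1 / (k + rank) over every source that returned it.

def reciprocal_rank_fusion(sources, weights=None, k=60):
    weights = weights or {name: 1.0 for name in sources}
    fused: dict[int, float] = {}
    for name, ranked in sources.items():
        w = weights.get(name, 1.0)
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + w / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


source_a = [(1, 0.9), (2, 0.5), (3, 0.2)]
source_b = [(1, 0.8), (3, 0.6), (2, 0.1)]
fused = reciprocal_rank_fusion({"a": source_a, "b": source_b})
# doc 1 is top-ranked by both sources, so it leads the fused ranking
```

Note that RRF only consumes ranks, not the raw scores, which is why heterogeneous sources (BM25 text scores vs. cosine similarities) can be fused without normalization.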
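The `BinaryStore` tests rely on sign-quantized vectors searched by Hamming distance, where a stored vector queried against itself must come back at distance 0. A standalone sketch of that coarse-search idea, using hypothetical helper names (`quantize`, `hamming_search`) rather than the real `BinaryStore` API:

```python
import numpy as np

# Sign-quantize float vectors to packed bit codes, then rank stored codes by
# Hamming distance (popcount of XOR) against a query code.

def quantize(vectors: np.ndarray) -> np.ndarray:
    # One bit per dimension: 1 where the component is positive
    return np.packbits(vectors > 0, axis=-1)


def hamming_search(codes: np.ndarray, query_code: np.ndarray, top_k: int):
    # XOR then popcount gives the Hamming distance to each stored code
    dists = np.unpackbits(codes ^ query_code, axis=-1).sum(axis=-1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]


vecs = np.random.default_rng(0).standard_normal((10, 32)).astype(np.float32)
codes = quantize(vecs)  # shape (10, 4): 32 bits packed into 4 bytes per vector
ids, dists = hamming_search(codes, quantize(vecs[7]), top_k=1)
# the stored vector's own code is nearest, at Hamming distance 0
```

This mirrors what `test_binary_hamming_correctness` asserts: an exact stored vector must be the top-1 result with distance 0, since sign quantization is deterministic and XOR of identical codes is all zeros.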