Implement search and reranking functionality with FTS and embedding support

- Added BaseReranker abstract class for defining reranking interfaces.
- Implemented FastEmbedReranker using fastembed's TextCrossEncoder for scoring document-query pairs.
- Introduced FTSEngine for full-text search capabilities using SQLite FTS5.
- Developed SearchPipeline to integrate embedding, binary search, ANN indexing, FTS, and reranking.
- Added fusion methods for combining results from different search strategies using Reciprocal Rank Fusion.
- Created unit and integration tests for the new search and reranking components.
- Established configuration management for search parameters and models.
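The Reciprocal Rank Fusion step can be sketched as follows (an illustrative sketch only, not the code in this commit; the function name `rrfFuse` and the conventional constant k = 60 are assumptions based on the standard RRF formulation):

```javascript
// Reciprocal Rank Fusion: each document scores sum(1 / (k + rank_i)) over
// every ranked list it appears in; k = 60 is the conventional default.
function rrfFuse(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Fuse hypothetical FTS and ANN result lists.
const fused = rrfFuse([
  ["doc3", "doc1", "doc2"], // full-text search ranking
  ["doc1", "doc2", "doc4"], // vector search ranking
]);
console.log(fused); // → ["doc1", "doc2", "doc3", "doc4"]
```

Documents ranked well by both strategies (doc1, doc2) float above documents seen by only one list.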
This commit is contained in:
catlog22
2026-03-16 23:03:17 +08:00
parent 5a4b18d9b1
commit de4158597b
41 changed files with 2655 additions and 1848 deletions

View File

@@ -1,76 +0,0 @@
## Context Acquisition (MCP Tools Priority)
**For task context gathering and analysis, ALWAYS prefer MCP tools**:
1. **mcp__ace-tool__search_context** - HIGHEST PRIORITY for code discovery
- Semantic search with real-time codebase index
- Use for: finding implementations, understanding architecture, locating patterns
- Example: `mcp__ace-tool__search_context(project_root_path="/path", query="authentication logic")`
2. **smart_search** - Fallback for structured search
- Use `smart_search(query="...")` for keyword/regex search
- Use `smart_search(action="find_files", pattern="*.ts")` for file discovery
- Supports modes: `auto`, `hybrid`, `exact`, `ripgrep`
3. **read_file** - Batch file reading
- Read multiple files in parallel: `read_file(path="file1.ts")`, `read_file(path="file2.ts")`
- Supports glob patterns: `read_file(path="src/**/*.config.ts")`
**Priority Order**:
```
ACE search_context (semantic) → smart_search (structured) → read_file (batch read) → shell commands (fallback)
```
**NEVER** use shell commands (`cat`, `find`, `grep`) when MCP tools are available.
### read_file - Read File Contents
**When**: Read files found by smart_search
**How**:
```javascript
read_file(path="/path/to/file.ts") // Single file
read_file(path="/src/**/*.config.ts") // Pattern matching
```
---
### edit_file - Modify Files
**When**: Built-in Edit tool fails or need advanced features
**How**:
```javascript
edit_file(path="/file.ts", old_string="...", new_string="...", mode="update")
edit_file(path="/file.ts", line=10, content="...", mode="insert_after")
```
**Modes**: `update` (replace text), `insert_after`, `insert_before`, `delete_line`
---
### write_file - Create/Overwrite Files
**When**: Create new files or completely replace content
**How**:
```javascript
write_file(path="/new-file.ts", content="...")
```
---
### Exa - External Search
**When**: Find documentation/examples outside codebase
**How**:
```javascript
mcp__exa__search(query="React hooks 2025 documentation")
mcp__exa__search(query="FastAPI auth example", numResults=10)
mcp__exa__search(query="latest API docs", livecrawl="always")
```
**Parameters**:
- `query` (required): Search query string
- `numResults` (optional): Number of results to return (default: 5)
- `livecrawl` (optional): `"always"` or `"fallback"` for live crawling

View File

@@ -1,64 +0,0 @@
# File Modification
Before modifying files, always:
- Try built-in Edit tool first
- Escalate to MCP tools when built-ins fail
- Use write_file only as last resort
## MCP Tools Usage
### edit_file - Modify Files
**When**: Built-in Edit fails, need dry-run preview, or need line-based operations
**How**:
```javascript
edit_file(path="/file.ts", oldText="old", newText="new") // Replace text
edit_file(path="/file.ts", oldText="old", newText="new", dryRun=true) // Preview diff
edit_file(path="/file.ts", oldText="old", newText="new", replaceAll=true) // Replace all
edit_file(path="/file.ts", mode="line", operation="insert_after", line=10, text="new line")
edit_file(path="/file.ts", mode="line", operation="delete", line=5, end_line=8)
```
**Modes**: `update` (replace text, default), `line` (line-based operations)
**Operations** (line mode): `insert_before`, `insert_after`, `replace`, `delete`
---
### write_file - Create/Overwrite Files
**When**: Create new files, completely replace content, or edit_file still fails
**How**:
```javascript
write_file(path="/new-file.ts", content="file content here")
write_file(path="/existing.ts", content="...", backup=true) // Create backup first
```
---
## Priority Logic
> **Note**: Search priority is defined in `context-tools.md` - smart_search has HIGHEST PRIORITY for all discovery tasks.
**Search & Discovery** (defer to context-tools.md):
1. **smart_search FIRST** for any code/file discovery
2. Built-in Grep only for single-file exact line search (location already confirmed)
3. Exa for external/public knowledge
**File Reading**:
1. Unknown location → **smart_search first**, then Read
2. Known confirmed file → Built-in Read directly
3. Pattern matching → smart_search (action="find_files")
**File Editing**:
1. Always try built-in Edit first
2. Fails 1+ times → edit_file (MCP)
3. Still fails → write_file (MCP)
## Decision Triggers
**Search tasks** → Always start with smart_search (per context-tools.md)
**Known file edits** → Start with built-in Edit, escalate to MCP if fails
**External knowledge** → Use Exa

View File

@@ -1,336 +0,0 @@
# Review Directory Specification
## Overview
Unified directory structure for all review commands (session-based and module-based) within workflow sessions.
## Core Principles
1. **Session-Based**: All reviews run within a workflow session context
2. **Unified Structure**: Same directory layout for all review types
3. **Type Differentiation**: Review type indicated by metadata, not directory structure
4. **Progressive Creation**: Directories created on-demand during review execution
5. **Archive Support**: Reviews archived with their parent session
## Directory Structure
### Base Location
```
.workflow/active/WFS-{session-id}/.review/
```
### Complete Structure
```
.workflow/active/WFS-{session-id}/.review/
├── review-state.json # Review orchestrator state machine
├── review-progress.json # Real-time progress for dashboard polling
├── review-metadata.json # Review configuration and scope
├── dimensions/ # Per-dimension analysis results
│ ├── security.json
│ ├── architecture.json
│ ├── quality.json
│ ├── action-items.json
│ ├── performance.json
│ ├── maintainability.json
│ └── best-practices.json
├── iterations/ # Deep-dive iteration results
│ ├── iteration-1-finding-{uuid}.json
│ ├── iteration-2-finding-{uuid}.json
│ └── ...
├── reports/ # Human-readable reports
│ ├── security-analysis.md
│ ├── security-cli-output.txt
│ ├── architecture-analysis.md
│ ├── architecture-cli-output.txt
│ ├── ...
│ ├── deep-dive-1-{uuid}.md
│ └── deep-dive-2-{uuid}.md
├── REVIEW-SUMMARY.md # Final consolidated summary
└── dashboard.html # Interactive review dashboard
```
## Review Metadata Schema
**File**: `review-metadata.json`
```json
{
"review_id": "review-20250125-143022",
"review_type": "module|session",
"session_id": "WFS-auth-system",
"created_at": "2025-01-25T14:30:22Z",
"scope": {
"type": "module|session",
"module_scope": {
"target_pattern": "src/auth/**",
"resolved_files": [
"src/auth/service.ts",
"src/auth/validator.ts"
],
"file_count": 2
},
"session_scope": {
"commit_range": "abc123..def456",
"changed_files": [
"src/auth/service.ts",
"src/payment/processor.ts"
],
"file_count": 2
}
},
"dimensions": ["security", "architecture", "quality", "action-items", "performance", "maintainability", "best-practices"],
"max_iterations": 3,
"cli_tools": {
"primary": "gemini",
"fallback": ["qwen", "codex"]
}
}
```
## Review State Schema
**File**: `review-state.json`
```json
{
"review_id": "review-20250125-143022",
"phase": "init|parallel|aggregate|iterate|complete",
"current_iteration": 1,
"dimensions_status": {
"security": "pending|in_progress|completed|failed",
"architecture": "completed",
"quality": "in_progress",
"action-items": "pending",
"performance": "pending",
"maintainability": "pending",
"best-practices": "pending"
},
"severity_distribution": {
"critical": 2,
"high": 5,
"medium": 12,
"low": 8
},
"critical_files": [
"src/auth/service.ts",
"src/payment/processor.ts"
],
"iterations": [
{
"iteration": 1,
"findings_selected": ["uuid-1", "uuid-2", "uuid-3"],
"completed_at": "2025-01-25T15:30:00Z"
}
],
"completion_criteria": {
"critical_count": 0,
"high_count_threshold": 5,
"max_iterations": 3
},
"next_action": "execute_parallel_reviews|aggregate_findings|execute_deep_dive|generate_final_report|complete"
}
```
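The `completion_criteria` and `next_action` fields imply a small transition check after each iteration. A sketch of one possible policy (the helper name `shouldIterate` is illustrative, not part of the schema):

```javascript
// Decide whether another deep-dive iteration is warranted, based on the
// review-state.json fields above. One possible policy, for illustration.
function shouldIterate(state) {
  const { critical = 0, high = 0 } = state.severity_distribution;
  const { high_count_threshold, max_iterations } = state.completion_criteria;
  if (state.current_iteration >= max_iterations) return false; // budget spent
  if (critical > 0) return true;            // critical findings always iterate
  return high > high_count_threshold;       // too many high-severity findings
}

const state = {
  current_iteration: 1,
  severity_distribution: { critical: 2, high: 5, medium: 12, low: 8 },
  completion_criteria: { critical_count: 0, high_count_threshold: 5, max_iterations: 3 },
};
console.log(shouldIterate(state) ? "execute_deep_dive" : "generate_final_report");
// → execute_deep_dive (2 critical findings remain)
```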
## Session Integration
### Session Discovery
**review-session-cycle** (auto-discover):
```bash
# Auto-detect active session
/workflow:review-session-cycle
# Or specify session explicitly
/workflow:review-session-cycle WFS-auth-system
```
**review-module-cycle** (require session):
```bash
# Must have active session or specify one
/workflow:review-module-cycle src/auth/** --session WFS-auth-system
# Or use active session
/workflow:review-module-cycle src/auth/**
```
### Session Creation Logic
**For review-module-cycle**:
1. **Check Active Session**: Search `.workflow/active/WFS-*`
2. **If Found**: Use active session's `.review/` directory
3. **If Not Found**:
- **Option A** (Recommended): Prompt user to create session first
- **Option B**: Auto-create review-only session: `WFS-review-{pattern-hash}`
**Recommended Flow**:
```bash
# Step 1: Start session
/workflow:session:start --new "Review auth module"
# Creates: .workflow/active/WFS-review-auth-module/
# Step 2: Run review
/workflow:review-module-cycle src/auth/**
# Creates: .workflow/active/WFS-review-auth-module/.review/
```
## Command Phase 1 Requirements
### Both Commands Must:
1. **Session Discovery**:
```javascript
// Check for active session
const sessions = Glob('.workflow/active/WFS-*');
if (sessions.length === 0) {
// Prompt user to create session first
error("No active session found. Please run /workflow:session:start first");
}
const sessionId = sessions[0].match(/WFS-[^/]+/)[0];
```
2. **Create .review/ Structure**:
```javascript
const reviewDir = `.workflow/active/${sessionId}/.review/`;
// Create directory structure
Bash(`mkdir -p ${reviewDir}/dimensions`);
Bash(`mkdir -p ${reviewDir}/iterations`);
Bash(`mkdir -p ${reviewDir}/reports`);
```
3. **Initialize Metadata**:
```javascript
// Write review-metadata.json
Write(`${reviewDir}/review-metadata.json`, JSON.stringify({
review_id: `review-${timestamp}`,
review_type: "module|session",
session_id: sessionId,
created_at: new Date().toISOString(),
scope: {...},
dimensions: [...],
max_iterations: 3,
cli_tools: {...}
}));
// Write review-state.json
Write(`${reviewDir}/review-state.json`, JSON.stringify({
review_id: `review-${timestamp}`,
phase: "init",
current_iteration: 0,
dimensions_status: {},
severity_distribution: {},
critical_files: [],
iterations: [],
completion_criteria: {},
next_action: "execute_parallel_reviews"
}));
```
4. **Generate Dashboard**:
```javascript
const template = Read('~/.claude/templates/review-cycle-dashboard.html');
const dashboard = template
.replace('{{SESSION_ID}}', sessionId)
.replace('{{REVIEW_TYPE}}', reviewType)
.replace('{{REVIEW_DIR}}', reviewDir);
Write(`${reviewDir}/dashboard.html`, dashboard);
// Output to user
console.log(`📊 Review Dashboard: file://${absolutePath(reviewDir)}/dashboard.html`);
console.log(`📂 Review Output: ${reviewDir}`);
```
## Archive Strategy
### On Session Completion
When `/workflow:session:complete` is called:
1. **Preserve Review Directory**:
```javascript
// Move entire session including .review/
Bash(`mv .workflow/active/${sessionId} .workflow/archives/${sessionId}`);
```
2. **Review Archive Structure**:
```
.workflow/archives/WFS-auth-system/
├── workflow-session.json
├── IMPL_PLAN.md
├── TODO_LIST.md
├── .task/
├── .summaries/
└── .review/ # Review results preserved
├── review-metadata.json
├── REVIEW-SUMMARY.md
└── dashboard.html
```
3. **Access Archived Reviews**:
```bash
# Open archived dashboard
start .workflow/archives/WFS-auth-system/.review/dashboard.html
```
## Benefits
### 1. Unified Structure
- Same directory layout for all review types
- Consistent file naming and schemas
- Easier maintenance and tooling
### 2. Session Integration
- Review history tracked with implementation
- Easy correlation between code changes and reviews
- Simplified archiving and retrieval
### 3. Progressive Creation
- Directories created only when needed
- No upfront overhead
- Clean session initialization
### 4. Type Flexibility
- Module-based and session-based reviews in same structure
- Type indicated by metadata, not directory layout
- Easy to add new review types
### 5. Dashboard Consistency
- Same dashboard template for both types
- Unified progress tracking
- Consistent user experience
## Migration Path
### For Existing Commands
**review-session-cycle**:
1. Change output from `.workflow/.reviews/session-{id}/` to `.workflow/active/{session-id}/.review/`
2. Update Phase 1 to use session discovery
3. Add review-metadata.json creation
**review-module-cycle**:
1. Add session requirement (or auto-create)
2. Change output from `.workflow/.reviews/module-{hash}/` to `.workflow/active/{session-id}/.review/`
3. Update Phase 1 to use session discovery
4. Add review-metadata.json creation
### Backward Compatibility
**For existing standalone reviews** in `.workflow/.reviews/`:
- Keep for reference
- Document migration in README
- Provide migration script if needed
## Implementation Checklist
- [ ] Update workflow-architecture.md with .review/ structure
- [ ] Update review-session-cycle.md command specification
- [ ] Update review-module-cycle.md command specification
- [ ] Update review-cycle-dashboard.html template
- [ ] Create review-metadata.json schema validation
- [ ] Update /workflow:session:complete to preserve .review/
- [ ] Update documentation examples
- [ ] Test both review types with new structure
- [ ] Validate dashboard compatibility
- [ ] Document migration path for existing reviews

View File

@@ -1,214 +0,0 @@
# Task System Core Reference
## Overview
Task commands provide single-execution workflow capabilities with full context awareness, hierarchical organization, and agent orchestration.
## Task JSON Schema
All task files use this simplified 5-field schema:
```json
{
"id": "IMPL-1.2",
"title": "Implement JWT authentication",
"status": "pending|active|completed|blocked|container",
"meta": {
"type": "feature|bugfix|refactor|test-gen|test-fix|docs",
"agent": "@code-developer|@action-planning-agent|@test-fix-agent|@universal-executor"
},
"context": {
"requirements": ["JWT authentication", "OAuth2 support"],
"focus_paths": ["src/auth", "tests/auth", "config/auth.json"],
"acceptance": ["JWT validation works", "OAuth flow complete"],
"parent": "IMPL-1",
"depends_on": ["IMPL-1.1"],
"inherited": {
"from": "IMPL-1",
"context": ["Authentication system design completed"]
},
"shared_context": {
"auth_strategy": "JWT with refresh tokens"
}
},
"flow_control": {
"pre_analysis": [
{
"step": "gather_context",
"action": "Read dependency summaries",
"command": "bash(cat .workflow/*/summaries/IMPL-1.1-summary.md)",
"output_to": "auth_design_context",
"on_error": "skip_optional"
}
],
"implementation_approach": [
{
"step": 1,
"title": "Implement JWT authentication system",
"description": "Implement comprehensive JWT authentication system with token generation, validation, and refresh logic",
"modification_points": ["Add JWT token generation", "Implement token validation middleware", "Create refresh token logic"],
"logic_flow": ["User login request → validate credentials", "Generate JWT access and refresh tokens", "Store refresh token securely", "Return tokens to client"],
"depends_on": [],
"output": "jwt_implementation"
}
],
"target_files": [
"src/auth/login.ts:handleLogin:75-120",
"src/middleware/auth.ts:validateToken",
"src/auth/PasswordReset.ts"
]
}
}
```
## Field Structure Details
### focus_paths Field (within context)
**Purpose**: Specifies concrete project paths relevant to task implementation
**Format**:
- **Array of strings**: `["folder1", "folder2", "specific_file.ts"]`
- **Concrete paths**: Use actual directory/file names without wildcards
- **Mixed types**: Can include both directories and specific files
- **Relative paths**: From project root (e.g., `src/auth`, not `./src/auth`)
**Examples**:
```json
// Authentication system task
"focus_paths": ["src/auth", "tests/auth", "config/auth.json", "src/middleware/auth.ts"]
// UI component task
"focus_paths": ["src/components/Button", "src/styles", "tests/components"]
```
### flow_control Field Structure
**Purpose**: Universal process manager for task execution
**Components**:
- **pre_analysis**: Array of sequential process steps
- **implementation_approach**: Task execution strategy
- **target_files**: Files to modify/create - existing files in `file:function:lines` format, new files as `file` only
**Step Structure**:
```json
{
"step": "gather_context",
"action": "Human-readable description",
"command": "bash(executable command with [variables])",
"output_to": "variable_name",
"on_error": "skip_optional|fail|retry_once|manual_intervention"
}
```
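A `target_files` entry such as `src/auth/login.ts:handleLogin:75-120` can be decomposed with a small parser (an illustrative sketch; `parseTargetFile` is not part of the task system):

```javascript
// Parse "file[:function[:start-end]]" entries from target_files.
function parseTargetFile(entry) {
  const [file, fn, lines] = entry.split(":");
  const result = { file };
  if (fn) result.function = fn;
  if (lines) {
    const [start, end] = lines.split("-").map(Number);
    result.range = { start, end };
  }
  return result;
}

console.log(parseTargetFile("src/auth/login.ts:handleLogin:75-120"));
// → { file: 'src/auth/login.ts', function: 'handleLogin', range: { start: 75, end: 120 } }
console.log(parseTargetFile("src/auth/PasswordReset.ts"));
// → { file: 'src/auth/PasswordReset.ts' }  (new file: path only)
```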
## Hierarchical System
### Task Hierarchy Rules
- **Format**: IMPL-N (main), IMPL-N.M (subtasks) - uppercase required
- **Maximum Depth**: 2 levels only
- **10-Task Limit**: Hard limit enforced across all tasks
- **Container Tasks**: Parents with subtasks (not executable)
- **Leaf Tasks**: No subtasks (executable)
- **File Cohesion**: Related files must stay in same task
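The format and depth rules above reduce to a single pattern check (illustrative sketch; `isValidTaskId` is not part of the system):

```javascript
// IMPL-N (main) or IMPL-N.M (subtask): uppercase prefix, max depth 2.
function isValidTaskId(id) {
  return /^IMPL-\d+(\.\d+)?$/.test(id);
}

console.log(isValidTaskId("IMPL-1"));     // → true  (main task)
console.log(isValidTaskId("IMPL-1.2"));   // → true  (subtask)
console.log(isValidTaskId("IMPL-1.2.3")); // → false (exceeds max depth)
console.log(isValidTaskId("impl-1"));     // → false (lowercase rejected)
```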
### Task Complexity Classifications
- **Simple**: ≤5 tasks, single-level hierarchy, direct execution
- **Medium**: 6-10 tasks, two-level hierarchy, context coordination
- **Over-scope**: >10 tasks requires project re-scoping into iterations
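These thresholds amount to a simple classifier (sketch only; the name `classifyComplexity` is illustrative):

```javascript
// Map total task count to the complexity tiers defined above.
function classifyComplexity(taskCount) {
  if (taskCount > 10) return "over-scope"; // hard limit exceeded: re-scope
  if (taskCount >= 6) return "medium";     // two-level hierarchy expected
  return "simple";                         // direct execution
}

console.log(classifyComplexity(4));  // → simple
console.log(classifyComplexity(8));  // → medium
console.log(classifyComplexity(12)); // → over-scope
```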
### Complexity Assessment Rules
- **Creation**: System evaluates and assigns complexity
- **10-task limit**: Hard limit enforced - exceeding requires re-scoping
- **Execution**: Can upgrade (Simple→Medium→Over-scope), triggers re-scoping
- **Override**: Users can manually specify complexity within 10-task limit
### Status Rules
- **pending**: Ready for execution
- **active**: Currently being executed
- **completed**: Successfully finished
- **blocked**: Waiting for dependencies
- **container**: Has subtasks (parent only)
## Session Integration
### Active Session Detection
```bash
# Check for active session in sessions directory
active_session=$(find .workflow/active/ -name 'WFS-*' -type d 2>/dev/null | head -1)
```
### Workflow Context Inheritance
Tasks inherit from:
1. `workflow-session.json` - Session metadata
2. Parent task context (for subtasks)
3. `IMPL_PLAN.md` - Planning document
### File Locations
- **Task JSON**: `.workflow/active/WFS-[topic]/.task/IMPL-*.json` (uppercase required)
- **Session State**: `.workflow/active/WFS-[topic]/workflow-session.json`
- **Planning Doc**: `.workflow/active/WFS-[topic]/IMPL_PLAN.md`
- **Progress**: `.workflow/active/WFS-[topic]/TODO_LIST.md`
## Agent Mapping
### Automatic Agent Selection
- **@code-developer**: Implementation tasks, coding, test writing
- **@action-planning-agent**: Design, architecture planning
- **@test-fix-agent**: Test execution, failure diagnosis, code fixing
- **@universal-executor**: Optional manual review (only when explicitly requested)
### Agent Context Filtering
Each agent receives tailored context:
- **@code-developer**: Complete implementation details, test requirements
- **@action-planning-agent**: High-level requirements, risks, architecture
- **@test-fix-agent**: Test execution, failure diagnosis, code fixing
- **@universal-executor**: Quality standards, security considerations (when requested)
## Deprecated Fields
### Legacy paths Field
**Deprecated**: The semicolon-separated `paths` field has been replaced by `context.focus_paths` array.
**Old Format** (no longer used):
```json
"paths": "src/auth;tests/auth;config/auth.json;src/middleware/auth.ts"
```
**New Format** (use this instead):
```json
"context": {
"focus_paths": ["src/auth", "tests/auth", "config/auth.json", "src/middleware/auth.ts"]
}
```
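Migrating the legacy field is a one-line transformation (illustrative):

```javascript
// Convert a deprecated semicolon-separated paths string into focus_paths.
const legacy = "src/auth;tests/auth;config/auth.json;src/middleware/auth.ts";
const focus_paths = legacy.split(";").filter(Boolean);
console.log(focus_paths);
// → [ 'src/auth', 'tests/auth', 'config/auth.json', 'src/middleware/auth.ts' ]
```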
## Validation Rules
### Pre-execution Checks
1. Task exists and is valid JSON
2. Task status allows operation
3. Dependencies are met
4. Active workflow session exists
5. All core fields present (id, title, status, meta, context, flow_control)
6. Total task count ≤ 10 (hard limit)
7. File cohesion maintained in focus_paths
### Hierarchy Validation
- Parent-child relationships valid
- Maximum depth not exceeded
- Container tasks have subtasks
- No circular dependencies
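The "no circular dependencies" rule can be checked with a depth-first search over `depends_on` edges (an illustrative sketch; `hasCircularDeps` is not part of the validator):

```javascript
// Detect cycles in depends_on references using DFS with a visitation state:
// 1 = currently on the stack, 2 = fully explored.
function hasCircularDeps(tasks) {
  const deps = new Map(tasks.map(t => [t.id, t.context?.depends_on ?? []]));
  const state = new Map();
  const visit = (id) => {
    if (state.get(id) === 1) return true;  // back edge: cycle found
    if (state.get(id) === 2) return false; // already cleared
    state.set(id, 1);
    for (const dep of deps.get(id) ?? []) {
      if (visit(dep)) return true;
    }
    state.set(id, 2);
    return false;
  };
  return tasks.some(t => visit(t.id));
}

const cyclic = [
  { id: "IMPL-1", context: { depends_on: ["IMPL-2"] } },
  { id: "IMPL-2", context: { depends_on: ["IMPL-1"] } },
];
console.log(hasCircularDeps(cyclic)); // → true
```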
## Error Handling Patterns
### Common Errors
- **Task not found**: Check ID format and session
- **Invalid status**: Verify task can be operated on
- **Missing session**: Ensure active workflow exists
- **Max depth exceeded**: Restructure hierarchy
- **Missing implementation**: Complete required fields
### Recovery Strategies
- Session validation with clear guidance
- Automatic ID correction suggestions
- Implementation field completion prompts
- Hierarchy restructuring options

View File

@@ -1,216 +0,0 @@
# Tool Strategy - When to Use What
> **Focus**: Decision triggers and selection logic, NOT syntax (already registered with Claude)
## Quick Decision Tree
```
Need context?
├─ External/public knowledge? → Exa (docs, examples, APIs)
├─ Large codebase (>500 files)? → codex_lens
├─ Known files (<5)? → Read tool
└─ Unknown files? → smart_search → Read tool
Need to modify files?
├─ Built-in Edit fails? → mcp__ccw-tools__edit_file
└─ Still fails? → mcp__ccw-tools__write_file
Need to search?
├─ Semantic/concept search? → smart_search (mode=semantic)
├─ Exact pattern match? → Grep tool
└─ Multiple search modes needed? → smart_search (mode=auto)
```
---
## 1. Context Gathering Tools
### Exa (`mcp__exa__get_code_context_exa`)
**Use When**:
- ✅ Researching external APIs, libraries, frameworks
- ✅ Need recent documentation (post-cutoff knowledge)
- ✅ Looking for implementation examples in public repos
- ✅ Comparing architectural patterns across projects
**Don't Use When**:
- ❌ Searching internal codebase (use smart_search/codex_lens)
- ❌ Files already in working directory (use Read)
**Trigger Indicators**:
- User mentions specific library/framework names
- Questions about "best practices", "how does X work"
- Need to verify current API signatures
---
### read_file (`mcp__ccw-tools__read_file`)
**Use When**:
- ✅ Reading multiple related files at once (batch reading)
- ✅ Need directory traversal with pattern matching
- ✅ Searching file content with regex (`contentPattern`)
- ✅ Want to limit depth/file count for large directories
**Don't Use When**:
- ❌ Single file read → Use built-in Read tool (faster)
- ❌ Unknown file locations → Use smart_search first
- ❌ Need semantic search → Use smart_search or codex_lens
**Trigger Indicators**:
- Need to read "all TypeScript files in src/"
- Need to find "files containing TODO comments"
- Want to read "up to 20 config files"
**Advantages over Built-in Read**:
- Batch operation (multiple files in one call)
- Pattern-based filtering (glob + content regex)
- Directory traversal with depth control
---
### codex_lens (`mcp__ccw-tools__codex_lens`)
**Use When**:
- ✅ Large codebase (>500 files) requiring repeated searches
- ✅ Need semantic understanding of code relationships
- ✅ Working across multiple sessions (persistent index)
- ✅ Symbol-level navigation needed
**Don't Use When**:
- ❌ Small project (<100 files) → Use smart_search (no indexing overhead)
- ❌ One-time search → Use smart_search or Grep
- ❌ Files change frequently → Indexing overhead not worth it
**Trigger Indicators**:
- "Find all implementations of interface X"
- "What calls this function across the codebase?"
- Multi-session workflow on same codebase
**Action Selection**:
- `init`: First time in new codebase
- `search`: Find code patterns
- `search_files`: Find files by path/name pattern
- `symbol`: Get symbols in specific file
- `status`: Check if index exists/is stale
- `clean`: Remove stale index
---
### smart_search (`mcp__ccw-tools__smart_search`)
**Use When**:
- ✅ Don't know exact file locations
- ✅ Need concept/semantic search ("authentication logic")
- ✅ Medium-sized codebase (100-500 files)
- ✅ One-time or infrequent searches
**Don't Use When**:
- ❌ Known exact file path → Use Read directly
- ❌ Large codebase + repeated searches → Use codex_lens
- ❌ Exact pattern match → Use Grep (faster)
**Mode Selection**:
- `auto`: Let tool decide (default, safest)
- `exact`: Know exact pattern, need fast results
- `fuzzy`: Typo-tolerant file/symbol names
- `semantic`: Concept-based ("error handling", "data validation")
- `graph`: Dependency/relationship analysis
**Trigger Indicators**:
- "Find files related to user authentication"
- "Where is the payment processing logic?"
- "Locate database connection setup"
---
## 2. File Modification Tools
### edit_file (`mcp__ccw-tools__edit_file`)
**Use When**:
- ✅ Built-in Edit tool failed 1+ times
- ✅ Need dry-run preview before applying
- ✅ Need line-based operations (insert_after, insert_before)
- ✅ Need to replace all occurrences
**Don't Use When**:
- ❌ Built-in Edit hasn't failed yet → Try built-in first
- ❌ Need to create new file → Use write_file
**Trigger Indicators**:
- Built-in Edit returns "old_string not found"
- Built-in Edit fails due to whitespace/formatting
- Need to verify changes before applying (dryRun=true)
**Mode Selection**:
- `mode=update`: Replace text (similar to built-in Edit)
- `mode=line`: Line-based operations (insert_after, insert_before, delete)
---
### write_file (`mcp__ccw-tools__write_file`)
**Use When**:
- ✅ Creating brand new files
- ✅ MCP edit_file still fails (last resort)
- ✅ Need to completely replace file content
- ✅ Need backup before overwriting
**Don't Use When**:
- ❌ File exists + small change → Use Edit tools
- ❌ Built-in Edit hasn't been tried → Try built-in Edit first
**Trigger Indicators**:
- All Edit attempts failed
- Need to create new file with specific content
- User explicitly asks to "recreate file"
---
## 3. Decision Logic
### File Reading Priority
```
1. Known single file? → Built-in Read
2. Multiple files OR pattern matching? → mcp__ccw-tools__read_file
3. Unknown location? → smart_search, then Read
4. Large codebase + repeated access? → codex_lens
```
### File Editing Priority
```
1. Always try built-in Edit first
2. Fails 1+ times? → mcp__ccw-tools__edit_file
3. Still fails? → mcp__ccw-tools__write_file (last resort)
```
### Search Tool Priority
```
1. External knowledge? → Exa
2. Exact pattern in small codebase? → Built-in Grep
3. Semantic/unknown location? → smart_search
4. Large codebase + repeated searches? → codex_lens
```
---
## 4. Anti-Patterns
**Don't**:
- Use codex_lens for one-time searches in small projects
- Use smart_search when file path is already known
- Use write_file before trying Edit tools
- Use Exa for internal codebase searches
- Use read_file for single file when Read tool works
**Do**:
- Start with simplest tool (Read, Edit, Grep)
- Escalate to MCP tools when built-ins fail
- Use semantic search (smart_search) for exploratory tasks
- Use indexed search (codex_lens) for large, stable codebases
- Use Exa for external/public knowledge

View File

@@ -1,942 +0,0 @@
# Workflow Architecture
## Overview
This document defines the complete workflow system architecture using a **JSON-only data model**, **marker-based session management**, and **unified file structure** with dynamic task decomposition.
## Core Architecture
### JSON-Only Data Model
**JSON files (.task/IMPL-*.json) are the only authoritative source of task state. All markdown documents are read-only generated views.**
- **Task State**: Stored exclusively in JSON files
- **Documents**: Generated on-demand from JSON data
- **No Synchronization**: Eliminates bidirectional sync complexity
- **Performance**: Direct JSON access without parsing overhead
### Key Design Decisions
- **JSON files are the single source of truth** - All markdown documents are read-only generated views
- **Marker files for session tracking** - Ultra-simple active session management
- **Unified file structure definition** - Same structure template for all workflows, created on-demand
- **Dynamic task decomposition** - Subtasks created as needed during execution
- **On-demand file creation** - Directories and files created only when required
- **Agent-agnostic task definitions** - Complete context preserved for autonomous execution
## Session Management
### Directory-Based Session Management
**Simple Location-Based Tracking**: Sessions in `.workflow/active/` directory
```bash
.workflow/
├── active/
│ ├── WFS-oauth-integration/ # Active session directory
│ ├── WFS-user-profile/ # Active session directory
│ └── WFS-bug-fix-123/ # Active session directory
└── archives/
└── WFS-old-feature/ # Archived session (completed)
```
### Session Operations
#### Detect Active Session(s)
```bash
active_sessions=$(find .workflow/active/ -name "WFS-*" -type d 2>/dev/null)
count=$(echo "$active_sessions" | wc -l)
if [ -z "$active_sessions" ]; then
echo "No active session"
elif [ "$count" -eq 1 ]; then
session_name=$(basename "$active_sessions")
echo "Active session: $session_name"
else
echo "Multiple sessions found:"
echo "$active_sessions" | while read session_dir; do
session=$(basename "$session_dir")
echo " - $session"
done
echo "Please specify which session to work with"
fi
```
#### Archive Session
```bash
mv .workflow/active/WFS-feature .workflow/archives/WFS-feature
```
### Session State Tracking
Each session directory contains `workflow-session.json`:
```json
{
"session_id": "WFS-[topic-slug]",
"project": "feature description",
"type": "simple|medium|complex",
"current_phase": "PLAN|IMPLEMENT|REVIEW",
"status": "active|paused|completed",
"progress": {
"completed_phases": ["PLAN"],
"current_tasks": ["IMPL-1", "IMPL-2"]
}
}
```
## Task System
### Hierarchical Task Structure
**Maximum Depth**: 2 levels (IMPL-N.M format)
```
IMPL-1 # Main task
IMPL-1.1 # Subtask of IMPL-1 (dynamically created)
IMPL-1.2 # Another subtask of IMPL-1
IMPL-2 # Another main task
IMPL-2.1 # Subtask of IMPL-2 (dynamically created)
```
**Task Status Rules**:
- **Container tasks**: Parent tasks with subtasks (cannot be directly executed)
- **Leaf tasks**: Only these can be executed directly
- **Status inheritance**: Parent status derived from subtask completion
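Status inheritance can be sketched as a rollup over subtask statuses (illustrative; in the JSON itself a parent keeps status `container`, and this computes its effective progress):

```javascript
// Derive a container task's effective status from its subtasks
// (one possible rollup policy, for illustration only).
function deriveParentStatus(subtasks) {
  if (subtasks.every(t => t.status === "completed")) return "completed";
  if (subtasks.some(t => t.status === "active")) return "active";
  if (subtasks.some(t => t.status === "blocked")) return "blocked";
  return "pending";
}

console.log(deriveParentStatus([{ status: "completed" }, { status: "active" }]));
// → active
console.log(deriveParentStatus([{ status: "completed" }, { status: "completed" }]));
// → completed
```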
### Enhanced Task JSON Schema
All task files use this unified 6-field schema with optional artifacts enhancement:
```json
{
"id": "IMPL-1.2",
"title": "Implement JWT authentication",
"status": "pending|active|completed|blocked|container",
"context_package_path": ".workflow/WFS-session/.process/context-package.json",
"meta": {
"type": "feature|bugfix|refactor|test-gen|test-fix|docs",
"agent": "@code-developer|@action-planning-agent|@test-fix-agent|@universal-executor"
},
"context": {
"requirements": ["JWT authentication", "OAuth2 support"],
"focus_paths": ["src/auth", "tests/auth", "config/auth.json"],
"acceptance": ["JWT validation works", "OAuth flow complete"],
"parent": "IMPL-1",
"depends_on": ["IMPL-1.1"],
"inherited": {
"from": "IMPL-1",
"context": ["Authentication system design completed"]
},
"shared_context": {
"auth_strategy": "JWT with refresh tokens"
},
"artifacts": [
{
"type": "role_analyses",
"source": "brainstorm_clarification",
"path": ".workflow/WFS-session/.brainstorming/*/analysis*.md",
"priority": "highest",
"contains": "role_specific_requirements_and_design"
}
]
},
"flow_control": {
"pre_analysis": [
{
"step": "check_patterns",
"action": "Analyze existing patterns",
"command": "bash(rg 'auth' [focus_paths] | head -10)",
"output_to": "patterns"
},
{
"step": "analyze_architecture",
"action": "Review system architecture",
"command": "gemini \"analyze patterns: [patterns]\"",
"output_to": "design"
},
{
"step": "check_deps",
"action": "Check dependencies",
"command": "bash(echo [depends_on] | xargs cat)",
"output_to": "context"
}
],
"implementation_approach": [
{
"step": 1,
"title": "Set up authentication infrastructure",
"description": "Install JWT library and create auth config following [design] patterns from [parent]",
"modification_points": [
"Add JWT library dependencies to package.json",
"Create auth configuration file using [parent] patterns"
],
"logic_flow": [
"Install jsonwebtoken library via npm",
"Configure JWT secret and expiration from [inherited]",
"Export auth config for use by [jwt_generator]"
],
"depends_on": [],
"output": "auth_config"
},
{
"step": 2,
"title": "Implement JWT generation",
"description": "Create JWT token generation logic using [auth_config] and [inherited] validation patterns",
"modification_points": [
"Add JWT generation function in auth service",
"Implement token signing with [auth_config]"
],
"logic_flow": [
"User login → validate credentials with [inherited]",
"Generate JWT payload with user data",
"Sign JWT using secret from [auth_config]",
"Return signed token"
],
"depends_on": [1],
"output": "jwt_generator"
},
{
"step": 3,
"title": "Implement JWT validation middleware",
"description": "Create middleware to validate JWT tokens using [auth_config] and [shared] rules",
"modification_points": [
"Create validation middleware using [jwt_generator]",
"Add token verification using [shared] rules",
"Implement user attachment to request object"
],
"logic_flow": [
"Protected route → extract JWT from Authorization header",
"Validate token signature using [auth_config]",
"Check token expiration and [shared] rules",
"Decode payload and attach user to request",
"Call next() or return 401 error"
],
"command": "bash(npm test -- middleware.test.ts)",
"depends_on": [1, 2],
"output": "auth_middleware"
}
],
"target_files": [
"src/auth/login.ts:handleLogin:75-120",
"src/middleware/auth.ts:validateToken",
"src/auth/PasswordReset.ts"
]
}
}
```
### Focus Paths & Context Management
#### Context Package Path (Top-Level Field)
The **context_package_path** field provides the location of the smart context package:
- **Location**: Top-level field (not in `artifacts` array)
- **Path**: `.workflow/WFS-session/.process/context-package.json`
- **Purpose**: References the comprehensive context package containing project structure, dependencies, and brainstorming artifacts catalog
- **Usage**: Loaded in `pre_analysis` steps via `Read({{context_package_path}})`
#### Focus Paths Format
The **focus_paths** field specifies concrete project paths for task implementation:
- **Array of strings**: `["folder1", "folder2", "specific_file.ts"]`
- **Concrete paths**: Use actual directory/file names without wildcards
- **Mixed types**: Can include both directories and specific files
- **Relative paths**: From project root (e.g., `src/auth`, not `./src/auth`)
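These constraints are easy to check mechanically. A minimal sketch — the function name and exact messages are illustrative, not part of the schema:

```python
def validate_focus_paths(paths):
    """Flag focus_paths entries that violate the rules above."""
    errors = []
    for p in paths:
        # Concrete paths only: wildcards are not allowed
        if any(ch in p for ch in "*?["):
            errors.append(f"wildcard not allowed: {p}")
        # Paths are relative to the project root, without a leading ./
        if p.startswith(("./", "/")):
            errors.append(f"use a project-root-relative path: {p}")
    return errors
```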
#### Artifacts Field ⚠️ NEW FIELD
Optional field referencing brainstorming outputs for task execution:
```json
"artifacts": [
{
"type": "role_analyses|topic_framework|individual_role_analysis",
"source": "brainstorm_clarification|brainstorm_framework|brainstorm_roles",
"path": ".workflow/WFS-session/.brainstorming/document.md",
"priority": "highest|high|medium|low"
}
]
```
**Types & Priority**: role_analyses (highest) → topic_framework (medium) → individual_role_analysis (low)
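When several artifacts are attached, they can be consumed highest-priority first. A sketch of that ordering (names are illustrative):

```python
# Lower rank = consumed earlier
PRIORITY_ORDER = {"highest": 0, "high": 1, "medium": 2, "low": 3}

def sort_artifacts(artifacts):
    """Return artifacts ordered highest-priority first."""
    return sorted(artifacts, key=lambda a: PRIORITY_ORDER[a["priority"]])
```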
#### Flow Control Configuration
The **flow_control** field manages task execution through structured sequential steps. For complete format specifications and usage guidelines, see [Flow Control Format Guide](#flow-control-format-guide) below.
**Quick Reference**:
- **pre_analysis**: Context gathering steps (supports multiple command types)
- **implementation_approach**: Implementation steps array with dependency management
- **target_files**: Target files for modification (file:function:lines format)
- **Variable references**: Use `[variable_name]` to reference step outputs
- **Tool integration**: Supports Gemini, Codex, Bash commands, and MCP tools
## Flow Control Format Guide
The `[FLOW_CONTROL]` marker indicates that a task or prompt contains flow control steps for sequential execution. There are **two distinct formats** used in different scenarios:
### Format Comparison Matrix
| Aspect | Inline Format | JSON Format |
|--------|--------------|-------------|
| **Used In** | Brainstorm workflows | Implementation tasks |
| **Agent** | conceptual-planning-agent | code-developer, test-fix-agent, doc-generator |
| **Location** | Task() prompt (markdown) | .task/IMPL-*.json file |
| **Persistence** | Temporary (prompt-only) | Persistent (file storage) |
| **Complexity** | Simple (3-5 steps) | Complex (10+ steps) |
| **Dependencies** | None | Full `depends_on` support |
| **Purpose** | Load brainstorming context | Implement task with preparation |
### Inline Format (Brainstorm)
**Marker**: `[FLOW_CONTROL]` written directly in Task() prompt
**Structure**: Markdown list format
**Used By**: Brainstorm commands (`auto-parallel.md`, role commands)
**Agent**: `conceptual-planning-agent`
**Example**:
```markdown
[FLOW_CONTROL]
### Flow Control Steps
**AGENT RESPONSIBILITY**: Execute these pre_analysis steps sequentially with context accumulation:
1. **load_topic_framework**
- Action: Load structured topic discussion framework
- Command: Read(.workflow/WFS-{session}/.brainstorming/guidance-specification.md)
- Output: topic_framework
2. **load_role_template**
- Action: Load role-specific planning template
- Command: bash($(cat "~/.ccw/workflows/cli-templates/planning-roles/{role}.md"))
- Output: role_template
3. **load_session_metadata**
- Action: Load session metadata and topic description
- Command: bash(cat .workflow/WFS-{session}/workflow-session.json 2>/dev/null || echo '{}')
- Output: session_metadata
```
**Characteristics**:
- 3-5 simple context loading steps
- Written directly in prompt (not persistent)
- No dependency management between steps
- Used for temporary context preparation
- Variables: `[variable_name]` for output references
### JSON Format (Implementation)
**Marker**: `[FLOW_CONTROL]` used in TodoWrite or documentation to indicate task has flow control
**Structure**: Complete JSON structure in task file
**Used By**: Implementation tasks (IMPL-*.json)
**Agents**: `code-developer`, `test-fix-agent`, `doc-generator`
**Example**:
```json
"flow_control": {
"pre_analysis": [
{
"step": "load_role_analyses",
"action": "Load role analysis documents from brainstorming",
"commands": [
"bash(ls .workflow/WFS-{session}/.brainstorming/*/analysis*.md 2>/dev/null || echo 'not found')",
"Glob(.workflow/WFS-{session}/.brainstorming/*/analysis*.md)",
"Read(each discovered role analysis file)"
],
"output_to": "role_analyses",
"on_error": "skip_optional"
},
{
"step": "local_codebase_exploration",
"action": "Explore codebase using local search",
"commands": [
"bash(rg '^(function|class|interface).*auth' --type ts -n --max-count 15)",
"bash(find . -name '*auth*' -type f | grep -v node_modules | head -10)"
],
"output_to": "codebase_structure"
}
],
"implementation_approach": [
{
"step": 1,
"title": "Setup infrastructure",
"description": "Install JWT library and create config following [role_analyses]",
"modification_points": [
"Add JWT library dependencies to package.json",
"Create auth configuration file"
],
"logic_flow": [
"Install jsonwebtoken library via npm",
"Configure JWT secret from [role_analyses]",
"Export auth config for use by [jwt_generator]"
],
"depends_on": [],
"output": "auth_config"
},
{
"step": 2,
"title": "Implement JWT generation",
"description": "Create JWT token generation logic using [auth_config]",
"modification_points": [
"Add JWT generation function in auth service",
"Implement token signing with [auth_config]"
],
"logic_flow": [
"User login → validate credentials",
"Generate JWT payload with user data",
"Sign JWT using secret from [auth_config]",
"Return signed token"
],
"depends_on": [1],
"output": "jwt_generator"
}
],
"target_files": [
"src/auth/login.ts:handleLogin:75-120",
"src/middleware/auth.ts:validateToken"
]
}
```
**Characteristics**:
- Persistent storage in .task/IMPL-*.json files
- Complete dependency management (`depends_on` arrays)
- Two-phase structure: `pre_analysis` + `implementation_approach`
- Error handling strategies (`on_error` field)
- Target file specifications
- Variables: `[variable_name]` for cross-step references
### JSON Format Field Specifications
#### pre_analysis Field
**Purpose**: Context gathering phase before implementation
**Structure**: Array of step objects with sequential execution
**Step Fields**:
- **step**: Step identifier (string, e.g., "load_role_analyses")
- **action**: Human-readable description of the step
- **command** or **commands**: Single command string or array of command strings
- **output_to**: Variable name for storing step output
- **on_error**: Error handling strategy (`skip_optional`, `fail`, `retry_once`, `manual_intervention`)
**Command Types Supported**:
- **Bash commands**: `bash(command)` - Any shell command
- **Tool calls**: `Read(file)`, `Glob(pattern)`, `Grep(pattern)`
- **MCP tools**: `mcp__exa__get_code_context_exa()`, `mcp__exa__web_search_exa()`
- **CLI commands**: `gemini`, `qwen`, `codex --full-auto exec`
**Example**:
```json
{
"step": "load_context",
"action": "Load project context and patterns",
"commands": [
"bash(ccw tool exec get_modules_by_depth '{}')",
"Read(CLAUDE.md)"
],
"output_to": "project_structure",
"on_error": "skip_optional"
}
```
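A runner for this contract reduces to a sequential loop that stores each step's output under its `output_to` name. The sketch below stubs command dispatch behind an `execute` callable supplied by the host — both names are illustrative:

```python
def run_pre_analysis(steps, execute):
    """Run pre_analysis steps in order, accumulating outputs in a context dict.

    `execute` is a callable (command, context) -> str; dispatching
    bash()/Read()/MCP commands is the host's responsibility.
    """
    context = {}
    for step in steps:
        # A step carries either a single "command" or a "commands" array
        commands = step.get("commands") or [step["command"]]
        outputs = []
        for cmd in commands:
            try:
                outputs.append(execute(cmd, context))
            except Exception:
                if step.get("on_error") == "skip_optional":
                    break  # optional step: keep going with what we have
                raise
        context[step["output_to"]] = "\n".join(outputs)
    return context
```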
#### implementation_approach Field
**Purpose**: Define implementation steps with dependency management
**Structure**: Array of step objects (NOT object format)
**Step Fields (All Required)**:
- **step**: Unique step number (1, 2, 3, ...) - serves as step identifier
- **title**: Brief step title
- **description**: Comprehensive implementation description with context variable references
- **modification_points**: Array of specific code modification targets
- **logic_flow**: Array describing business logic execution sequence
- **depends_on**: Array of step numbers this step depends on (e.g., `[1]`, `[1, 2]`) - empty array `[]` for independent steps
- **output**: Output variable name that can be referenced by subsequent steps via `[output_name]`
**Optional Fields**:
- **command**: Command for step execution (supports any shell command or CLI tool)
- When omitted: Agent interprets modification_points and logic_flow to execute
- When specified: Command executes the step directly
**Execution Modes**:
- **Default (without command)**: Agent executes based on modification_points and logic_flow
- **With command**: Specified command handles execution
**Command Field Usage**:
- **Default approach**: Omit command field - let agent execute autonomously
- **CLI tools (codex/gemini/qwen)**: Add ONLY when user explicitly requests CLI tool usage
- **Simple commands**: Can include bash commands, test commands, validation scripts
- **Complex workflows**: Use command for multi-step operations or tool coordination
**Command Format Examples** (only when explicitly needed):
```json
// Simple Bash
"command": "bash(npm install package)"
"command": "bash(npm test)"
// Validation
"command": "bash(test -f config.ts && grep -q 'JWT_SECRET' config.ts)"
// Codex (user requested)
"command": "codex -C path --full-auto exec \"task\" --skip-git-repo-check -s danger-full-access"
// Codex Resume (user requested, maintains context)
"command": "codex --full-auto exec \"task\" resume --last --skip-git-repo-check -s danger-full-access"
// Gemini (user requested)
"command": "gemini \"analyze [context]\""
// Qwen (fallback for Gemini)
"command": "qwen \"analyze [context]\""
```
**Example Step**:
```json
{
"step": 2,
"title": "Implement JWT generation",
"description": "Create JWT token generation logic using [auth_config]",
"modification_points": [
"Add JWT generation function in auth service",
"Implement token signing with [auth_config]"
],
"logic_flow": [
"User login → validate credentials",
"Generate JWT payload with user data",
"Sign JWT using secret from [auth_config]",
"Return signed token"
],
"depends_on": [1],
"output": "jwt_generator"
}
```
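Because `depends_on` is explicit, a valid execution order for the step array can be derived mechanically. A minimal sketch, including cycle detection (the function name is illustrative):

```python
def execution_order(steps):
    """Order steps so each runs only after all of its depends_on."""
    done, order = set(), []
    pending = {s["step"]: s for s in steps}
    while pending:
        ready = [n for n, s in pending.items()
                 if all(d in done for d in s["depends_on"])]
        if not ready:
            raise ValueError("circular or missing step dependency")
        for n in sorted(ready):  # prefer lower step numbers when independent
            order.append(pending.pop(n))
            done.add(n)
    return order
```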
#### target_files Field
**Purpose**: Specify files to be modified or created
**Format**: Array of strings
- **Existing files**: `"file:function:lines"` (e.g., `"src/auth/login.ts:handleLogin:75-120"`)
- **New files**: `"path/to/NewFile.ts"` (file path only)
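Both entry forms can be split with the same parser, since the function and line-range parts are simply absent for new files. A sketch (name is illustrative):

```python
def parse_target(entry):
    """Split a target_files entry into (path, function, line_range).

    Missing parts come back as None, so new-file entries parse too.
    """
    path, _, rest = entry.partition(":")
    func, _, lines = rest.partition(":")
    return path, func or None, lines or None
```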
### Tool Reference
**Available Command Types**:
**Gemini CLI**:
```bash
gemini "prompt"
gemini --approval-mode yolo "prompt" # For write mode
```
**Qwen CLI** (Gemini fallback):
```bash
qwen "prompt"
qwen --approval-mode yolo "prompt" # For write mode
```
**Codex CLI**:
```bash
codex -C directory --full-auto exec "task" --skip-git-repo-check -s danger-full-access
codex --full-auto exec "task" resume --last --skip-git-repo-check -s danger-full-access
```
**Built-in Tools**:
- `Read(file_path)` - Read file contents
- `Glob(pattern)` - Find files by pattern
- `Grep(pattern)` - Search content with regex
- `bash(command)` - Execute bash command
**MCP Tools**:
- `mcp__exa__get_code_context_exa(query="...")` - Get code context from Exa
- `mcp__exa__web_search_exa(query="...")` - Web search via Exa
**Bash Commands**:
```bash
bash(rg 'pattern' src/)
bash(find . -name "*.ts")
bash(npm test)
bash(git log --oneline | head -5)
```
### Variable System & Context Flow
**Variable Reference Syntax**:
Both formats use `[variable_name]` syntax for referencing outputs from previous steps.
**Variable Types**:
- **Step outputs**: `[step_output_name]` - Reference any pre_analysis step output
- **Task properties**: `[task_property]` - Reference any task context field
- **Previous results**: `[analysis_result]` - Reference accumulated context
- **Implementation outputs**: Reference outputs from previous implementation steps
**Examples**:
```json
// Reference pre_analysis output
"description": "Install JWT library following [role_analyses]"
// Reference previous step output
"description": "Create middleware using [auth_config] and [jwt_generator]"
// Reference task context
"command": "bash(cd [focus_paths] && npm test)"
```
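The `[variable_name]` syntax is a plain textual substitution over the accumulated context. A sketch that leaves unknown markers intact rather than failing (the function name is illustrative):

```python
import re

def substitute(text, context):
    """Replace [name] markers with context values; keep unknown markers as-is."""
    return re.sub(
        r"\[(\w+)\]",
        lambda m: str(context.get(m.group(1), m.group(0))),
        text,
    )
```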
**Context Accumulation Process**:
1. **Structure Analysis**: `get_modules_by_depth.sh` → project hierarchy
2. **Pattern Analysis**: Tool-specific commands → existing patterns
3. **Dependency Mapping**: Previous task summaries → inheritance context
4. **Task Context Generation**: Combined analysis → task.context fields
**Context Inheritance Rules**:
- **Parent → Child**: Container tasks pass context via `context.inherited`
- **Dependency → Dependent**: Previous task summaries via `context.depends_on`
- **Session → Task**: Global session context included in all tasks
- **Module → Feature**: Module patterns inform feature implementation
### Agent Processing Rules
**conceptual-planning-agent** (Inline Format):
- Parses markdown list from prompt
- Executes 3-5 simple loading steps
- No dependency resolution needed
- Accumulates context in variables
- Used only in brainstorm workflows
**code-developer, test-fix-agent** (JSON Format):
- Loads complete task JSON from file
- Executes `pre_analysis` steps sequentially
- Processes `implementation_approach` with dependency resolution
- Handles complex variable substitution
- Updates task status in JSON file
### Usage Guidelines
**Use Inline Format When**:
- Running brainstorm workflows
- Need 3-5 simple context loading steps
- No persistence required
- No dependencies between steps
- Temporary context preparation
**Use JSON Format When**:
- Implementing features or tasks
- Need 10+ complex execution steps
- Require dependency management
- Need persistent task definitions
- Complex variable flow between steps
- Error handling strategies needed
### Variable Reference Syntax
Both formats use `[variable_name]` syntax for referencing outputs:
**Inline Format**:
```markdown
2. **analyze_context**
- Action: Analyze using [topic_framework] and [role_template]
- Output: analysis_results
```
**JSON Format**:
```json
{
"step": 2,
"description": "Implement following [role_analyses] and [codebase_structure]",
"depends_on": [1],
"output": "implementation"
}
```
### Task Validation Rules
1. **ID Uniqueness**: All task IDs must be unique
2. **Hierarchical Format**: Must follow IMPL-N[.M] pattern (maximum 2 levels)
3. **Parent References**: All parent IDs must exist as JSON files
4. **Status Consistency**: Status values from defined enumeration
5. **Required Fields**: All 6 core fields must be present (id, title, status, meta, context, flow_control)
6. **Focus Paths Structure**: context.focus_paths must contain concrete paths (no wildcards)
7. **Flow Control Format**: pre_analysis must be array with required fields
8. **Dependency Integrity**: All task-level depends_on references must exist as JSON files
9. **Artifacts Structure**: context.artifacts (optional) must use valid type, priority, and path format
10. **Implementation Steps Array**: implementation_approach must be array of step objects
11. **Step Number Uniqueness**: All step numbers within a task must be unique and sequential (1, 2, 3, ...)
12. **Step Dependencies**: All step-level depends_on numbers must reference valid steps within same task
13. **Step Sequence**: Step numbers should match array order (first item step=1, second item step=2, etc.)
14. **Step Required Fields**: Each step must have step, title, description, modification_points, logic_flow, depends_on, output
15. **Step Optional Fields**: command field is optional - when omitted, agent executes based on modification_points and logic_flow
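Several of these rules can be expressed directly in code. The sketch below checks only required fields (rule 5), step sequencing (rules 11/13), and step dependencies (rule 12); it is not an exhaustive validator, and the names are illustrative:

```python
def validate_task(task):
    """Return a list of rule violations found in one task JSON object."""
    problems = []
    for field in ("id", "title", "status", "meta", "context", "flow_control"):
        if field not in task:
            problems.append(f"missing field: {field}")
    steps = task.get("flow_control", {}).get("implementation_approach", [])
    for i, step in enumerate(steps, start=1):
        # Step numbers must be sequential and match array order
        if step.get("step") != i:
            problems.append(f"step {step.get('step')} out of sequence (expected {i})")
        # Step-level depends_on must reference steps in the same task
        for dep in step.get("depends_on", []):
            if not any(s.get("step") == dep for s in steps):
                problems.append(f"step {step.get('step')} depends on unknown step {dep}")
    return problems
```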
## Workflow Structure
### Unified File Structure
All workflows use the same file structure definition regardless of complexity. **Directories and files are created on-demand as needed**, not all at once during initialization.
#### Complete Structure Reference
```
.workflow/
├── [.scratchpad/] # Non-session-specific outputs (created when needed)
│ ├── analyze-*-[timestamp].md # One-off analysis results
│ ├── chat-*-[timestamp].md # Standalone chat sessions
│ ├── plan-*-[timestamp].md # Ad-hoc planning notes
│ ├── bug-index-*-[timestamp].md # Quick bug analyses
│ ├── code-analysis-*-[timestamp].md # Standalone code analysis
│ ├── execute-*-[timestamp].md # Ad-hoc implementation logs
│ └── codex-execute-*-[timestamp].md # Multi-stage execution logs
├── [design-run-*/] # Standalone UI design outputs (created when needed)
│ └── (timestamped)/ # Timestamped design runs without session
│ ├── .intermediates/ # Intermediate analysis files
│ │ ├── style-analysis/ # Style analysis data
│ │ │ ├── computed-styles.json # Extracted CSS values
│ │ │ └── design-space-analysis.json # Design directions
│ │ └── layout-analysis/ # Layout analysis data
│ │ ├── dom-structure-{target}.json # DOM extraction
│ │ └── inspirations/ # Layout research
│ │ └── {target}-layout-ideas.txt
│ ├── style-extraction/ # Final design systems
│ │ ├── style-1/ # design-tokens.json, style-guide.md
│ │ └── style-N/
│ ├── layout-extraction/ # Layout templates
│ │ └── layout-templates.json
│ ├── prototypes/ # Generated HTML/CSS prototypes
│ │ ├── {target}-style-{s}-layout-{l}.html # Final prototypes
│ │ ├── compare.html # Interactive matrix view
│ │ └── index.html # Navigation page
│ └── .run-metadata.json # Run configuration
├── active/ # Active workflow sessions
│ └── WFS-[topic-slug]/
│ ├── workflow-session.json # Session metadata and state (REQUIRED)
│ ├── [.brainstorming/] # Optional brainstorming phase (created when needed)
│ ├── [.chat/] # CLI interaction sessions (created when analysis is run)
│ │ ├── chat-*.md # Saved chat sessions
│ │ └── analysis-*.md # Analysis results
│ ├── [.process/] # Planning analysis results (created by /workflow-plan)
│ │ └── ANALYSIS_RESULTS.md # Analysis results and planning artifacts
│ ├── IMPL_PLAN.md # Planning document (REQUIRED)
│ ├── TODO_LIST.md # Progress tracking (REQUIRED)
│ ├── [.summaries/] # Task completion summaries (created when tasks complete)
│ │ ├── IMPL-*-summary.md # Main task summaries
│ │ └── IMPL-*.*-summary.md # Subtask summaries
│ ├── [.review/] # Code review results (created by review commands)
│ │ ├── review-metadata.json # Review configuration and scope
│ │ ├── review-state.json # Review state machine
│ │ ├── review-progress.json # Real-time progress tracking
│ │ ├── dimensions/ # Per-dimension analysis results
│ │ ├── iterations/ # Deep-dive iteration results
│ │ ├── reports/ # Human-readable reports and CLI outputs
│ │ ├── REVIEW-SUMMARY.md # Final consolidated summary
│ │ └── dashboard.html # Interactive review dashboard
│ ├── [design-*/] # UI design outputs (created by ui-design workflows)
│ │ ├── .intermediates/ # Intermediate analysis files
│ │ │ ├── style-analysis/ # Style analysis data
│ │ │ │ ├── computed-styles.json # Extracted CSS values
│ │ │ │ └── design-space-analysis.json # Design directions
│ │ │ └── layout-analysis/ # Layout analysis data
│ │ │ ├── dom-structure-{target}.json # DOM extraction
│ │ │ └── inspirations/ # Layout research
│ │ │ └── {target}-layout-ideas.txt
│ │ ├── style-extraction/ # Final design systems
│ │ │ ├── style-1/ # design-tokens.json, style-guide.md
│ │ │ └── style-N/
│ │ ├── layout-extraction/ # Layout templates
│ │ │ └── layout-templates.json
│ │ ├── prototypes/ # Generated HTML/CSS prototypes
│ │ │ ├── {target}-style-{s}-layout-{l}.html # Final prototypes
│ │ │ ├── compare.html # Interactive matrix view
│ │ │ └── index.html # Navigation page
│ │ └── .run-metadata.json # Run configuration
│ └── .task/ # Task definitions (REQUIRED)
│ ├── IMPL-*.json # Main task definitions
│ └── IMPL-*.*.json # Subtask definitions (created dynamically)
└── archives/ # Completed workflow sessions
└── WFS-[completed-topic]/ # Archived session directories
```
#### Creation Strategy
- **Initial Setup**: Create only `workflow-session.json`, `IMPL_PLAN.md`, `TODO_LIST.md`, and `.task/` directory
- **On-Demand Creation**: Other directories created when first needed
- **Dynamic Files**: Subtask JSON files created during task decomposition
- **Scratchpad Usage**: `.scratchpad/` created when CLI commands run without active session
- **Design Usage**: `design-{timestamp}/` created by UI design workflows in `.workflow/` directly for standalone design runs
- **Review Usage**: `.review/` created by review commands (`/workflow:review-module-cycle`, `/workflow:review-session-cycle`) for comprehensive code quality analysis
- **Intermediate Files**: `.intermediates/` contains analysis data (style/layout) separate from final deliverables
- **Layout Templates**: `layout-extraction/layout-templates.json` contains structural templates for UI assembly
#### Scratchpad Directory (.scratchpad/)
**Purpose**: Centralized location for non-session-specific CLI outputs
**When to Use**:
1. **No Active Session**: CLI analysis/chat commands run without an active workflow session
2. **Unrelated Analysis**: Quick analysis not related to current active session
3. **Exploratory Work**: Ad-hoc investigation before creating formal workflow
4. **One-Off Queries**: Standalone questions or debugging without workflow context
**Output Routing Logic**:
- **IF** active session exists in `.workflow/active/` AND command is session-relevant:
- Save to `.workflow/active/WFS-[id]/.chat/[command]-[timestamp].md`
- **ELSE** (no session OR one-off analysis):
- Save to `.workflow/.scratchpad/[command]-[description]-[timestamp].md`
**File Naming Pattern**: `[command-type]-[brief-description]-[timestamp].md`
**Examples**:
*Workflow Commands (lightweight):*
- `/workflow-lite-plan "feature idea"` (exploratory) → `.scratchpad/lite-plan-feature-idea-20250105-143110.md`
- `/workflow:lite-fix "bug description"` (bug fixing) → `.scratchpad/lite-fix-bug-20250105-143130.md`
> **Note**: Direct CLI commands (`/cli:analyze`, `/cli:execute`, etc.) have been replaced by semantic invocation and workflow commands.
**Maintenance**:
- Periodically review and clean up old scratchpad files
- Promote useful analyses to formal workflow sessions if needed
- No automatic cleanup - manual management recommended
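The routing logic and naming pattern above fit in a single function. A sketch — the signature and parameter names are illustrative:

```python
from datetime import datetime
from pathlib import Path

def output_path(command, description, active_session=None, session_relevant=False):
    """Route a CLI output file per the IF/ELSE rule above."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    if active_session and session_relevant:
        # Session-relevant output goes into the active session's .chat/
        return Path(f".workflow/active/{active_session}/.chat/{command}-{stamp}.md")
    # No session, or a one-off analysis: route to the scratchpad
    return Path(f".workflow/.scratchpad/{command}-{description}-{stamp}.md")
```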
### File Naming Conventions
#### Session Identifiers
**Format**: `WFS-[topic-slug]`
**WFS Prefix Meaning**:
- `WFS` = **W**ork**F**low **S**ession
- Identifies directories as workflow session containers
- Distinguishes workflow sessions from other project directories
**Naming Rules**:
- Convert topic to lowercase with hyphens (e.g., "User Auth System" → `WFS-user-auth-system`)
- Add `-NNN` suffix only if conflicts exist (e.g., `WFS-payment-integration-002`)
- Maximum length: 50 characters including WFS- prefix
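These rules amount to a short slug transform. A sketch (the conflict-suffix rule is left out; the function name is illustrative):

```python
import re

def session_id(topic):
    """Derive a WFS-[topic-slug] identifier from a free-form topic string."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"WFS-{slug}"[:50]  # cap at 50 chars including the prefix
```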
#### Document Naming
- `workflow-session.json` - Session state (required)
- `IMPL_PLAN.md` - Planning document (required)
- `TODO_LIST.md` - Progress tracking (auto-generated when needed)
- Chat sessions: `chat-analysis-*.md`
- Task summaries: `IMPL-[task-id]-summary.md`
### Document Templates
#### TODO_LIST.md Template
```markdown
# Tasks: [Session Topic]
## Task Progress
**IMPL-001**: [Main Task Group] → [📋](./.task/IMPL-001.json)
- [ ] **IMPL-001.1**: [Subtask] → [📋](./.task/IMPL-001.1.json)
- [x] **IMPL-001.2**: [Subtask] → [📋](./.task/IMPL-001.2.json) | [](./.summaries/IMPL-001.2-summary.md)
- [x] **IMPL-002**: [Simple Task] → [📋](./.task/IMPL-002.json) | [](./.summaries/IMPL-002-summary.md)
## Status Legend
- `▸` = Container task (has subtasks)
- `- [ ]` = Pending leaf task
- `- [x]` = Completed leaf task
- Maximum 2 levels: Main tasks and subtasks only
```
## Operations Guide
### Session Management
```bash
# Create minimal required structure
mkdir -p .workflow/active/WFS-topic-slug/.task
echo '{"session_id":"WFS-topic-slug",...}' > .workflow/active/WFS-topic-slug/workflow-session.json
echo '# Implementation Plan' > .workflow/active/WFS-topic-slug/IMPL_PLAN.md
echo '# Tasks' > .workflow/active/WFS-topic-slug/TODO_LIST.md
```
### Task Operations
```bash
# Create task
echo '{"id":"IMPL-1","title":"New task",...}' > .task/IMPL-1.json
# Update task status
jq '.status = "active"' .task/IMPL-1.json > temp && mv temp .task/IMPL-1.json
# Generate TODO list from JSON state
generate_todo_list_from_json .task/
```
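`generate_todo_list_from_json` is not specified here; the sketch below shows what such a renderer might emit, operating on already-loaded task objects and omitting the summary-link column (all names are illustrative):

```python
def render_todo_lines(tasks):
    """Render TODO_LIST.md checklist lines from task JSON objects."""
    lines = []
    for task in sorted(tasks, key=lambda t: t["id"]):
        mark = "x" if task["status"] == "completed" else " "
        link = f"[📋](./.task/{task['id']}.json)"
        lines.append(f"- [{mark}] **{task['id']}**: {task['title']} → {link}")
    return "\n".join(lines)
```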
### Directory Creation (On-Demand)
```bash
mkdir -p .brainstorming # When brainstorming is initiated
mkdir -p .chat # When analysis commands are run
mkdir -p .summaries # When first task completes
```
### Session Consistency Checks & Recovery
```bash
# Validate session directory structure
if [ -d ".workflow/active/" ]; then
for session_dir in .workflow/active/WFS-*; do
if [ ! -f "$session_dir/workflow-session.json" ]; then
echo "⚠️ Missing workflow-session.json in $session_dir"
fi
done
fi
```
**Recovery Strategies**:
- **Missing Session File**: Recreate workflow-session.json from template
- **Corrupted Session File**: Restore from template with basic metadata
- **Broken Task Hierarchy**: Reconstruct parent-child relationships from task JSON files
- **Orphaned Sessions**: Move incomplete sessions to archives/
## Complexity Classification
### Task Complexity Rules
**Complexity is determined by task count and decomposition needs:**
| Complexity | Task Count | Hierarchy Depth | Decomposition Behavior |
|------------|------------|----------------|----------------------|
| **Simple** | <5 tasks | 1 level (IMPL-N) | Direct execution, minimal decomposition |
| **Medium** | 5-15 tasks | 2 levels (IMPL-N.M) | Moderate decomposition, context coordination |
| **Complex** | >15 tasks | 2 levels (IMPL-N.M) | Frequent decomposition, multi-agent orchestration |
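The task-count thresholds in the table map directly to a classifier. A sketch (the function name is illustrative):

```python
def classify(task_count):
    """Map a task count to the complexity tier defined in the table above."""
    if task_count < 5:
        return "Simple"
    if task_count <= 15:
        return "Medium"
    return "Complex"
```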
### Workflow Characteristics & Tool Guidance
#### Simple Workflows
- **Examples**: Bug fixes, small feature additions, configuration changes
- **Task Decomposition**: Usually single-level tasks, minimal breakdown needed
- **Agent Coordination**: Direct execution without complex orchestration
- **Tool Strategy**: `bash()` commands, `Grep()` for pattern matching
#### Medium Workflows
- **Examples**: New features, API endpoints with integration, database schema changes
- **Task Decomposition**: Two-level hierarchy when decomposition is needed
- **Agent Coordination**: Context coordination between related tasks
- **Tool Strategy**: `gemini` for pattern analysis, `codex --full-auto` for implementation
#### Complex Workflows
- **Examples**: Major features, architecture refactoring, security implementations, multi-service deployments
- **Task Decomposition**: Frequent use of two-level hierarchy with dynamic subtask creation
- **Agent Coordination**: Multi-agent orchestration with deep context analysis
- **Tool Strategy**: `gemini` for architecture analysis, `codex --full-auto` for complex problem solving, `bash()` commands for flexible analysis
### Assessment & Upgrades
- **During Creation**: System evaluates requirements and assigns complexity
- **During Execution**: Can upgrade (Simple→Medium→Complex) but never downgrade
- **Override Allowed**: Users can specify higher complexity manually
## Agent Integration
### Agent Assignment
Based on task type and title keywords:
- **Planning tasks** → @action-planning-agent
- **Implementation** → @code-developer (code + tests)
- **Test execution/fixing** → @test-fix-agent
- **Review** → @universal-executor (optional, only when explicitly requested)
### Execution Context
Agents receive complete task JSON plus workflow context:
```json
{
"task": { /* complete task JSON */ },
"workflow": {
"session": "WFS-user-auth",
"phase": "IMPLEMENT"
}
}
```
.gitignore
@@ -143,3 +143,21 @@ ccw/.tmp-ccw-auth-home/
docs/node_modules/
docs/.vitepress/dist/
docs/.vitepress/cache/
codex-lens/.cache/huggingface/hub/models--Xenova--ms-marco-MiniLM-L-6-v2/refs/main
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/.gitattributes
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/config.json
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/quantize_config.json
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/README.md
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/special_tokens_map.json
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/tokenizer_config.json
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/tokenizer.json
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/vocab.txt
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_bnb4.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_fp16.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_int8.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_q4.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_q4f16.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_quantized.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model_uint8.onnx
codex-lens/.cache/huggingface/models/Xenova--ms-marco-MiniLM-L-6-v2/onnx/model.onnx
codex-lens/data/registry.db
@@ -0,0 +1,5 @@
import sys
import os
# Ensure the local src directory takes precedence over any installed codexlens package
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "src"))
@@ -0,0 +1,36 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "codex-lens-v2"
version = "0.1.0"
description = "Minimal code semantic search library with 2-stage pipeline"
requires-python = ">=3.10"
dependencies = []
[project.optional-dependencies]
semantic = [
"hnswlib>=0.8.0",
"numpy>=1.26",
"fastembed>=0.4.0,<2.0",
]
gpu = [
"onnxruntime-gpu>=1.16",
]
faiss-cpu = [
"faiss-cpu>=1.7.4",
]
faiss-gpu = [
"faiss-gpu>=1.7.4",
]
reranker-api = [
"httpx>=0.25",
]
dev = [
"pytest>=7.0",
"pytest-cov",
]
[tool.hatch.build.targets.wheel]
packages = ["src/codexlens"]
@@ -0,0 +1,128 @@
"""
对 D:/Claude_dms3 仓库进行索引并测试搜索。
用法: python scripts/index_and_search.py
"""
import sys
import time
from pathlib import Path
# 确保 src 可被导入
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
from codexlens.config import Config
from codexlens.core.factory import create_ann_index, create_binary_index
from codexlens.embed.local import FastEmbedEmbedder
from codexlens.indexing import IndexingPipeline
from codexlens.rerank.local import FastEmbedReranker
from codexlens.search.fts import FTSEngine
from codexlens.search.pipeline import SearchPipeline
# ─── Configuration ─────────────────────────────────────────────────────────
REPO_ROOT = Path("D:/Claude_dms3")
INDEX_DIR = Path("D:/Claude_dms3/codex-lens-v2/.index_cache")
EXTENSIONS = {".py", ".ts", ".js", ".md"}
MAX_FILE_SIZE = 50_000 # bytes
MAX_CHUNK_CHARS = 800 # max characters per chunk
CHUNK_OVERLAP = 100
# ─── File collection ───────────────────────────────────────────────────────
SKIP_DIRS = {
".git", "node_modules", "__pycache__", ".pytest_cache",
"dist", "build", ".venv", "venv", ".cache", ".index_cache",
"codex-lens-v2", # 不索引自身
}
def collect_files(root: Path) -> list[Path]:
files = []
for p in root.rglob("*"):
if any(part in SKIP_DIRS for part in p.parts):
continue
if p.is_file() and p.suffix in EXTENSIONS:
if p.stat().st_size <= MAX_FILE_SIZE:
files.append(p)
return files
# ─── Main flow ─────────────────────────────────────────────────────────────
def main():
INDEX_DIR.mkdir(parents=True, exist_ok=True)
# 1. Use a small profile for speed
config = Config(
embed_model="BAAI/bge-small-en-v1.5",
embed_dim=384,
embed_batch_size=32,
hnsw_ef=100,
hnsw_M=16,
binary_top_k=100,
ann_top_k=30,
reranker_top_k=10,
)
print("=== codex-lens-v2 索引测试 ===\n")
# 2. 收集文件
print(f"[1/4] 扫描 {REPO_ROOT} ...")
files = collect_files(REPO_ROOT)
print(f" 找到 {len(files)} 个文件")
# 3. 初始化组件
print(f"\n[2/4] 加载嵌入模型 (bge-small-en-v1.5, dim=384) ...")
embedder = FastEmbedEmbedder(config)
binary_store = create_binary_index(INDEX_DIR, config.embed_dim, config)
ann_index = create_ann_index(INDEX_DIR, config.embed_dim, config)
fts = FTSEngine(":memory:") # in-memory FTS, not persisted
# 4. Index in parallel with IndexingPipeline (chunk -> embed -> index)
print(f"[3/4] Indexing {len(files)} files in parallel ...")
pipeline = IndexingPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
fts=fts,
config=config,
)
stats = pipeline.index_files(
files,
root=REPO_ROOT,
max_chunk_chars=MAX_CHUNK_CHARS,
chunk_overlap=CHUNK_OVERLAP,
max_file_size=MAX_FILE_SIZE,
)
print(f" 索引完成: {stats.files_processed} 文件, {stats.chunks_created} chunks ({stats.duration_seconds:.1f}s)")
# 5. 搜索测试
print(f"\n[4/4] 构建 SearchPipeline ...")
reranker = FastEmbedReranker(config)
pipeline = SearchPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
reranker=reranker,
fts=fts,
config=config,
)
queries = [
"authentication middleware function",
"def embed_single",
"RRF fusion weights",
"fastembed TextCrossEncoder reranker",
"how to search code semantic",
]
print("\n" + "=" * 60)
for query in queries:
t0 = time.time()
results = pipeline.search(query, top_k=5)
elapsed = time.time() - t0
print(f"\nQuery: {query!r} ({elapsed*1000:.0f}ms)")
if results:
for r in results:
print(f" [{r.score:.3f}] {r.path}")
else:
print(" (无结果)")
print("=" * 60)
print("\n测试完成 ✓")
if __name__ == "__main__":
main()

View File

View File

@@ -0,0 +1,99 @@
from __future__ import annotations
import logging
from dataclasses import dataclass, field
log = logging.getLogger(__name__)
@dataclass
class Config:
# Embedding
embed_model: str = "jinaai/jina-embeddings-v2-base-code"
embed_dim: int = 768
embed_batch_size: int = 64
# GPU / execution providers
device: str = "auto" # 'auto', 'cuda', 'cpu'
embed_providers: list[str] | None = None # explicit ONNX providers override
# Backend selection: 'auto', 'faiss', 'hnswlib'
ann_backend: str = "auto"
binary_backend: str = "auto"
# Indexing pipeline
index_workers: int = 2 # number of parallel indexing workers
# HNSW index (ANNIndex)
hnsw_ef: int = 150
hnsw_M: int = 32
hnsw_ef_construction: int = 200
# Binary coarse search (BinaryStore)
binary_top_k: int = 200
# ANN fine search
ann_top_k: int = 50
# Reranker
reranker_model: str = "BAAI/bge-reranker-v2-m3"
reranker_top_k: int = 20
reranker_batch_size: int = 32
# API reranker (optional)
reranker_api_url: str = ""
reranker_api_key: str = ""
reranker_api_model: str = ""
reranker_api_max_tokens_per_batch: int = 2048
# FTS
fts_top_k: int = 50
# Fusion
fusion_k: int = 60 # RRF k parameter
fusion_weights: dict = field(default_factory=lambda: {
"exact": 0.25,
"fuzzy": 0.10,
"vector": 0.50,
"graph": 0.15,
})
def resolve_embed_providers(self) -> list[str]:
"""Return ONNX execution providers based on device config.
Priority: explicit embed_providers > device setting > auto-detect.
"""
if self.embed_providers is not None:
return list(self.embed_providers)
if self.device == "cuda":
return ["CUDAExecutionProvider", "CPUExecutionProvider"]
if self.device == "cpu":
return ["CPUExecutionProvider"]
# auto-detect
try:
import onnxruntime
available = onnxruntime.get_available_providers()
if "CUDAExecutionProvider" in available:
log.info("CUDA detected via onnxruntime, using GPU for embedding")
return ["CUDAExecutionProvider", "CPUExecutionProvider"]
except ImportError:
pass
return ["CPUExecutionProvider"]
@classmethod
def defaults(cls) -> "Config":
return cls()
@classmethod
def small(cls) -> "Config":
"""Smaller config for testing or small corpora."""
return cls(
hnsw_ef=50,
hnsw_M=16,
binary_top_k=50,
ann_top_k=20,
reranker_top_k=10,
)
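
The `fusion_k` and `fusion_weights` fields above feed a weighted Reciprocal Rank Fusion step. A minimal standalone sketch of weighted RRF, where each source contributes `weight / (k + rank)` per document (the function name and argument shapes here are illustrative, not the pipeline's actual API):

```python
def weighted_rrf(
    ranked_lists: dict[str, list[int]],
    weights: dict[str, float],
    k: int = 60,
) -> list[int]:
    """Fuse per-source rankings: score(id) = sum over sources of weight / (k + rank)."""
    scores: dict[int, float] = {}
    for source, doc_ids in ranked_lists.items():
        w = weights.get(source, 0.0)
        for rank, doc_id in enumerate(doc_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    # Higher fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With `k=60`, a document ranked highly by a heavily weighted source (e.g. `vector` at 0.50) dominates, but agreement across sources can still promote a document neither source ranked first.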

View File

@@ -0,0 +1,13 @@
from .base import BaseANNIndex, BaseBinaryIndex
from .binary import BinaryStore
from .factory import create_ann_index, create_binary_index
from .index import ANNIndex
__all__ = [
"BaseANNIndex",
"BaseBinaryIndex",
"ANNIndex",
"BinaryStore",
"create_ann_index",
"create_binary_index",
]

View File

@@ -0,0 +1,83 @@
from __future__ import annotations
from abc import ABC, abstractmethod
import numpy as np
class BaseANNIndex(ABC):
"""Abstract base class for approximate nearest neighbor indexes."""
@abstractmethod
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add float32 vectors with corresponding IDs.
Args:
ids: shape (N,) int64
vectors: shape (N, dim) float32
"""
@abstractmethod
def fine_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search for nearest neighbors.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results
Returns:
(ids, distances) as numpy arrays
"""
@abstractmethod
def save(self) -> None:
"""Persist index to disk."""
@abstractmethod
def load(self) -> None:
"""Load index from disk."""
@abstractmethod
def __len__(self) -> int:
"""Return the number of indexed items."""
class BaseBinaryIndex(ABC):
"""Abstract base class for binary vector indexes (Hamming distance)."""
@abstractmethod
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add float32 vectors (will be binary-quantized internally).
Args:
ids: shape (N,) int64
vectors: shape (N, dim) float32
"""
@abstractmethod
def coarse_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search by Hamming distance.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results
Returns:
(ids, distances) sorted ascending by distance
"""
@abstractmethod
def save(self) -> None:
"""Persist store to disk."""
@abstractmethod
def load(self) -> None:
"""Load store from disk."""
@abstractmethod
def __len__(self) -> int:
"""Return the number of stored items."""

View File

@@ -0,0 +1,173 @@
from __future__ import annotations
import logging
import math
from pathlib import Path
import numpy as np
from codexlens.config import Config
from codexlens.core.base import BaseBinaryIndex
logger = logging.getLogger(__name__)
class BinaryStore(BaseBinaryIndex):
"""Persistent binary vector store using numpy memmap.
Stores binary-quantized float32 vectors as packed uint8 arrays on disk.
Supports fast coarse search via XOR + popcount Hamming distance.
"""
def __init__(self, path: str | Path, dim: int, config: Config) -> None:
self._dir = Path(path)
self._dim = dim
self._config = config
self._packed_bytes = math.ceil(dim / 8)
self._bin_path = self._dir / "binary_store.bin"
self._ids_path = self._dir / "binary_store_ids.npy"
self._matrix: np.ndarray | None = None # shape (N, packed_bytes), uint8
self._ids: np.ndarray | None = None # shape (N,), int64
self._count: int = 0
if self._bin_path.exists() and self._ids_path.exists():
self.load()
# ------------------------------------------------------------------
# Internal helpers
# ------------------------------------------------------------------
def _quantize(self, vectors: np.ndarray) -> np.ndarray:
"""Convert float32 vectors (N, dim) to packed uint8 (N, packed_bytes)."""
binary = (vectors > 0).astype(np.uint8)
packed = np.packbits(binary, axis=1)
return packed
def _quantize_single(self, vec: np.ndarray) -> np.ndarray:
"""Convert a single float32 vector (dim,) to packed uint8 (packed_bytes,)."""
binary = (vec > 0).astype(np.uint8)
return np.packbits(binary)
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def _ensure_capacity(self, needed: int) -> None:
"""Grow pre-allocated matrix/ids arrays to fit *needed* total items."""
if self._matrix is not None and self._matrix.shape[0] >= needed:
return
new_cap = max(1024, needed)
# Double until large enough
if self._matrix is not None:
cur_cap = self._matrix.shape[0]
new_cap = max(cur_cap, 1024)
while new_cap < needed:
new_cap *= 2
new_matrix = np.zeros((new_cap, self._packed_bytes), dtype=np.uint8)
new_ids = np.zeros(new_cap, dtype=np.int64)
if self._matrix is not None and self._count > 0:
new_matrix[: self._count] = self._matrix[: self._count]
new_ids[: self._count] = self._ids[: self._count]
self._matrix = new_matrix
self._ids = new_ids
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add float32 vectors and their ids.
Does NOT call save() internally -- callers must call save()
explicitly after batch indexing.
Args:
ids: shape (N,) int64
vectors: shape (N, dim) float32
"""
if len(ids) == 0:
return
packed = self._quantize(vectors) # (N, packed_bytes)
n = len(ids)
self._ensure_capacity(self._count + n)
self._matrix[self._count : self._count + n] = packed
self._ids[self._count : self._count + n] = ids.astype(np.int64)
self._count += n
def coarse_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search by Hamming distance.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results; defaults to config.binary_top_k
Returns:
(ids, distances) sorted ascending by Hamming distance
"""
if self._matrix is None or self._count == 0:
return np.array([], dtype=np.int64), np.array([], dtype=np.int32)
k = top_k if top_k is not None else self._config.binary_top_k
k = min(k, self._count)
query_bin = self._quantize_single(query_vec) # (packed_bytes,)
# Slice to active region (matrix may be pre-allocated larger)
active_matrix = self._matrix[: self._count]
active_ids = self._ids[: self._count]
# XOR then popcount via unpackbits
xor = np.bitwise_xor(active_matrix, query_bin[np.newaxis, :]) # (N, packed_bytes)
dists = np.unpackbits(xor, axis=1).sum(axis=1).astype(np.int32) # (N,)
if k >= self._count:
order = np.argsort(dists)
else:
part = np.argpartition(dists, k)[:k]
order = part[np.argsort(dists[part])]
return active_ids[order], dists[order]
def save(self) -> None:
"""Flush binary store to disk."""
if self._matrix is None or self._count == 0:
return
self._dir.mkdir(parents=True, exist_ok=True)
# Write only the occupied portion of the pre-allocated matrix
active_matrix = self._matrix[: self._count]
mm = np.memmap(
str(self._bin_path),
dtype=np.uint8,
mode="w+",
shape=active_matrix.shape,
)
mm[:] = active_matrix
mm.flush()
del mm
np.save(str(self._ids_path), self._ids[: self._count])
def load(self) -> None:
"""Reload binary store from disk."""
ids = np.load(str(self._ids_path))
n = len(ids)
if n == 0:
return
mm = np.memmap(
str(self._bin_path),
dtype=np.uint8,
mode="r",
shape=(n, self._packed_bytes),
)
self._matrix = np.array(mm) # copy into RAM for mutation support
del mm
self._ids = ids.astype(np.int64)
self._count = n
def __len__(self) -> int:
return self._count
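
The quantize-then-popcount trick used by `coarse_search` can be demonstrated with plain NumPy. This is a standalone sketch (the function name is illustrative): sign-quantize float vectors into packed bits, XOR against the query, and count differing bits via `unpackbits`.

```python
import numpy as np

def hamming_topk(db: np.ndarray, query: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Rank rows of `db` by Hamming distance to `query` after sign quantization."""
    db_bits = np.packbits((db > 0).astype(np.uint8), axis=1)   # (N, dim/8)
    q_bits = np.packbits((query > 0).astype(np.uint8))         # (dim/8,)
    xor = np.bitwise_xor(db_bits, q_bits[np.newaxis, :])
    dists = np.unpackbits(xor, axis=1).sum(axis=1).astype(np.int32)
    order = np.argsort(dists)[:k]                              # ascending distance
    return order, dists[order]
```

Because the distance is computed over packed `uint8` rows, the scan touches only `dim/8` bytes per vector, which is what makes the coarse stage cheap relative to float search.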

View File

@@ -0,0 +1,116 @@
from __future__ import annotations
import logging
from pathlib import Path
from codexlens.config import Config
from codexlens.core.base import BaseANNIndex, BaseBinaryIndex
logger = logging.getLogger(__name__)
try:
import faiss as _faiss # noqa: F401
_FAISS_AVAILABLE = True
except ImportError:
_FAISS_AVAILABLE = False
try:
import hnswlib as _hnswlib # noqa: F401
_HNSWLIB_AVAILABLE = True
except ImportError:
_HNSWLIB_AVAILABLE = False
def _has_faiss_gpu() -> bool:
"""Check whether faiss-gpu is available (has GPU resources)."""
if not _FAISS_AVAILABLE:
return False
try:
import faiss
res = faiss.StandardGpuResources() # noqa: F841
return True
except (AttributeError, RuntimeError):
return False
def create_ann_index(path: str | Path, dim: int, config: Config) -> BaseANNIndex:
"""Create an ANN index based on config.ann_backend.
Fallback chain for 'auto': faiss-gpu -> faiss-cpu -> hnswlib.
Args:
path: directory for index persistence
dim: vector dimensionality
config: project configuration
Returns:
A BaseANNIndex implementation
Raises:
ImportError: if no suitable backend is available
"""
backend = config.ann_backend
if backend == "faiss":
from codexlens.core.faiss_index import FAISSANNIndex
return FAISSANNIndex(path, dim, config)
if backend == "hnswlib":
from codexlens.core.index import ANNIndex
return ANNIndex(path, dim, config)
# auto: try faiss first, then hnswlib
if _FAISS_AVAILABLE:
from codexlens.core.faiss_index import FAISSANNIndex
gpu_tag = " (GPU available)" if _has_faiss_gpu() else " (CPU)"
logger.info("Auto-selected FAISS ANN backend%s", gpu_tag)
return FAISSANNIndex(path, dim, config)
if _HNSWLIB_AVAILABLE:
from codexlens.core.index import ANNIndex
logger.info("Auto-selected hnswlib ANN backend")
return ANNIndex(path, dim, config)
raise ImportError(
"No ANN backend available. Install faiss-cpu, faiss-gpu, or hnswlib."
)
def create_binary_index(
path: str | Path, dim: int, config: Config
) -> BaseBinaryIndex:
"""Create a binary index based on config.binary_backend.
Fallback chain for 'auto': faiss -> numpy BinaryStore.
Args:
path: directory for index persistence
dim: vector dimensionality
config: project configuration
Returns:
A BaseBinaryIndex implementation
Raises:
ImportError: if no suitable backend is available
"""
backend = config.binary_backend
if backend == "faiss":
from codexlens.core.faiss_index import FAISSBinaryIndex
return FAISSBinaryIndex(path, dim, config)
if backend == "hnswlib":
from codexlens.core.binary import BinaryStore
return BinaryStore(path, dim, config)
# auto: try faiss first, then numpy-based BinaryStore
if _FAISS_AVAILABLE:
from codexlens.core.faiss_index import FAISSBinaryIndex
logger.info("Auto-selected FAISS binary backend")
return FAISSBinaryIndex(path, dim, config)
# numpy BinaryStore is always available (no extra deps)
from codexlens.core.binary import BinaryStore
logger.info("Auto-selected numpy BinaryStore backend")
return BinaryStore(path, dim, config)
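
The `auto` branches above follow a common try-import fallback pattern. A generic standalone sketch of that pattern (the function and candidate labels are placeholders, not part of this module's API):

```python
import importlib

def pick_backend(candidates: list[tuple[str, str]]) -> str:
    """Return the label of the first candidate whose module imports cleanly."""
    for label, module_name in candidates:
        try:
            importlib.import_module(module_name)
            return label
        except ImportError:
            continue
    raise ImportError(f"No backend available; tried: {[m for _, m in candidates]}")
```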

View File

@@ -0,0 +1,275 @@
from __future__ import annotations
import logging
import math
import threading
from pathlib import Path
import numpy as np
from codexlens.config import Config
from codexlens.core.base import BaseANNIndex, BaseBinaryIndex
logger = logging.getLogger(__name__)
try:
import faiss
_FAISS_AVAILABLE = True
except ImportError:
faiss = None # type: ignore[assignment]
_FAISS_AVAILABLE = False
def _try_gpu_index(index: "faiss.Index") -> "faiss.Index":
"""Transfer a FAISS index to GPU if faiss-gpu is available.
Returns the GPU index on success, or the original CPU index on failure.
"""
try:
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
logger.info("FAISS index transferred to GPU 0")
return gpu_index
except (AttributeError, RuntimeError) as exc:
logger.debug("GPU transfer unavailable, staying on CPU: %s", exc)
return index
def _to_cpu_for_save(index: "faiss.Index") -> "faiss.Index":
"""Convert a GPU index back to CPU for serialization."""
try:
return faiss.index_gpu_to_cpu(index)
except (AttributeError, RuntimeError):
return index
class FAISSANNIndex(BaseANNIndex):
"""FAISS-based ANN index using IndexHNSWFlat with optional GPU.
Uses Inner Product space with L2-normalized vectors for cosine similarity.
Thread-safe via RLock.
"""
def __init__(self, path: str | Path, dim: int, config: Config) -> None:
if not _FAISS_AVAILABLE:
raise ImportError(
"faiss is required. Install with: pip install faiss-cpu "
"or pip install faiss-gpu"
)
self._path = Path(path)
self._index_path = self._path / "faiss_ann.index"
self._dim = dim
self._config = config
self._lock = threading.RLock()
self._index: faiss.Index | None = None
def _ensure_loaded(self) -> None:
"""Load or initialize the index (caller holds lock)."""
if self._index is not None:
return
self.load()
def load(self) -> None:
"""Load index from disk or initialize a fresh one."""
with self._lock:
if self._index_path.exists():
idx = faiss.read_index(str(self._index_path))
logger.debug(
"Loaded FAISS ANN index from %s (%d items)",
self._index_path, idx.ntotal,
)
else:
# HNSW with flat storage, M=32 by default
m = self._config.hnsw_M
idx = faiss.IndexHNSWFlat(self._dim, m, faiss.METRIC_INNER_PRODUCT)
idx.hnsw.efConstruction = self._config.hnsw_ef_construction
idx.hnsw.efSearch = self._config.hnsw_ef
logger.debug(
"Initialized fresh FAISS HNSW index (dim=%d, M=%d)",
self._dim, m,
)
self._index = _try_gpu_index(idx)
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add L2-normalized float32 vectors.
Vectors are normalized before insertion so that Inner Product
distance equals cosine similarity.
Args:
ids: shape (N,) int64 -- currently unused by FAISS flat index
but kept for API compatibility. FAISS uses sequential IDs.
vectors: shape (N, dim) float32
"""
if len(ids) == 0:
return
vecs = np.ascontiguousarray(vectors, dtype=np.float32)
# Normalize for cosine similarity via Inner Product
faiss.normalize_L2(vecs)
with self._lock:
self._ensure_loaded()
self._index.add(vecs)
def fine_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search for nearest neighbors.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results; defaults to config.ann_top_k
Returns:
(ids, distances) as numpy arrays. For IP space, higher = more
similar, but distances are returned as-is for consumer handling.
"""
k = top_k if top_k is not None else self._config.ann_top_k
with self._lock:
self._ensure_loaded()
count = self._index.ntotal
if count == 0:
return np.array([], dtype=np.int64), np.array([], dtype=np.float32)
k = min(k, count)
# Set efSearch for HNSW accuracy
try:
self._index.hnsw.efSearch = max(self._config.hnsw_ef, k)
except AttributeError:
pass # GPU index may not expose hnsw attribute directly
q = np.ascontiguousarray(query_vec, dtype=np.float32).reshape(1, -1)
faiss.normalize_L2(q)
distances, labels = self._index.search(q, k)
return labels[0].astype(np.int64), distances[0].astype(np.float32)
def save(self) -> None:
"""Save index to disk."""
with self._lock:
if self._index is None:
return
self._path.mkdir(parents=True, exist_ok=True)
cpu_index = _to_cpu_for_save(self._index)
faiss.write_index(cpu_index, str(self._index_path))
def __len__(self) -> int:
with self._lock:
if self._index is None:
return 0
return self._index.ntotal
class FAISSBinaryIndex(BaseBinaryIndex):
"""FAISS-based binary index using IndexBinaryFlat for Hamming distance.
Vectors are binary-quantized (sign bit) before insertion.
Thread-safe via RLock.
"""
def __init__(self, path: str | Path, dim: int, config: Config) -> None:
if not _FAISS_AVAILABLE:
raise ImportError(
"faiss is required. Install with: pip install faiss-cpu "
"or pip install faiss-gpu"
)
self._path = Path(path)
self._index_path = self._path / "faiss_binary.index"
self._dim = dim
self._config = config
self._packed_bytes = math.ceil(dim / 8)
self._lock = threading.RLock()
self._index: faiss.IndexBinary | None = None
def _ensure_loaded(self) -> None:
if self._index is not None:
return
self.load()
def _quantize(self, vectors: np.ndarray) -> np.ndarray:
"""Convert float32 vectors (N, dim) to packed uint8 (N, packed_bytes)."""
binary = (vectors > 0).astype(np.uint8)
return np.packbits(binary, axis=1)
def _quantize_single(self, vec: np.ndarray) -> np.ndarray:
"""Convert a single float32 vector (dim,) to packed uint8 (1, packed_bytes)."""
binary = (vec > 0).astype(np.uint8)
return np.packbits(binary).reshape(1, -1)
def load(self) -> None:
"""Load binary index from disk or initialize a fresh one."""
with self._lock:
if self._index_path.exists():
idx = faiss.read_index_binary(str(self._index_path))
logger.debug(
"Loaded FAISS binary index from %s (%d items)",
self._index_path, idx.ntotal,
)
else:
# IndexBinaryFlat takes dimension in bits
idx = faiss.IndexBinaryFlat(self._dim)
logger.debug(
"Initialized fresh FAISS binary index (dim_bits=%d)", self._dim,
)
self._index = idx
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add float32 vectors (binary-quantized internally).
Args:
ids: shape (N,) int64 -- kept for API compatibility
vectors: shape (N, dim) float32
"""
if len(ids) == 0:
return
packed = self._quantize(vectors)
packed = np.ascontiguousarray(packed, dtype=np.uint8)
with self._lock:
self._ensure_loaded()
self._index.add(packed)
def coarse_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search by Hamming distance.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results; defaults to config.binary_top_k
Returns:
(ids, distances) sorted ascending by Hamming distance
"""
with self._lock:
self._ensure_loaded()
if self._index.ntotal == 0:
return np.array([], dtype=np.int64), np.array([], dtype=np.int32)
k = top_k if top_k is not None else self._config.binary_top_k
k = min(k, self._index.ntotal)
q = self._quantize_single(query_vec)
q = np.ascontiguousarray(q, dtype=np.uint8)
distances, labels = self._index.search(q, k)
return labels[0].astype(np.int64), distances[0].astype(np.int32)
def save(self) -> None:
"""Save binary index to disk."""
with self._lock:
if self._index is None:
return
self._path.mkdir(parents=True, exist_ok=True)
faiss.write_index_binary(self._index, str(self._index_path))
def __len__(self) -> int:
with self._lock:
if self._index is None:
return 0
return self._index.ntotal

View File

@@ -0,0 +1,136 @@
from __future__ import annotations
import logging
import threading
from pathlib import Path
import numpy as np
from codexlens.config import Config
from codexlens.core.base import BaseANNIndex
logger = logging.getLogger(__name__)
try:
import hnswlib
_HNSWLIB_AVAILABLE = True
except ImportError:
_HNSWLIB_AVAILABLE = False
class ANNIndex(BaseANNIndex):
"""HNSW-based approximate nearest neighbor index.
Lazy-loads on first use, thread-safe via RLock.
"""
def __init__(self, path: str | Path, dim: int, config: Config) -> None:
if not _HNSWLIB_AVAILABLE:
raise ImportError("hnswlib is required. Install with: pip install hnswlib")
self._path = Path(path)
self._hnsw_path = self._path / "ann_index.hnsw"
self._dim = dim
self._config = config
self._lock = threading.RLock()
self._index: hnswlib.Index | None = None
# ------------------------------------------------------------------
# Internal helpers
# ------------------------------------------------------------------
def _ensure_loaded(self) -> None:
"""Load or initialize the index (caller holds lock)."""
if self._index is not None:
return
self.load()
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def load(self) -> None:
"""Load index from disk or initialize a fresh one."""
with self._lock:
idx = hnswlib.Index(space="cosine", dim=self._dim)
if self._hnsw_path.exists():
idx.load_index(str(self._hnsw_path), max_elements=0)
idx.set_ef(self._config.hnsw_ef)
logger.debug("Loaded HNSW index from %s (%d items)", self._hnsw_path, idx.get_current_count())
else:
idx.init_index(
max_elements=1000,
ef_construction=self._config.hnsw_ef_construction,
M=self._config.hnsw_M,
)
idx.set_ef(self._config.hnsw_ef)
logger.debug("Initialized fresh HNSW index (dim=%d)", self._dim)
self._index = idx
def add(self, ids: np.ndarray, vectors: np.ndarray) -> None:
"""Add float32 vectors.
Does NOT call save() internally -- callers must call save()
explicitly after batch indexing.
Args:
ids: shape (N,) int64
vectors: shape (N, dim) float32
"""
if len(ids) == 0:
return
vecs = np.ascontiguousarray(vectors, dtype=np.float32)
with self._lock:
self._ensure_loaded()
# Expand capacity if needed
current = self._index.get_current_count()
max_el = self._index.get_max_elements()
needed = current + len(ids)
if needed > max_el:
new_cap = max(max_el * 2, needed + 100)
self._index.resize_index(new_cap)
self._index.add_items(vecs, ids.astype(np.int64))
def fine_search(
self, query_vec: np.ndarray, top_k: int | None = None
) -> tuple[np.ndarray, np.ndarray]:
"""Search for nearest neighbors.
Args:
query_vec: float32 vector of shape (dim,)
top_k: number of results; defaults to config.ann_top_k
Returns:
(ids, distances) as numpy arrays
"""
k = top_k if top_k is not None else self._config.ann_top_k
with self._lock:
self._ensure_loaded()
count = self._index.get_current_count()
if count == 0:
return np.array([], dtype=np.int64), np.array([], dtype=np.float32)
k = min(k, count)
self._index.set_ef(max(self._config.hnsw_ef, k))
q = np.ascontiguousarray(query_vec, dtype=np.float32).reshape(1, -1)
labels, distances = self._index.knn_query(q, k=k)
return labels[0].astype(np.int64), distances[0].astype(np.float32)
def save(self) -> None:
"""Save index to disk (caller may or may not hold lock)."""
with self._lock:
if self._index is None:
return
self._path.mkdir(parents=True, exist_ok=True)
self._index.save_index(str(self._hnsw_path))
def __len__(self) -> int:
with self._lock:
if self._index is None:
return 0
return self._index.get_current_count()

View File

@@ -0,0 +1,4 @@
from .base import BaseEmbedder
from .local import FastEmbedEmbedder, EMBED_PROFILES
__all__ = ["BaseEmbedder", "FastEmbedEmbedder", "EMBED_PROFILES"]

View File

@@ -0,0 +1,13 @@
from __future__ import annotations
from abc import ABC, abstractmethod
import numpy as np
class BaseEmbedder(ABC):
@abstractmethod
def embed_single(self, text: str) -> np.ndarray:
"""Embed a single text, returns float32 ndarray shape (dim,)."""
@abstractmethod
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
"""Embed a list of texts, returns list of float32 ndarrays."""

View File

@@ -0,0 +1,53 @@
from __future__ import annotations
import numpy as np
from ..config import Config
from .base import BaseEmbedder
EMBED_PROFILES = {
"small": "BAAI/bge-small-en-v1.5", # 384d
"base": "BAAI/bge-base-en-v1.5", # 768d
"large": "BAAI/bge-large-en-v1.5", # 1024d
"code": "jinaai/jina-embeddings-v2-base-code", # 768d
}
class FastEmbedEmbedder(BaseEmbedder):
"""Embedder backed by fastembed.TextEmbedding with lazy model loading."""
def __init__(self, config: Config) -> None:
self._config = config
self._model = None
def _load(self) -> None:
"""Lazy-load the fastembed TextEmbedding model on first use."""
if self._model is not None:
return
from fastembed import TextEmbedding
providers = self._config.resolve_embed_providers()
try:
self._model = TextEmbedding(
model_name=self._config.embed_model,
providers=providers,
)
except TypeError:
# Older fastembed versions may not accept providers kwarg
self._model = TextEmbedding(model_name=self._config.embed_model)
def embed_single(self, text: str) -> np.ndarray:
"""Embed a single text, returns float32 ndarray of shape (dim,)."""
self._load()
result = list(self._model.embed([text]))
return result[0].astype(np.float32)
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
"""Embed a list of texts in batches, returns list of float32 ndarrays."""
self._load()
batch_size = self._config.embed_batch_size
results: list[np.ndarray] = []
for start in range(0, len(texts), batch_size):
batch = texts[start : start + batch_size]
for vec in self._model.embed(batch):
results.append(vec.astype(np.float32))
return results

View File

@@ -0,0 +1,5 @@
from __future__ import annotations
from .pipeline import IndexingPipeline, IndexStats
__all__ = ["IndexingPipeline", "IndexStats"]

View File

@@ -0,0 +1,277 @@
"""Three-stage parallel indexing pipeline: chunk -> embed -> index.
Uses threading.Thread with queue.Queue for producer-consumer handoff.
The GIL is acceptable because embedding (onnxruntime) releases it in C extensions.
"""
from __future__ import annotations
import logging
import queue
import threading
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable
import numpy as np
from codexlens.config import Config
from codexlens.core.binary import BinaryStore
from codexlens.core.index import ANNIndex
from codexlens.embed.base import BaseEmbedder
from codexlens.search.fts import FTSEngine
logger = logging.getLogger(__name__)
# Sentinel value to signal worker shutdown
_SENTINEL = None
# Defaults for chunking (can be overridden via index_files kwargs)
_DEFAULT_MAX_CHUNK_CHARS = 800
_DEFAULT_CHUNK_OVERLAP = 100
@dataclass
class IndexStats:
"""Statistics returned after indexing completes."""
files_processed: int = 0
chunks_created: int = 0
duration_seconds: float = 0.0
class IndexingPipeline:
"""Parallel 3-stage indexing pipeline with queue-based handoff.
Stage 1 (main thread): Read files, chunk text, push to embed_queue.
Stage 2 (embed worker): Pull text batches, call embed_batch(), push vectors to index_queue.
Stage 3 (index worker): Pull vectors+ids, call BinaryStore.add(), ANNIndex.add(), FTS.add_documents().
After all stages complete, save() is called on BinaryStore and ANNIndex exactly once.
"""
def __init__(
self,
embedder: BaseEmbedder,
binary_store: BinaryStore,
ann_index: ANNIndex,
fts: FTSEngine,
config: Config,
) -> None:
self._embedder = embedder
self._binary_store = binary_store
self._ann_index = ann_index
self._fts = fts
self._config = config
def index_files(
self,
files: list[Path],
*,
root: Path | None = None,
max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS,
chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP,
max_file_size: int = 50_000,
) -> IndexStats:
"""Run the 3-stage pipeline on the given files.
Args:
files: List of file paths to index.
root: Optional root for computing relative paths. If None, uses
each file's absolute path as its identifier.
max_chunk_chars: Maximum characters per chunk.
chunk_overlap: Character overlap between consecutive chunks.
max_file_size: Skip files larger than this (bytes).
Returns:
IndexStats with counts and timing.
"""
if not files:
return IndexStats()
t0 = time.monotonic()
embed_queue: queue.Queue = queue.Queue(maxsize=4)
index_queue: queue.Queue = queue.Queue(maxsize=4)
# Track errors from workers
worker_errors: list[Exception] = []
error_lock = threading.Lock()
def _record_error(exc: Exception) -> None:
with error_lock:
worker_errors.append(exc)
# --- Start workers ---
embed_thread = threading.Thread(
target=self._embed_worker,
args=(embed_queue, index_queue, _record_error),
daemon=True,
name="indexing-embed",
)
index_thread = threading.Thread(
target=self._index_worker,
args=(index_queue, _record_error),
daemon=True,
name="indexing-index",
)
embed_thread.start()
index_thread.start()
# --- Stage 1: chunk files (main thread) ---
chunk_id = 0
files_processed = 0
chunks_created = 0
for fpath in files:
try:
if fpath.stat().st_size > max_file_size:
continue
text = fpath.read_text(encoding="utf-8", errors="replace")
except Exception as exc:
logger.debug("Skipping %s: %s", fpath, exc)
continue
rel_path = str(fpath.relative_to(root)) if root else str(fpath)
file_chunks = self._chunk_text(text, rel_path, max_chunk_chars, chunk_overlap)
if not file_chunks:
continue
files_processed += 1
# Assign sequential IDs and push batch to embed queue
batch_ids = []
batch_texts = []
batch_paths = []
for chunk_text, path in file_chunks:
batch_ids.append(chunk_id)
batch_texts.append(chunk_text)
batch_paths.append(path)
chunk_id += 1
chunks_created += len(batch_ids)
embed_queue.put((batch_ids, batch_texts, batch_paths))
# Signal embed worker: no more data
embed_queue.put(_SENTINEL)
# Wait for workers to finish
embed_thread.join()
index_thread.join()
# --- Final flush ---
self._binary_store.save()
self._ann_index.save()
duration = time.monotonic() - t0
stats = IndexStats(
files_processed=files_processed,
chunks_created=chunks_created,
duration_seconds=round(duration, 2),
)
logger.info(
"Indexing complete: %d files, %d chunks in %.1fs",
stats.files_processed,
stats.chunks_created,
stats.duration_seconds,
)
# Raise first worker error if any occurred
if worker_errors:
raise worker_errors[0]
return stats
# ------------------------------------------------------------------
# Workers
# ------------------------------------------------------------------
def _embed_worker(
self,
in_q: queue.Queue,
out_q: queue.Queue,
on_error: callable,
) -> None:
"""Stage 2: Pull chunk batches, embed, push (ids, vecs, docs) to index queue."""
try:
while True:
item = in_q.get()
if item is _SENTINEL:
break
batch_ids, batch_texts, batch_paths = item
try:
vecs = self._embedder.embed_batch(batch_texts)
vec_array = np.array(vecs, dtype=np.float32)
id_array = np.array(batch_ids, dtype=np.int64)
out_q.put((id_array, vec_array, batch_texts, batch_paths))
except Exception as exc:
logger.error("Embed worker error: %s", exc)
on_error(exc)
finally:
# Signal index worker: no more data
out_q.put(_SENTINEL)
def _index_worker(
self,
in_q: queue.Queue,
on_error: callable,
) -> None:
"""Stage 3: Pull (ids, vecs, texts, paths), write to stores."""
while True:
item = in_q.get()
if item is _SENTINEL:
break
id_array, vec_array, texts, paths = item
try:
self._binary_store.add(id_array, vec_array)
self._ann_index.add(id_array, vec_array)
fts_docs = [
(int(id_array[i]), paths[i], texts[i])
for i in range(len(id_array))
]
self._fts.add_documents(fts_docs)
except Exception as exc:
logger.error("Index worker error: %s", exc)
on_error(exc)
# ------------------------------------------------------------------
# Chunking
# ------------------------------------------------------------------
@staticmethod
def _chunk_text(
text: str,
path: str,
max_chars: int,
overlap: int,
) -> list[tuple[str, str]]:
"""Split file text into overlapping chunks.
Returns list of (chunk_text, path) tuples.
"""
if not text.strip():
return []
chunks: list[tuple[str, str]] = []
lines = text.splitlines(keepends=True)
current: list[str] = []
current_len = 0
for line in lines:
if current_len + len(line) > max_chars and current:
chunk = "".join(current)
chunks.append((chunk, path))
# overlap: keep last N characters; guard overlap == 0, since s[-0:] returns the whole string
tail = "".join(current)[-overlap:] if overlap > 0 else ""
current = [tail] if tail else []
current_len = len(tail)
current.append(line)
current_len += len(line)
if current:
chunks.append(("".join(current), path))
return chunks
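The overlap logic above is easiest to verify in isolation. Below is a minimal standalone sketch of the same line-based chunker (with an explicit guard for `overlap == 0`, since `s[-0:]` returns the whole string):

```python
def chunk_text(text: str, max_chars: int, overlap: int) -> list[str]:
    if not text.strip():
        return []
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for line in text.splitlines(keepends=True):
        if current_len + len(line) > max_chars and current:
            chunks.append("".join(current))
            # guard overlap == 0: s[-0:] would return the whole string
            tail = "".join(current)[-overlap:] if overlap > 0 else ""
            current = [tail] if tail else []
            current_len = len(tail)
        current.append(line)
        current_len += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

lines = "".join(f"line {i}\n" for i in range(10))  # ten 7-char lines
chunks = chunk_text(lines, max_chars=20, overlap=5)
# each chunk after the first starts with the 5-char tail of its predecessor
```

With `overlap=0` the chunks concatenate back to the original text exactly; with a positive overlap, adjacent chunks share a character tail so no context is lost at chunk boundaries.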

View File

@@ -0,0 +1,5 @@
from .base import BaseReranker
from .local import FastEmbedReranker
from .api import APIReranker
__all__ = ["BaseReranker", "FastEmbedReranker", "APIReranker"]

View File

@@ -0,0 +1,103 @@
from __future__ import annotations
import logging
import time
import httpx
from codexlens.config import Config
from .base import BaseReranker
logger = logging.getLogger(__name__)
class APIReranker(BaseReranker):
"""Reranker backed by a remote HTTP API (SiliconFlow/Cohere/Jina format)."""
def __init__(self, config: Config) -> None:
self._config = config
self._client = httpx.Client(
headers={
"Authorization": f"Bearer {config.reranker_api_key}",
"Content-Type": "application/json",
},
)
def score_pairs(self, query: str, documents: list[str]) -> list[float]:
if not documents:
return []
max_tokens = self._config.reranker_api_max_tokens_per_batch
batches = self._split_batches(documents, max_tokens)
scores = [0.0] * len(documents)
for batch in batches:
batch_scores = self._call_api_with_retry(query, batch)
for orig_idx, score in batch_scores.items():
scores[orig_idx] = score
return scores
def _split_batches(
self, documents: list[str], max_tokens: int
) -> list[list[tuple[int, str]]]:
batches: list[list[tuple[int, str]]] = []
current_batch: list[tuple[int, str]] = []
current_tokens = 0
for idx, text in enumerate(documents):
doc_tokens = len(text) // 4
if current_tokens + doc_tokens > max_tokens and current_batch:
batches.append(current_batch)
current_batch = []
current_tokens = 0
current_batch.append((idx, text))
current_tokens += doc_tokens
if current_batch:
batches.append(current_batch)
return batches
def _call_api_with_retry(
self,
query: str,
docs: list[tuple[int, str]],
max_retries: int = 3,
) -> dict[int, float]:
url = self._config.reranker_api_url.rstrip("/") + "/rerank"
payload = {
"model": self._config.reranker_api_model,
"query": query,
"documents": [t for _, t in docs],
}
last_exc: Exception | None = None
for attempt in range(max_retries):
try:
response = self._client.post(url, json=payload)
except Exception as exc:
last_exc = exc
time.sleep((2 ** attempt) * 0.5)
continue
if response.status_code in (429, 503):
logger.warning(
"API reranker returned HTTP %s (attempt %d/%d), retrying...",
response.status_code,
attempt + 1,
max_retries,
)
time.sleep((2 ** attempt) * 0.5)
continue
response.raise_for_status()
data = response.json()
results = data.get("results", [])
scores: dict[int, float] = {}
for item in results:
local_idx = int(item["index"])
orig_idx = docs[local_idx][0]
scores[orig_idx] = float(item["relevance_score"])
return scores
raise RuntimeError(
f"API reranker failed after {max_retries} attempts. Last error: {last_exc}"
)
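The batch splitter in `_split_batches` can be sketched standalone. It relies on the same rough heuristic of ~4 characters per token, which is an estimate, not a real tokenizer:

```python
def split_batches(documents: list[str], max_tokens: int) -> list[list[tuple[int, str]]]:
    batches: list[list[tuple[int, str]]] = []
    current: list[tuple[int, str]] = []
    current_tokens = 0
    for idx, text in enumerate(documents):
        doc_tokens = len(text) // 4  # crude ~4-chars-per-token estimate
        if current_tokens + doc_tokens > max_tokens and current:
            batches.append(current)
            current, current_tokens = [], 0
        current.append((idx, text))  # keep the original index for score merging
        current_tokens += doc_tokens
    if current:
        batches.append(current)
    return batches

docs = ["x" * 800] * 5              # each ~200 estimated tokens
batches = split_batches(docs, max_tokens=512)
# 200 + 200 fits the 512 budget, a third doc would not: batches of 2, 2, 1
```

Carrying the original index through each batch is what lets the caller write every batch's scores back into one flat list in document order.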

View File

@@ -0,0 +1,8 @@
from __future__ import annotations
from abc import ABC, abstractmethod
class BaseReranker(ABC):
@abstractmethod
def score_pairs(self, query: str, documents: list[str]) -> list[float]:
"""Score (query, doc) pairs. Returns list of floats same length as documents."""

View File

@@ -0,0 +1,25 @@
from __future__ import annotations
from codexlens.config import Config
from .base import BaseReranker
class FastEmbedReranker(BaseReranker):
"""Local reranker backed by fastembed TextCrossEncoder."""
def __init__(self, config: Config) -> None:
self._config = config
self._model = None
def _load(self) -> None:
if self._model is None:
from fastembed.rerank.cross_encoder import TextCrossEncoder
self._model = TextCrossEncoder(model_name=self._config.reranker_model)
def score_pairs(self, query: str, documents: list[str]) -> list[float]:
self._load()
results = list(self._model.rerank(query, documents))
scores = [0.0] * len(documents)
for r in results:
scores[r.index] = float(r.score)
return scores

View File

@@ -0,0 +1,8 @@
from .fts import FTSEngine
from .fusion import reciprocal_rank_fusion, detect_query_intent, QueryIntent, DEFAULT_WEIGHTS
from .pipeline import SearchPipeline, SearchResult
__all__ = [
"FTSEngine", "reciprocal_rank_fusion", "detect_query_intent",
"QueryIntent", "DEFAULT_WEIGHTS", "SearchPipeline", "SearchResult",
]

View File

@@ -0,0 +1,69 @@
from __future__ import annotations
import sqlite3
from pathlib import Path
class FTSEngine:
def __init__(self, db_path: str | Path) -> None:
self._conn = sqlite3.connect(str(db_path), check_same_thread=False)
self._conn.execute(
"CREATE VIRTUAL TABLE IF NOT EXISTS docs "
"USING fts5(content, tokenize='porter unicode61')"
)
self._conn.execute(
"CREATE TABLE IF NOT EXISTS docs_meta "
"(id INTEGER PRIMARY KEY, path TEXT)"
)
self._conn.commit()
def add_documents(self, docs: list[tuple[int, str, str]]) -> None:
"""Add documents in batch. docs: list of (id, path, content)."""
if not docs:
return
self._conn.executemany(
"INSERT OR REPLACE INTO docs_meta (id, path) VALUES (?, ?)",
[(doc_id, path) for doc_id, path, content in docs],
)
self._conn.executemany(
"INSERT OR REPLACE INTO docs (rowid, content) VALUES (?, ?)",
[(doc_id, content) for doc_id, path, content in docs],
)
self._conn.commit()
def exact_search(self, query: str, top_k: int = 50) -> list[tuple[int, float]]:
"""FTS5 MATCH query, return (id, bm25_score) sorted by score descending."""
try:
rows = self._conn.execute(
"SELECT rowid, bm25(docs) AS score FROM docs "
"WHERE docs MATCH ? ORDER BY score LIMIT ?",
(query, top_k),
).fetchall()
except sqlite3.OperationalError:
return []
# bm25 in SQLite FTS5 returns negative values (lower = better match)
# Negate so higher is better
return [(int(row[0]), -float(row[1])) for row in rows]
def fuzzy_search(self, query: str, top_k: int = 50) -> list[tuple[int, float]]:
"""Prefix search: each token + '*', return (id, score) sorted descending."""
tokens = query.strip().split()
if not tokens:
return []
prefix_query = " ".join(t + "*" for t in tokens)
try:
rows = self._conn.execute(
"SELECT rowid, bm25(docs) AS score FROM docs "
"WHERE docs MATCH ? ORDER BY score LIMIT ?",
(prefix_query, top_k),
).fetchall()
except sqlite3.OperationalError:
return []
return [(int(row[0]), -float(row[1])) for row in rows]
def get_content(self, doc_id: int) -> str:
"""Retrieve content for a doc_id."""
row = self._conn.execute(
"SELECT content FROM docs WHERE rowid = ?", (doc_id,)
).fetchone()
return row[0] if row else ""
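The FTS5 behaviour the engine relies on, `bm25()` returning negative scores where lower means a better match, can be demonstrated with an in-memory database (assuming the local `sqlite3` build ships with FTS5, as standard CPython binaries do):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5(content, tokenize='porter unicode61')"
)
conn.executemany(
    "INSERT INTO docs (rowid, content) VALUES (?, ?)",
    [
        (1, "def authenticate(user, password)"),
        (2, "class User with name and email"),
        (3, "cache_set stores a value in redis"),
    ],
)
rows = conn.execute(
    "SELECT rowid, bm25(docs) AS score FROM docs "
    "WHERE docs MATCH ? ORDER BY score",
    ("authenticate",),
).fetchall()
# bm25 is negative for matches (lower = better), so negate as FTSEngine does
hits = [(int(r), -float(s)) for r, s in rows]
```

Only document 1 matches, and its negated score is positive, which is why `exact_search` can hand the results straight to fusion with "higher is better" semantics.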

View File

@@ -0,0 +1,106 @@
from __future__ import annotations
import re
from enum import Enum
DEFAULT_WEIGHTS: dict[str, float] = {
"exact": 0.25,
"fuzzy": 0.10,
"vector": 0.50,
"graph": 0.15,
}
_CODE_CAMEL_RE = re.compile(r"[a-z][A-Z]")
_CODE_SNAKE_RE = re.compile(r"\b[a-z_]+_[a-z_]+\b")
_CODE_SYMBOLS_RE = re.compile(r"[.\[\](){}]|->|::")
_CODE_KEYWORDS_RE = re.compile(r"\b(import|def|class|return|from|async|await|lambda|yield)\b")
_QUESTION_WORDS_RE = re.compile(r"\b(how|what|why|when|where|which|who|does|do|is|are|can|should)\b", re.IGNORECASE)
class QueryIntent(Enum):
CODE_SYMBOL = "code_symbol"
NATURAL_LANGUAGE = "natural"
MIXED = "mixed"
def detect_query_intent(query: str) -> QueryIntent:
"""Detect whether query is a code symbol, natural language, or mixed."""
words = query.strip().split()
word_count = len(words)
code_signals = 0
natural_signals = 0
if _CODE_CAMEL_RE.search(query):
code_signals += 2
if _CODE_SNAKE_RE.search(query):
code_signals += 2
if _CODE_SYMBOLS_RE.search(query):
code_signals += 2
if _CODE_KEYWORDS_RE.search(query):
code_signals += 2
if "`" in query:
code_signals += 1
if word_count < 4:
code_signals += 1
if _QUESTION_WORDS_RE.search(query):
natural_signals += 2
if word_count > 5:
natural_signals += 2
if code_signals == 0 and word_count >= 3:
natural_signals += 1
if code_signals >= 2 and natural_signals == 0:
return QueryIntent.CODE_SYMBOL
if natural_signals >= 2 and code_signals == 0:
return QueryIntent.NATURAL_LANGUAGE
if natural_signals > code_signals:
return QueryIntent.NATURAL_LANGUAGE
if code_signals > natural_signals:
return QueryIntent.CODE_SYMBOL
return QueryIntent.MIXED
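A stripped-down illustration of the signal counting above, using only a subset of the real patterns (the regexes and thresholds here are deliberately simplified for the sketch):

```python
import re

# camelCase pushes toward CODE_SYMBOL, question words toward NATURAL_LANGUAGE
camel = re.compile(r"[a-z][A-Z]")
question = re.compile(r"\b(how|what|why)\b", re.IGNORECASE)

def classify(query: str) -> str:
    code = 2 if camel.search(query) else 0
    natural = 2 if question.search(query) else 0
    if code > natural:
        return "code_symbol"
    if natural > code:
        return "natural"
    return "mixed"
```

The full detector accumulates several such weighted signals before comparing the two counters, but the decision structure is the same.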
def get_adaptive_weights(intent: QueryIntent, base: dict | None = None) -> dict[str, float]:
"""Return weights adapted to query intent."""
weights = dict(base or DEFAULT_WEIGHTS)
if intent == QueryIntent.CODE_SYMBOL:
weights["exact"] = 0.45
weights["vector"] = 0.35
elif intent == QueryIntent.NATURAL_LANGUAGE:
weights["vector"] = 0.65
weights["exact"] = 0.15
# MIXED: use weights as-is
return weights
def reciprocal_rank_fusion(
results: dict[str, list[tuple[int, float]]],
weights: dict[str, float] | None = None,
k: int = 60,
) -> list[tuple[int, float]]:
"""Fuse ranked result lists using Reciprocal Rank Fusion.
results: {source_name: [(doc_id, score), ...]} each list sorted desc by score.
weights: weight per source (defaults to equal weight across all sources).
k: RRF constant (default 60).
Returns sorted list of (doc_id, fused_score) descending.
"""
if not results:
return []
sources = list(results.keys())
if weights is None:
equal_w = 1.0 / len(sources)
weights = {s: equal_w for s in sources}
scores: dict[int, float] = {}
for source, ranked_list in results.items():
w = weights.get(source, 0.0)
for rank, (doc_id, _) in enumerate(ranked_list, start=1):
scores[doc_id] = scores.get(doc_id, 0.0) + w * (1.0 / (k + rank))
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
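A worked example of the fusion: with `k=60` and equal weights, a document that appears in two ranked lists outranks one that appears in only one, even if it is never ranked first. This standalone version mirrors the function above with the equal-weight default inlined:

```python
def rrf(results: dict[str, list[tuple[int, float]]], k: int = 60) -> list[tuple[int, float]]:
    weight = 1.0 / len(results)  # equal weight per source
    scores: dict[int, float] = {}
    for ranked in results.values():
        for rank, (doc_id, _) in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

fused = rrf({
    "vector": [(1, 0.95), (2, 0.80)],  # doc 1 best by cosine
    "exact":  [(2, 12.3), (3, 9.1)],   # doc 2 best by bm25
})
# doc 2 contributes 0.5/62 + 0.5/61, beating doc 1's single 0.5/61
```

Note that only ranks matter: the raw scores in each list are discarded, which is what makes RRF robust to incomparable score scales across sources.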

View File

@@ -0,0 +1,163 @@
from __future__ import annotations
import logging
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import numpy as np
from ..config import Config
from ..core import ANNIndex, BinaryStore
from ..embed import BaseEmbedder
from ..rerank import BaseReranker
from .fts import FTSEngine
from .fusion import (
DEFAULT_WEIGHTS,
detect_query_intent,
get_adaptive_weights,
reciprocal_rank_fusion,
)
_log = logging.getLogger(__name__)
@dataclass
class SearchResult:
id: int
path: str
score: float
snippet: str = ""
class SearchPipeline:
def __init__(
self,
embedder: BaseEmbedder,
binary_store: BinaryStore,
ann_index: ANNIndex,
reranker: BaseReranker,
fts: FTSEngine,
config: Config,
) -> None:
self._embedder = embedder
self._binary_store = binary_store
self._ann_index = ann_index
self._reranker = reranker
self._fts = fts
self._config = config
# -- Helper: vector search (binary coarse + ANN fine) -----------------
def _vector_search(
self, query_vec: np.ndarray
) -> list[tuple[int, float]]:
"""Run binary coarse search then ANN fine search and intersect."""
cfg = self._config
# Binary coarse search -> candidate_ids set
candidate_ids_list, _ = self._binary_store.coarse_search(
query_vec, top_k=cfg.binary_top_k
)
candidate_ids = set(candidate_ids_list)
# ANN fine search on full index, then intersect with binary candidates
ann_ids, ann_scores = self._ann_index.fine_search(
query_vec, top_k=cfg.ann_top_k
)
# Keep only results that appear in binary candidates (2-stage funnel)
vector_results: list[tuple[int, float]] = [
(int(doc_id), float(score))
for doc_id, score in zip(ann_ids, ann_scores)
if int(doc_id) in candidate_ids
]
# Fall back to full ANN results if intersection is empty
if not vector_results:
vector_results = [
(int(doc_id), float(score))
for doc_id, score in zip(ann_ids, ann_scores)
]
return vector_results
# -- Helper: FTS search (exact + fuzzy) ------------------------------
def _fts_search(
self, query: str
) -> tuple[list[tuple[int, float]], list[tuple[int, float]]]:
"""Run exact and fuzzy full-text search."""
cfg = self._config
exact_results = self._fts.exact_search(query, top_k=cfg.fts_top_k)
fuzzy_results = self._fts.fuzzy_search(query, top_k=cfg.fts_top_k)
return exact_results, fuzzy_results
# -- Main search entry point -----------------------------------------
def search(self, query: str, top_k: int | None = None) -> list[SearchResult]:
cfg = self._config
final_top_k = top_k if top_k is not None else cfg.reranker_top_k
# 1. Detect intent -> adaptive weights
intent = detect_query_intent(query)
weights = get_adaptive_weights(intent, cfg.fusion_weights)
# 2. Embed query
query_vec = self._embedder.embed([query])[0]
# 3. Parallel vector + FTS search
vector_results: list[tuple[int, float]] = []
exact_results: list[tuple[int, float]] = []
fuzzy_results: list[tuple[int, float]] = []
with ThreadPoolExecutor(max_workers=2) as pool:
vec_future = pool.submit(self._vector_search, query_vec)
fts_future = pool.submit(self._fts_search, query)
# Collect vector results
try:
vector_results = vec_future.result()
except Exception:
_log.warning("Vector search failed, using empty results", exc_info=True)
# Collect FTS results
try:
exact_results, fuzzy_results = fts_future.result()
except Exception:
_log.warning("FTS search failed, using empty results", exc_info=True)
# 4. RRF fusion
fusion_input: dict[str, list[tuple[int, float]]] = {}
if vector_results:
fusion_input["vector"] = vector_results
if exact_results:
fusion_input["exact"] = exact_results
if fuzzy_results:
fusion_input["fuzzy"] = fuzzy_results
if not fusion_input:
return []
fused = reciprocal_rank_fusion(fusion_input, weights=weights, k=cfg.fusion_k)
# 5. Rerank top candidates
rerank_ids = [doc_id for doc_id, _ in fused[:50]]
contents = [self._fts.get_content(doc_id) for doc_id in rerank_ids]
rerank_scores = self._reranker.score_pairs(query, contents)
# 6. Sort by rerank score, build SearchResult list
ranked = sorted(
zip(rerank_ids, rerank_scores), key=lambda x: x[1], reverse=True
)
results: list[SearchResult] = []
for doc_id, score in ranked[:final_top_k]:
# NOTE: reaches into FTSEngine internals; a public path-lookup method would be cleaner
path = self._fts._conn.execute(
"SELECT path FROM docs_meta WHERE id = ?", (doc_id,)
).fetchone()
results.append(
SearchResult(
id=doc_id,
path=path[0] if path else "",
score=float(score),
snippet=self._fts.get_content(doc_id)[:200],
)
)
return results
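The coarse-then-fine funnel in `_vector_search` can be sketched with plain NumPy: binarize vectors by sign, rank by Hamming distance on the packed bits, and keep only survivors for the fine pass. The actual `BinaryStore` layout may differ; the names and shapes here are illustrative:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    # sign-binarize then pack 8 bits per byte: (n, d) float32 -> (n, d // 8) uint8
    return np.packbits((vecs > 0).astype(np.uint8), axis=-1)

def hamming_topk(query_bits: np.ndarray, db_bits: np.ndarray, top_k: int):
    # XOR then per-row popcount = Hamming distance to every stored code
    dists = np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=-1).sum(axis=-1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 64)).astype(np.float32)
codes = binarize(db)
query = db[42]                       # query with a stored vector...
ids, dists = hamming_topk(binarize(query[None, :])[0], codes, top_k=5)
# ...so row 42 must come back first with Hamming distance 0
```

The coarse pass is cheap (bitwise ops on 1/32 of the float storage), which is why the pipeline can afford a generous `binary_top_k` before the ANN fine search intersects it down.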

View File

@@ -0,0 +1,108 @@
import pytest
import numpy as np
import tempfile
from pathlib import Path
from codexlens.config import Config
from codexlens.core import ANNIndex, BinaryStore
from codexlens.embed.base import BaseEmbedder
from codexlens.rerank.base import BaseReranker
from codexlens.search.fts import FTSEngine
from codexlens.search.pipeline import SearchPipeline
# Test documents: 20 code snippets with id, path, content
TEST_DOCS = [
(0, "auth.py", "def authenticate(user, password): return check_hash(password, user.hash)"),
(1, "auth.py", "def authorize(user, permission): return permission in user.roles"),
(2, "models.py", "class User: def __init__(self, name, email): self.name = name; self.email = email"),
(3, "models.py", "class Session: token = None; expires_at = None"),
(4, "middleware.py", "def auth_middleware(request): token = request.headers.get('Authorization')"),
(5, "utils.py", "def hash_password(password): import bcrypt; return bcrypt.hashpw(password)"),
(6, "config.py", "DATABASE_URL = os.environ.get('DATABASE_URL', 'sqlite:///db.sqlite3')"),
(7, "search.py", "def search_users(query): return User.objects.filter(name__icontains=query)"),
(8, "api.py", "def get_user(request, user_id): user = User.objects.get(id=user_id)"),
(9, "api.py", "def create_user(request): data = request.json(); user = User(**data)"),
(10, "tests.py", "def test_authenticate(): assert authenticate('admin', 'pass') is not None"),
(11, "tests.py", "def test_search(): results = search_users('alice'); assert len(results) > 0"),
(12, "router.py", "app.route('/users', methods=['GET'])(list_users)"),
(13, "router.py", "app.route('/login', methods=['POST'])(login_handler)"),
(14, "db.py", "def get_connection(): return sqlite3.connect(DATABASE_URL)"),
(15, "cache.py", "def cache_get(key): return redis_client.get(key)"),
(16, "cache.py", "def cache_set(key, value, ttl=3600): redis_client.setex(key, ttl, value)"),
(17, "errors.py", "class AuthError(Exception): status_code = 401"),
(18, "errors.py", "class NotFoundError(Exception): status_code = 404"),
(19, "validators.py", "def validate_email(email): return '@' in email and '.' in email.split('@')[1]"),
]
DIM = 32 # Use small dim for fast tests
def make_stable_vec(doc_id: int, dim: int = DIM) -> np.ndarray:
"""Generate a deterministic float32 vector for a given doc_id."""
rng = np.random.default_rng(seed=doc_id)
vec = rng.standard_normal(dim).astype(np.float32)
vec /= np.linalg.norm(vec)
return vec
class MockEmbedder(BaseEmbedder):
"""Returns stable deterministic vectors based on content hash."""
def embed_single(self, text: str) -> np.ndarray:
seed = hash(text) % (2**31)
rng = np.random.default_rng(seed=seed)
vec = rng.standard_normal(DIM).astype(np.float32)
vec /= np.linalg.norm(vec)
return vec
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
return [self.embed_single(t) for t in texts]
def embed(self, texts: list[str]) -> list[np.ndarray]:
"""Called by SearchPipeline as self._embedder.embed([query])[0]."""
return self.embed_batch(texts)
class MockReranker(BaseReranker):
"""Returns score based on simple keyword overlap."""
def score_pairs(self, query: str, documents: list[str]) -> list[float]:
query_words = set(query.lower().split())
scores = []
for doc in documents:
doc_words = set(doc.lower().split())
overlap = len(query_words & doc_words)
scores.append(float(overlap) / max(len(query_words), 1))
return scores
@pytest.fixture
def config():
return Config.small()  # hnsw_ef=50, hnsw_M=16, binary_top_k=50, ann_top_k=20, reranker_top_k=10
@pytest.fixture
def search_pipeline(tmp_path, config):
"""Build a full SearchPipeline with 20 test docs indexed."""
embedder = MockEmbedder()
binary_store = BinaryStore(tmp_path / "binary", dim=DIM, config=config)
ann_index = ANNIndex(tmp_path / "ann.hnsw", dim=DIM, config=config)
fts = FTSEngine(tmp_path / "fts.db")
reranker = MockReranker()
# Index all test docs
ids = np.array([d[0] for d in TEST_DOCS], dtype=np.int64)
vectors = np.array([embedder.embed_single(d[2]) for d in TEST_DOCS], dtype=np.float32)
binary_store.add(ids, vectors)
ann_index.add(ids, vectors)
fts.add_documents(TEST_DOCS)
return SearchPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
reranker=reranker,
fts=fts,
config=config,
)

View File

@@ -0,0 +1,44 @@
"""Integration tests for SearchPipeline using real components and mock embedder/reranker."""
from __future__ import annotations
def test_vector_search_returns_results(search_pipeline):
results = search_pipeline.search("authentication middleware")
assert len(results) > 0
assert all(isinstance(r.score, float) for r in results)
def test_exact_keyword_search(search_pipeline):
results = search_pipeline.search("authenticate")
assert len(results) > 0
result_ids = {r.id for r in results}
# Doc 0 and 10 both contain "authenticate"
assert result_ids & {0, 10}, f"Expected doc 0 or 10 in results, got {result_ids}"
def test_pipeline_top_k_limit(search_pipeline):
results = search_pipeline.search("user", top_k=5)
assert len(results) <= 5
def test_search_result_fields_populated(search_pipeline):
results = search_pipeline.search("password")
assert len(results) > 0
for r in results:
assert r.id >= 0
assert r.score >= 0
assert isinstance(r.path, str)
def test_empty_query_handled(search_pipeline):
results = search_pipeline.search("")
assert isinstance(results, list) # no exception
def test_different_queries_give_different_results(search_pipeline):
r1 = search_pipeline.search("authenticate user")
r2 = search_pipeline.search("cache redis")
# Results should differ (different top IDs or scores), unless both are empty
ids1 = [r.id for r in r1]
ids2 = [r.id for r in r2]
assert ids1 != ids2 or len(r1) == 0

View File

@@ -0,0 +1,31 @@
from codexlens.config import Config
def test_config_instantiates_no_args():
cfg = Config()
assert cfg is not None
def test_defaults_hnsw_ef():
cfg = Config.defaults()
assert cfg.hnsw_ef == 150
def test_defaults_hnsw_M():
cfg = Config.defaults()
assert cfg.hnsw_M == 32
def test_small_hnsw_ef():
cfg = Config.small()
assert cfg.hnsw_ef == 50
def test_custom_instantiation():
cfg = Config(hnsw_ef=100)
assert cfg.hnsw_ef == 100
def test_fusion_weights_keys():
cfg = Config()
assert set(cfg.fusion_weights.keys()) == {"exact", "fuzzy", "vector", "graph"}

View File

@@ -0,0 +1,136 @@
"""Unit tests for BinaryStore and ANNIndex (no fastembed required)."""
from __future__ import annotations
import concurrent.futures
import tempfile
from pathlib import Path
import numpy as np
import pytest
from codexlens.config import Config
from codexlens.core import ANNIndex, BinaryStore
DIM = 32
RNG = np.random.default_rng(42)
def make_vectors(n: int, dim: int = DIM) -> np.ndarray:
return RNG.standard_normal((n, dim)).astype(np.float32)
def make_ids(n: int, start: int = 0) -> np.ndarray:
return np.arange(start, start + n, dtype=np.int64)
# ---------------------------------------------------------------------------
# BinaryStore tests
# ---------------------------------------------------------------------------
class TestBinaryStore:
def test_binary_store_add_and_search(self, tmp_path: Path) -> None:
cfg = Config.small()
store = BinaryStore(tmp_path, DIM, cfg)
vecs = make_vectors(10)
ids = make_ids(10)
store.add(ids, vecs)
assert len(store) == 10
top_k = 5
ret_ids, ret_dists = store.coarse_search(vecs[0], top_k=top_k)
assert ret_ids.shape == (top_k,)
assert ret_dists.shape == (top_k,)
# distances are non-negative integers
assert (ret_dists >= 0).all()
def test_binary_hamming_correctness(self, tmp_path: Path) -> None:
cfg = Config.small()
store = BinaryStore(tmp_path, DIM, cfg)
vecs = make_vectors(20)
ids = make_ids(20)
store.add(ids, vecs)
# Query with the exact stored vector; it must be the top-1 result
query = vecs[7]
ret_ids, ret_dists = store.coarse_search(query, top_k=1)
assert ret_ids[0] == 7
assert ret_dists[0] == 0 # Hamming distance to itself is 0
def test_binary_store_persist(self, tmp_path: Path) -> None:
cfg = Config.small()
store = BinaryStore(tmp_path, DIM, cfg)
vecs = make_vectors(15)
ids = make_ids(15)
store.add(ids, vecs)
store.save()
# Load into a fresh instance
store2 = BinaryStore(tmp_path, DIM, cfg)
assert len(store2) == 15
query = vecs[3]
ret_ids, ret_dists = store2.coarse_search(query, top_k=1)
assert ret_ids[0] == 3
assert ret_dists[0] == 0
# ---------------------------------------------------------------------------
# ANNIndex tests
# ---------------------------------------------------------------------------
class TestANNIndex:
def test_ann_index_add_and_search(self, tmp_path: Path) -> None:
cfg = Config.small()
idx = ANNIndex(tmp_path, DIM, cfg)
vecs = make_vectors(50)
ids = make_ids(50)
idx.add(ids, vecs)
assert len(idx) == 50
ret_ids, ret_dists = idx.fine_search(vecs[0], top_k=5)
assert len(ret_ids) == 5
assert len(ret_dists) == 5
def test_ann_index_thread_safety(self, tmp_path: Path) -> None:
cfg = Config.small()
idx = ANNIndex(tmp_path, DIM, cfg)
vecs = make_vectors(50)
ids = make_ids(50)
idx.add(ids, vecs)
query = vecs[0]
errors: list[Exception] = []
def search() -> None:
try:
idx.fine_search(query, top_k=3)
except Exception as exc:
errors.append(exc)
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
futures = [pool.submit(search) for _ in range(5)]
concurrent.futures.wait(futures)
assert errors == [], f"Thread safety errors: {errors}"
def test_ann_index_save_load(self, tmp_path: Path) -> None:
cfg = Config.small()
idx = ANNIndex(tmp_path, DIM, cfg)
vecs = make_vectors(30)
ids = make_ids(30)
idx.add(ids, vecs)
idx.save()
# Load into a fresh instance
idx2 = ANNIndex(tmp_path, DIM, cfg)
idx2.load()
assert len(idx2) == 30
ret_ids, ret_dists = idx2.fine_search(vecs[10], top_k=1)
assert len(ret_ids) == 1
assert ret_ids[0] == 10

View File

@@ -0,0 +1,80 @@
from __future__ import annotations
import sys
import types
import unittest
from unittest.mock import MagicMock, patch
import numpy as np
def _make_fastembed_mock():
"""Build a minimal fastembed stub so imports succeed without the real package."""
fastembed_mod = types.ModuleType("fastembed")
fastembed_mod.TextEmbedding = MagicMock()
sys.modules.setdefault("fastembed", fastembed_mod)
return fastembed_mod
_make_fastembed_mock()
from codexlens.config import Config # noqa: E402
from codexlens.embed.base import BaseEmbedder # noqa: E402
from codexlens.embed.local import EMBED_PROFILES, FastEmbedEmbedder # noqa: E402
class TestEmbedSingle(unittest.TestCase):
def test_embed_single_returns_float32_ndarray(self):
config = Config()
embedder = FastEmbedEmbedder(config)
mock_model = MagicMock()
mock_model.embed.return_value = iter([np.ones(384, dtype=np.float64)])
# Inject mock model directly to bypass lazy load (no real fastembed needed)
embedder._model = mock_model
result = embedder.embed_single("hello world")
self.assertIsInstance(result, np.ndarray)
self.assertEqual(result.dtype, np.float32)
self.assertEqual(result.shape, (384,))
class TestEmbedBatch(unittest.TestCase):
def test_embed_batch_returns_list(self):
config = Config()
embedder = FastEmbedEmbedder(config)
vecs = [np.ones(384, dtype=np.float64) * i for i in range(3)]
mock_model = MagicMock()
mock_model.embed.return_value = iter(vecs)
embedder._model = mock_model
result = embedder.embed_batch(["a", "b", "c"])
self.assertIsInstance(result, list)
self.assertEqual(len(result), 3)
for arr in result:
self.assertIsInstance(arr, np.ndarray)
self.assertEqual(arr.dtype, np.float32)
class TestEmbedProfiles(unittest.TestCase):
def test_embed_profiles_all_have_valid_keys(self):
expected_keys = {"small", "base", "large", "code"}
self.assertEqual(set(EMBED_PROFILES.keys()), expected_keys)
def test_embed_profiles_model_ids_non_empty(self):
for key, model_id in EMBED_PROFILES.items():
self.assertIsInstance(model_id, str, msg=f"{key} model id should be str")
self.assertTrue(len(model_id) > 0, msg=f"{key} model id should be non-empty")
class TestBaseEmbedderAbstract(unittest.TestCase):
def test_base_embedder_is_abstract(self):
with self.assertRaises(TypeError):
BaseEmbedder() # type: ignore[abstract]
if __name__ == "__main__":
unittest.main()

View File

@@ -0,0 +1,179 @@
from __future__ import annotations
import types
from unittest.mock import MagicMock, patch
import pytest
from codexlens.config import Config
from codexlens.rerank.base import BaseReranker
from codexlens.rerank.local import FastEmbedReranker
from codexlens.rerank.api import APIReranker
# ---------------------------------------------------------------------------
# BaseReranker
# ---------------------------------------------------------------------------
def test_base_reranker_is_abstract():
with pytest.raises(TypeError):
BaseReranker() # type: ignore[abstract]
# ---------------------------------------------------------------------------
# FastEmbedReranker
# ---------------------------------------------------------------------------
def _make_rerank_result(index: int, score: float) -> object:
obj = types.SimpleNamespace(index=index, score=score)
return obj
def test_local_reranker_score_pairs_length():
config = Config()
reranker = FastEmbedReranker(config)
mock_results = [
_make_rerank_result(0, 0.9),
_make_rerank_result(1, 0.5),
_make_rerank_result(2, 0.1),
]
mock_model = MagicMock()
mock_model.rerank.return_value = iter(mock_results)
reranker._model = mock_model
docs = ["doc0", "doc1", "doc2"]
scores = reranker.score_pairs("query", docs)
assert len(scores) == 3
def test_local_reranker_preserves_order():
config = Config()
reranker = FastEmbedReranker(config)
# rerank returns results in reverse order (index 2, 1, 0)
mock_results = [
_make_rerank_result(2, 0.1),
_make_rerank_result(1, 0.5),
_make_rerank_result(0, 0.9),
]
mock_model = MagicMock()
mock_model.rerank.return_value = iter(mock_results)
reranker._model = mock_model
docs = ["doc0", "doc1", "doc2"]
scores = reranker.score_pairs("query", docs)
assert scores[0] == pytest.approx(0.9)
assert scores[1] == pytest.approx(0.5)
assert scores[2] == pytest.approx(0.1)
# ---------------------------------------------------------------------------
# APIReranker
# ---------------------------------------------------------------------------
def _make_config(max_tokens_per_batch: int = 512) -> Config:
return Config(
reranker_api_url="https://api.example.com",
reranker_api_key="test-key",
reranker_api_model="test-model",
reranker_api_max_tokens_per_batch=max_tokens_per_batch,
)
def test_api_reranker_batch_splitting():
config = _make_config(max_tokens_per_batch=512)
with patch("httpx.Client"):
reranker = APIReranker(config)
# 10 docs, each ~200 tokens (800 chars)
docs = ["x" * 800] * 10
batches = reranker._split_batches(docs, max_tokens=512)
# Each doc is 200 tokens; batches should have at most 2 docs (200+200=400 <= 512, 400+200=600 > 512)
assert len(batches) > 1
for batch in batches:
total = sum(len(text) // 4 for _, text in batch)
assert total <= 512 or len(batch) == 1
def test_api_reranker_retry_on_429():
config = _make_config()
mock_429 = MagicMock()
mock_429.status_code = 429
mock_200 = MagicMock()
mock_200.status_code = 200
mock_200.json.return_value = {
"results": [
{"index": 0, "relevance_score": 0.8},
{"index": 1, "relevance_score": 0.3},
]
}
mock_200.raise_for_status = MagicMock()
with patch("httpx.Client") as mock_client_cls:
mock_client = MagicMock()
mock_client_cls.return_value = mock_client
mock_client.post.side_effect = [mock_429, mock_429, mock_200]
reranker = APIReranker(config)
with patch("time.sleep"):
result = reranker._call_api_with_retry(
"query",
[(0, "doc0"), (1, "doc1")],
max_retries=3,
)
assert mock_client.post.call_count == 3
assert 0 in result
assert 1 in result
def test_api_reranker_merge_batches():
config = _make_config(max_tokens_per_batch=100)
    # 4 docs of 200 chars (~50 tokens) each: 50+50=100 <= 100 but 100+50=150 > 100,
    # so the splitter should produce 2 batches of 2 docs each
    docs = ["x" * 200] * 4
batch0_response = MagicMock()
batch0_response.status_code = 200
batch0_response.json.return_value = {
"results": [
{"index": 0, "relevance_score": 0.9},
{"index": 1, "relevance_score": 0.8},
]
}
batch0_response.raise_for_status = MagicMock()
batch1_response = MagicMock()
batch1_response.status_code = 200
batch1_response.json.return_value = {
"results": [
{"index": 0, "relevance_score": 0.7},
{"index": 1, "relevance_score": 0.6},
]
}
batch1_response.raise_for_status = MagicMock()
with patch("httpx.Client") as mock_client_cls:
mock_client = MagicMock()
mock_client_cls.return_value = mock_client
mock_client.post.side_effect = [batch0_response, batch1_response]
reranker = APIReranker(config)
with patch("time.sleep"):
scores = reranker.score_pairs("query", docs)
assert len(scores) == 4
# All original indices should have scores
assert all(s > 0 for s in scores)

@@ -0,0 +1,156 @@
"""Unit tests for search layer: FTSEngine, fusion, and SearchPipeline."""
from __future__ import annotations
from unittest.mock import MagicMock
import pytest
from codexlens.search.fts import FTSEngine
from codexlens.search.fusion import (
DEFAULT_WEIGHTS,
QueryIntent,
detect_query_intent,
get_adaptive_weights,
reciprocal_rank_fusion,
)
from codexlens.search.pipeline import SearchPipeline, SearchResult
from codexlens.config import Config
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def make_fts(docs: list[tuple[int, str, str]] | None = None) -> FTSEngine:
"""Create an in-memory FTSEngine and optionally add documents."""
engine = FTSEngine(":memory:")
if docs:
engine.add_documents(docs)
return engine
# ---------------------------------------------------------------------------
# FTSEngine tests
# ---------------------------------------------------------------------------
def test_fts_add_and_exact_search():
docs = [
(1, "a.py", "def authenticate user password login"),
(2, "b.py", "connect to database with credentials"),
(3, "c.py", "render template html response"),
]
engine = make_fts(docs)
results = engine.exact_search("authenticate", top_k=10)
ids = [r[0] for r in results]
assert 1 in ids, "doc 1 should match 'authenticate'"
    assert 2 not in ids or results[0][0] == 1  # doc 2 has no match; even if it appears, doc 1 must rank first
def test_fts_fuzzy_search_prefix():
docs = [
(10, "auth.py", "authentication token refresh"),
(11, "db.py", "database connection pool"),
(12, "ui.py", "render button click handler"),
]
engine = make_fts(docs)
# Prefix 'auth' should match 'authentication' in doc 10
results = engine.fuzzy_search("auth", top_k=10)
ids = [r[0] for r in results]
assert 10 in ids, "prefix 'auth' should match doc 10 with 'authentication'"
# ---------------------------------------------------------------------------
# RRF fusion tests
# ---------------------------------------------------------------------------
def test_rrf_fusion_ordering():
"""When two sources agree on top-1, it should rank first in fused result."""
source_a = [(1, 0.9), (2, 0.5), (3, 0.2)]
source_b = [(1, 0.8), (3, 0.6), (2, 0.1)]
fused = reciprocal_rank_fusion({"a": source_a, "b": source_b})
    assert fused[0][0] == 1, "doc 1, ranked top by both sources, must come first"
def test_rrf_equal_weight_default():
"""Calling with None weights should use DEFAULT_WEIGHTS shape (not crash)."""
source_exact = [(5, 1.0), (6, 0.8)]
source_vector = [(6, 0.9), (5, 0.7)]
# Should not raise and should return results
fused = reciprocal_rank_fusion(
{"exact": source_exact, "vector": source_vector},
weights=None,
)
assert len(fused) == 2
ids = [r[0] for r in fused]
assert 5 in ids and 6 in ids
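For reference, RRF scores each document as the sum of `weight / (k + rank)` over the sources that ranked it; a minimal sketch assuming k=60 and equal weights (the real `reciprocal_rank_fusion` signature may differ):

```python
def rrf_sketch(
    sources: dict[str, list[tuple[int, float]]], k: int = 60
) -> list[tuple[int, float]]:
    """Fuse (doc_id, score) rankings by summing 1 / (k + rank) per source."""
    fused: dict[int, float] = {}
    for ranking in sources.values():
        # Raw scores are ignored; only each document's rank position matters.
        for rank, (doc_id, _score) in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

Because only ranks contribute, a document that both sources place first always wins, which is exactly what `test_rrf_fusion_ordering` asserts.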
# ---------------------------------------------------------------------------
# detect_query_intent tests
# ---------------------------------------------------------------------------
def test_detect_intent_code_symbol():
assert detect_query_intent("def authenticate()") == QueryIntent.CODE_SYMBOL
def test_detect_intent_natural():
assert detect_query_intent("how do I authenticate users") == QueryIntent.NATURAL_LANGUAGE
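One plausible heuristic behind `detect_query_intent` (an assumption — the shipped rules may be richer): treat code-like punctuation or keywords as a symbol query and everything else as natural language:

```python
import re

# Hypothetical string labels standing in for the QueryIntent enum members
CODE_SYMBOL = "code_symbol"
NATURAL_LANGUAGE = "natural_language"


def intent_sketch(query: str) -> str:
    """Classify a query as code-symbol-like or natural language."""
    # Brackets, snake_case underscores, scope/arrow operators, or definition
    # keywords all suggest the user pasted a code symbol.
    code_pattern = re.compile(r"[(){}\[\]_]|::|->|\b(def|class|fn|func)\b")
    return CODE_SYMBOL if code_pattern.search(query) else NATURAL_LANGUAGE
```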
# ---------------------------------------------------------------------------
# SearchPipeline tests
# ---------------------------------------------------------------------------
def _make_pipeline(fts: FTSEngine, top_k: int = 5) -> SearchPipeline:
"""Build a SearchPipeline with mocked heavy components."""
cfg = Config.small()
cfg.reranker_top_k = top_k
embedder = MagicMock()
embedder.embed.return_value = [[0.1] * cfg.embed_dim]
binary_store = MagicMock()
binary_store.coarse_search.return_value = ([1, 2, 3], None)
ann_index = MagicMock()
ann_index.fine_search.return_value = ([1, 2, 3], [0.9, 0.8, 0.7])
reranker = MagicMock()
# Return a score for each content string passed
reranker.score_pairs.side_effect = lambda q, contents: [0.9 - i * 0.1 for i in range(len(contents))]
return SearchPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
reranker=reranker,
fts=fts,
config=cfg,
)
def test_pipeline_search_returns_results():
docs = [
(1, "a.py", "test content alpha"),
(2, "b.py", "test content beta"),
(3, "c.py", "test content gamma"),
]
fts = make_fts(docs)
pipeline = _make_pipeline(fts)
results = pipeline.search("test")
assert len(results) > 0
assert all(isinstance(r, SearchResult) for r in results)
def test_pipeline_top_k_limit():
docs = [
(1, "a.py", "hello world one"),
(2, "b.py", "hello world two"),
(3, "c.py", "hello world three"),
(4, "d.py", "hello world four"),
(5, "e.py", "hello world five"),
]
fts = make_fts(docs)
pipeline = _make_pipeline(fts, top_k=2)
results = pipeline.search("hello", top_k=2)
assert len(results) <= 2, "pipeline must respect top_k limit"