CodexLens Hybrid Search Architecture Design

Version: 1.0
Date: 2025-12-15
Authors: Gemini + Qwen + Claude (Collaborative Design)
Status: Design Proposal


Executive Summary

This design targets the current pain points of CodexLens: poor text search quality, garbled (mojibake) output, and the lack of incremental indexing. Drawing on design ideas from Codanna (Tantivy N-gram + compound ranking) and Code-Index-MCP (dual index + AST parsing), it proposes a new Dual-FTS Hybrid Search architecture.

Core Improvements

| Problem | Current state | Target solution |
| --- | --- | --- |
| Garbled text (mojibake) | errors="ignore" silently drops bytes | chardet encoding detection + errors="replace" |
| Poor search quality | single unicode61 tokenizer | Dual-FTS (exact + trigram fuzzy) |
| No fuzzy search | BM25 exact matching only | compound ranking (exact + fuzzy + prefix) |
| Redundant indexing | full rebuild every run | mtime-based incremental detection |
| FTS/semantic split | FTS and vector search run independently | RRF hybrid fusion |

Part 1: Architecture Overview

1.1 Target Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                         User Query: "auth login"                        │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                       Query Preprocessor (NEW)                          │
│  • CamelCase split: UserAuth → "UserAuth" OR "User Auth"                │
│  • snake_case split: user_auth → "user_auth" OR "user auth"             │
│  • Encoding normalization                                                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│   FTS Exact Search   │ │   FTS Fuzzy Search   │ │   Vector Search      │
│   (files_fts_exact)  │ │   (files_fts_fuzzy)  │ │   (VectorStore)      │
│   unicode61 + '_'    │ │   trigram tokenizer  │ │   Cosine similarity  │
│   BM25 scoring       │ │   Substring match    │ │   0.0 - 1.0 range    │
└──────────────────────┘ └──────────────────────┘ └──────────────────────┘
            │                       │                       │
            │     Results E         │     Results F         │    Results V
            └───────────────────────┼───────────────────────┘
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    Ranking Fusion Engine (NEW)                          │
│  • Reciprocal Rank Fusion (RRF): score = Σ 1/(k + rank_i)               │
│  • Score normalization (BM25 unbounded → 0-1)                           │
│  • Weighted linear fusion: w1*exact + w2*fuzzy + w3*vector              │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Final Sorted Results                            │
└─────────────────────────────────────────────────────────────────────────┘

1.2 Component Architecture

codexlens/
├── storage/
│   ├── schema.py          # (NEW) Centralized schema definitions
│   ├── dir_index.py       # (MODIFY) Add Dual-FTS, incremental indexing
│   ├── sqlite_store.py    # (MODIFY) Add encoding detection
│   └── migrations/
│       └── migration_004_dual_fts.py  # (NEW) Schema migration
│
├── search/
│   ├── hybrid_search.py   # (NEW) HybridSearchEngine
│   ├── ranking.py         # (NEW) RRF and fusion algorithms
│   ├── query_parser.py    # (NEW) Query preprocessing
│   └── chain_search.py    # (MODIFY) Integrate hybrid search
│
├── parsers/
│   └── encoding.py        # (NEW) Encoding detection utility
│
└── semantic/
    └── vector_store.py    # (MODIFY) Integration with hybrid search

Part 2: Detailed Component Design

2.1 Encoding Detection Module

File: codexlens/parsers/encoding.py (NEW)

"""Robust encoding detection for file content."""
from pathlib import Path
from typing import Tuple

# Optional: chardet or charset-normalizer
try:
    import chardet
    HAS_CHARDET = True
except ImportError:
    HAS_CHARDET = False


def detect_encoding(content: bytes, default: str = "utf-8") -> str:
    """Detect encoding of byte content with fallback."""
    if HAS_CHARDET:
        result = chardet.detect(content[:10000])  # Sample first 10KB
        if result and result.get("confidence", 0) > 0.7:
            return result["encoding"] or default
    return default


def read_file_safe(path: Path) -> Tuple[str, str]:
    """Read file with encoding detection.
    
    Returns:
        Tuple of (content, detected_encoding)
    """
    raw_bytes = path.read_bytes()
    encoding = detect_encoding(raw_bytes)
    
    try:
        content = raw_bytes.decode(encoding, errors="replace")
    except (UnicodeDecodeError, LookupError):
        content = raw_bytes.decode("utf-8", errors="replace")
        encoding = "utf-8"
    
    return content, encoding

Integration Point: dir_index.py:add_file(), index_tree.py:_build_single_dir()
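
A minimal before/after sketch of that integration (assumes a Path named path in scope; the logging call is illustrative, not part of the existing code):

import logging

log = logging.getLogger(__name__)

# Before: undecodable bytes were silently dropped, producing mojibake
# content = path.read_text(encoding="utf-8", errors="ignore")

# After: detect the encoding first, then decode with errors="replace"
content, encoding = read_file_safe(path)
if "\ufffd" in content:
    log.warning("Replacement chars in %s (detected encoding: %s)", path, encoding)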


2.2 Dual-FTS Schema Design

File: codexlens/storage/schema.py (NEW)

"""Centralized database schema definitions for Dual-FTS architecture."""

# Schema version for migration tracking
SCHEMA_VERSION = 4

# Standard FTS5 for exact matching (code symbols, identifiers)
FTS_EXACT_SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_exact USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="unicode61 tokenchars '_-'"
)
"""

# Trigram FTS5 for fuzzy/substring matching (requires SQLite 3.34+)
FTS_FUZZY_SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_fuzzy USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="trigram"
)
"""

# Fallback if trigram not available
FTS_FUZZY_FALLBACK = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_fuzzy USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="unicode61 tokenchars '_-' separators '.'"
)
"""

def check_trigram_support(conn) -> bool:
    """Check if SQLite supports trigram tokenizer."""
    try:
        conn.execute("CREATE VIRTUAL TABLE _test_trigram USING fts5(x, tokenize='trigram')")
        conn.execute("DROP TABLE _test_trigram")
        return True
    except Exception:
        return False


def create_dual_fts_schema(conn) -> dict:
    """Create Dual-FTS tables with fallback.
    
    Returns:
        dict with 'exact_table', 'fuzzy_table', 'trigram_enabled' keys
    """
    result = {"exact_table": "files_fts_exact", "fuzzy_table": "files_fts_fuzzy"}
    
    # Create exact FTS (always available)
    conn.execute(FTS_EXACT_SCHEMA)
    
    # Create fuzzy FTS (with trigram if supported)
    if check_trigram_support(conn):
        conn.execute(FTS_FUZZY_SCHEMA)
        result["trigram_enabled"] = True
    else:
        conn.execute(FTS_FUZZY_FALLBACK)
        result["trigram_enabled"] = False
    
    # Create triggers for dual-table sync
    conn.execute("""
        CREATE TRIGGER IF NOT EXISTS files_ai_exact AFTER INSERT ON files BEGIN
            INSERT INTO files_fts_exact(rowid, name, full_path, content) 
            VALUES (new.id, new.name, new.full_path, new.content);
        END
    """)
    conn.execute("""
        CREATE TRIGGER IF NOT EXISTS files_ai_fuzzy AFTER INSERT ON files BEGIN
            INSERT INTO files_fts_fuzzy(rowid, name, full_path, content) 
            VALUES (new.id, new.name, new.full_path, new.content);
        END
    """)
    # ... similar triggers for UPDATE and DELETE
    
    return result
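
A sketch of the elided UPDATE/DELETE triggers, following SQLite's documented external-content FTS5 pattern (the special 'delete' command). Trigger names files_ad_exact/files_au_exact are hypothetical; shown for the exact table only, the fuzzy table needs the mirror pair:

conn.execute("""
    CREATE TRIGGER IF NOT EXISTS files_ad_exact AFTER DELETE ON files BEGIN
        INSERT INTO files_fts_exact(files_fts_exact, rowid, name, full_path, content)
        VALUES ('delete', old.id, old.name, old.full_path, old.content);
    END
""")
conn.execute("""
    CREATE TRIGGER IF NOT EXISTS files_au_exact AFTER UPDATE ON files BEGIN
        INSERT INTO files_fts_exact(files_fts_exact, rowid, name, full_path, content)
        VALUES ('delete', old.id, old.name, old.full_path, old.content);
        INSERT INTO files_fts_exact(rowid, name, full_path, content)
        VALUES (new.id, new.name, new.full_path, new.content);
    END
""")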

2.3 Hybrid Search Engine

File: codexlens/search/hybrid_search.py (NEW)

"""Hybrid search engine combining FTS and semantic search with RRF fusion."""
from dataclasses import dataclass
from typing import List, Optional
from concurrent.futures import ThreadPoolExecutor

from codexlens.entities import SearchResult
from codexlens.search.ranking import reciprocal_rank_fusion, normalize_scores


@dataclass
class HybridSearchConfig:
    """Configuration for hybrid search."""
    enable_exact: bool = True
    enable_fuzzy: bool = True
    enable_vector: bool = True
    exact_weight: float = 0.4
    fuzzy_weight: float = 0.3
    vector_weight: float = 0.3
    rrf_k: int = 60  # RRF constant
    max_results: int = 20


class HybridSearchEngine:
    """Multi-modal search with RRF fusion."""
    
    def __init__(self, dir_index_store, vector_store=None, config: Optional[HybridSearchConfig] = None):
        self.store = dir_index_store
        self.vector_store = vector_store
        self.config = config or HybridSearchConfig()
    
    def search(self, query: str, limit: int = 20) -> List[SearchResult]:
        """Execute hybrid search with parallel retrieval and RRF fusion."""
        results_map = {}
        
        # Parallel retrieval
        with ThreadPoolExecutor(max_workers=3) as executor:
            futures = {}
            
            if self.config.enable_exact:
                futures["exact"] = executor.submit(
                    self._search_exact, query, limit * 2
                )
            if self.config.enable_fuzzy:
                futures["fuzzy"] = executor.submit(
                    self._search_fuzzy, query, limit * 2
                )
            if self.config.enable_vector and self.vector_store:
                futures["vector"] = executor.submit(
                    self._search_vector, query, limit * 2
                )
            
            for name, future in futures.items():
                try:
                    results_map[name] = future.result(timeout=10)
                except Exception:
                    results_map[name] = []
        
        # Apply RRF fusion
        fused = reciprocal_rank_fusion(
            results_map,
            weights={
                "exact": self.config.exact_weight,
                "fuzzy": self.config.fuzzy_weight,
                "vector": self.config.vector_weight,
            },
            k=self.config.rrf_k
        )
        
        return fused[:limit]
    
    def _search_exact(self, query: str, limit: int) -> List[SearchResult]:
        """Exact FTS search with BM25."""
        return self.store.search_fts_exact(query, limit)
    
    def _search_fuzzy(self, query: str, limit: int) -> List[SearchResult]:
        """Fuzzy FTS search with trigram."""
        return self.store.search_fts_fuzzy(query, limit)
    
    def _search_vector(self, query: str, limit: int) -> List[SearchResult]:
        """Semantic vector search."""
        if not self.vector_store:
            return []
        return self.vector_store.search_similar(query, limit)
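
A smoke-test sketch with a stub store (the SearchResult keyword constructor follows its usage in ranking.py below; real callers would pass a DirIndexStore):

from codexlens.entities import SearchResult

class _StubStore:
    """Stands in for DirIndexStore in this example."""
    def search_fts_exact(self, query, limit):
        return [SearchResult(path="src/auth/login.py", score=-12.5, excerpt="def login():")]
    def search_fts_fuzzy(self, query, limit):
        return [SearchResult(path="src/auth/session.py", score=0.9, excerpt="# session auth")]

engine = HybridSearchEngine(_StubStore(), config=HybridSearchConfig(enable_vector=False))
for r in engine.search("auth login", limit=10):
    print(f"{r.score:.4f}  {r.path}")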

2.4 RRF Ranking Fusion

File: codexlens/search/ranking.py (NEW)

"""Ranking fusion algorithms for hybrid search."""
from typing import Dict, List
from collections import defaultdict

from codexlens.entities import SearchResult


def reciprocal_rank_fusion(
    results_map: Dict[str, List[SearchResult]],
    weights: Dict[str, float] = None,
    k: int = 60
) -> List[SearchResult]:
    """Reciprocal Rank Fusion (RRF) algorithm.
    
    Formula: score(d) = Σ weight_i / (k + rank_i(d))
    
    Args:
        results_map: Dict mapping source name to ranked results
        weights: Optional weights per source (default equal)
        k: RRF constant (default 60)
    
    Returns:
        Fused and re-ranked results
    """
    if weights is None:
        weights = {name: 1.0 for name in results_map}
    
    # Normalize weights
    total_weight = sum(weights.values())
    weights = {name: w / total_weight for name, w in weights.items()}
    
    # Calculate RRF scores
    rrf_scores = defaultdict(float)
    path_to_result = {}
    
    for source_name, results in results_map.items():
        weight = weights.get(source_name, 1.0)
        for rank, result in enumerate(results, start=1):
            rrf_scores[result.path] += weight / (k + rank)
            if result.path not in path_to_result:
                path_to_result[result.path] = result
    
    # Sort by RRF score
    sorted_paths = sorted(rrf_scores.keys(), key=lambda p: rrf_scores[p], reverse=True)
    
    # Build final results with updated scores
    fused_results = []
    for path in sorted_paths:
        result = path_to_result[path]
        fused_results.append(SearchResult(
            path=result.path,
            score=rrf_scores[path],
            excerpt=result.excerpt,
        ))
    
    return fused_results


def normalize_bm25_score(score: float, max_score: float = 100.0) -> float:
    """Normalize BM25 score to 0-1 range.
    
    BM25 scores are unbounded and typically negative in SQLite FTS5.
    This normalizes them for fusion with other score types.
    """
    if score >= 0:
        return 0.0
    # BM25 in SQLite is negative; more negative = better match
    return min(1.0, abs(score) / max_score)
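
A worked example of the formula: two sources, default equal weights (normalized to 0.5 each), k = 60. b.py ranks 2nd in exact and 1st in fuzzy and edges out a.py, which ranks 1st in exact but only 3rd in fuzzy:

results_map = {
    "exact": [
        SearchResult(path="a.py", score=-10.0, excerpt=""),
        SearchResult(path="b.py", score=-8.0, excerpt=""),
    ],
    "fuzzy": [
        SearchResult(path="b.py", score=0.9, excerpt=""),
        SearchResult(path="c.py", score=0.8, excerpt=""),
        SearchResult(path="a.py", score=0.7, excerpt=""),
    ],
}
fused = reciprocal_rank_fusion(results_map)
# b.py: 0.5/(60+2) + 0.5/(60+1) ≈ 0.016261
# a.py: 0.5/(60+1) + 0.5/(60+3) ≈ 0.016133
# c.py: 0.5/(60+2)              ≈ 0.008065
assert [r.path for r in fused] == ["b.py", "a.py", "c.py"]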

2.5 Incremental Indexing

File: codexlens/storage/dir_index.py (MODIFY)

# Add to DirIndexStore class:

def needs_reindex(self, path: Path) -> bool:
    """Check if file needs re-indexing based on mtime.
    
    Returns:
        True if file should be reindexed, False to skip
    """
    with self._lock:
        conn = self._get_connection()
        row = conn.execute(
            "SELECT mtime FROM files WHERE full_path = ?",
            (str(path.resolve()),)
        ).fetchone()
        
        if row is None:
            return True  # New file
        
        stored_mtime = row["mtime"]
        if stored_mtime is None:
            return True
        
        try:
            current_mtime = path.stat().st_mtime
            # Allow 1ms tolerance for floating point comparison
            return abs(current_mtime - stored_mtime) > 0.001
        except OSError:
            return False  # File doesn't exist anymore


def add_file_incremental(
    self,
    file_path: Path,
    content: str,
    indexed_file: IndexedFile,
) -> Optional[int]:
    """Add file to index only if changed.
    
    Returns:
        file_id if indexed, None if skipped
    """
    if not self.needs_reindex(file_path):
        # Return existing file_id without re-indexing
        with self._lock:
            conn = self._get_connection()
            row = conn.execute(
                "SELECT id FROM files WHERE full_path = ?",
                (str(file_path.resolve()),)
            ).fetchone()
            return int(row["id"]) if row else None
    
    # Proceed with full indexing
    return self.add_file(file_path, content, indexed_file)
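
A sketch of how IndexTreeBuilder could drive this (parse_file and store.commit are hypothetical stand-ins for the builder's existing parse step and batch-commit hook; batching is discussed in Part 6.1):

skipped = indexed = 0
for path in candidate_files:
    if not store.needs_reindex(path):
        skipped += 1
        continue
    content, _encoding = read_file_safe(path)   # encoding-safe read (see 2.1)
    indexed_file = parse_file(path, content)    # hypothetical parser entry point
    store.add_file_incremental(path, content, indexed_file)
    indexed += 1
    if indexed % 100 == 0:
        store.commit()                          # batch commits (Part 6.1)
store.commit()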

2.6 Query Preprocessor

File: codexlens/search/query_parser.py (NEW)

"""Query preprocessing for improved search recall."""
import re
from typing import List


def split_camel_case(text: str) -> List[str]:
    """Split CamelCase into words: UserAuth -> ['User', 'Auth']"""
    return re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', text)


def split_snake_case(text: str) -> List[str]:
    """Split snake_case into words: user_auth -> ['user', 'auth']"""
    return text.split('_')


def preprocess_query(query: str) -> str:
    """Preprocess query for better recall.
    
    Transforms:
    - UserAuth -> "UserAuth" OR "User Auth"
    - user_auth -> "user_auth" OR "user auth"
    """
    terms = []
    
    for word in query.split():
        # Handle CamelCase
        if re.match(r'^[A-Z][a-z]+[A-Z]', word):
            parts = split_camel_case(word)
            terms.append(f'"{word}"')  # Original
            terms.append(f'"{" ".join(parts)}"')  # Split
        
        # Handle snake_case
        elif '_' in word:
            parts = split_snake_case(word)
            terms.append(f'"{word}"')  # Original
            terms.append(f'"{" ".join(parts)}"')  # Split
        
        else:
            terms.append(word)
    
    # Combine with OR for recall
    return " OR ".join(terms) if len(terms) > 1 else terms[0]

Part 3: Database Schema Changes

3.1 New Tables

-- Exact FTS table (code-friendly tokenizer)
CREATE VIRTUAL TABLE files_fts_exact USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="unicode61 tokenchars '_-'"
);

-- Fuzzy FTS table (trigram for substring matching)
CREATE VIRTUAL TABLE files_fts_fuzzy USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="trigram"
);

-- File hash for robust change detection (optional enhancement)
ALTER TABLE files ADD COLUMN content_hash TEXT;
CREATE INDEX idx_files_hash ON files(content_hash);
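
A minimal sketch of populating content_hash (SHA-256 over raw bytes is an assumption; any stable digest works), keyed by full_path as in the schema above:

import hashlib

def content_hash(raw_bytes: bytes) -> str:
    """Content fingerprint that survives mtime-only touches."""
    return hashlib.sha256(raw_bytes).hexdigest()

conn.execute(
    "UPDATE files SET content_hash = ? WHERE full_path = ?",
    (content_hash(path.read_bytes()), str(path.resolve())),
)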

3.2 Migration Script

File: codexlens/storage/migrations/migration_004_dual_fts.py (NEW)

"""Migration 004: Dual-FTS architecture."""

def upgrade(db_conn):
    """Upgrade to Dual-FTS schema."""
    cursor = db_conn.cursor()
    
    # Check current schema
    tables = cursor.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'files_fts%'"
    ).fetchall()
    existing = {t[0] for t in tables}
    
    # Drop legacy single FTS table
    if "files_fts" in existing and "files_fts_exact" not in existing:
        cursor.execute("DROP TABLE IF EXISTS files_fts")
    
    # Create new Dual-FTS tables
    from codexlens.storage.schema import create_dual_fts_schema
    result = create_dual_fts_schema(db_conn)
    
    # Rebuild indexes from existing content
    cursor.execute("""
        INSERT INTO files_fts_exact(rowid, name, full_path, content)
        SELECT id, name, full_path, content FROM files
    """)
    cursor.execute("""
        INSERT INTO files_fts_fuzzy(rowid, name, full_path, content)
        SELECT id, name, full_path, content FROM files
    """)
    
    db_conn.commit()
    return result
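
A sketch of the calling convention, with the file-copy backup the Part 7 risk table calls for (index.db is a hypothetical database location):

import shutil
import sqlite3

db_path = "index.db"
shutil.copy2(db_path, db_path + ".bak")  # backup before migration (Part 7)

conn = sqlite3.connect(db_path)
try:
    info = upgrade(conn)
    print("trigram enabled:", info["trigram_enabled"])
finally:
    conn.close()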

Part 4: API Contracts

4.1 Search API

# New unified search interface
from dataclasses import dataclass
from typing import List

from codexlens.entities import SearchResult


@dataclass
class SearchOptions:
    query: str
    limit: int = 20
    offset: int = 0
    enable_exact: bool = True      # FTS exact matching
    enable_fuzzy: bool = True      # Trigram fuzzy matching
    enable_vector: bool = False    # Semantic vector search
    exact_weight: float = 0.4
    fuzzy_weight: float = 0.3
    vector_weight: float = 0.3


# API endpoint signature
def search(options: SearchOptions) -> "SearchResponse":
    """Unified hybrid search."""
    ...


@dataclass
class SearchResponse:
    results: List[SearchResult]
    total: int
    search_modes: List[str]  # e.g. ["exact", "fuzzy", "vector"]
    trigram_available: bool
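
Intended call shape (search is still a stub above, so this is illustrative only):

resp = search(SearchOptions(query="auth login", enable_vector=True))
for r in resp.results[:5]:
    print(r.path, r.score)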

4.2 Indexing API

# Enhanced indexing with incremental support
from dataclasses import dataclass
from pathlib import Path


@dataclass
class IndexOptions:
    path: Path
    incremental: bool = True      # Skip unchanged files
    force: bool = False           # Force reindex of all files
    detect_encoding: bool = True  # Auto-detect file encoding


def index_directory(options: IndexOptions) -> "IndexResult":
    """Index a directory with incremental support."""
    ...


@dataclass
class IndexResult:
    total_files: int
    indexed_files: int
    skipped_files: int   # Unchanged files that were skipped
    encoding_errors: int

Part 5: Implementation Roadmap

Phase 1: Foundation (Week 1)

  • Implement encoding detection module
  • Update file reading in dir_index.py and index_tree.py
  • Add chardet/charset-normalizer dependency
  • Write unit tests for encoding detection

Phase 2: Dual-FTS (Week 2)

  • Create schema.py with Dual-FTS definitions
  • Implement trigram compatibility check
  • Write migration script
  • Update DirIndexStore with dual search methods
  • Test FTS5 trigram on target platforms

Phase 3: Hybrid Search (Week 3)

  • Implement HybridSearchEngine
  • Implement ranking.py with RRF
  • Create query_parser.py
  • Integrate with ChainSearchEngine
  • Write integration tests

Phase 4: Incremental Indexing (Week 4)

  • Add needs_reindex() method
  • Implement add_file_incremental()
  • Update IndexTreeBuilder to use incremental API
  • Add optional content hash column
  • Performance benchmarking

Phase 5: Vector Integration (Week 5)

  • Update VectorStore for hybrid integration
  • Implement vector search in HybridSearchEngine
  • Tune RRF weights for optimal results
  • End-to-end testing

Part 6: Performance Considerations

6.1 Indexing Performance

  • Incremental indexing: Skip ~90% of files on re-index
  • Parallel file processing: ThreadPoolExecutor for parsing
  • Batch commits: Commit every 100 files to reduce I/O

6.2 Search Performance

  • Parallel retrieval: Execute FTS + Vector searches concurrently
  • Early termination: Stop after finding enough high-confidence matches
  • Result caching: LRU cache for frequent queries (sketch below)
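
A minimal LRU-cache sketch over HybridSearchEngine.search (the wrapper class is an assumption, not existing API; call invalidate() after re-indexing to avoid stale hits):

import functools

class CachedHybridSearch:
    """LRU-cached wrapper around HybridSearchEngine.search."""

    def __init__(self, engine, maxsize: int = 256):
        self._engine = engine
        # lru_cache needs hashable args and return values, hence the tuple
        self._cached = functools.lru_cache(maxsize=maxsize)(self._search)

    def _search(self, query: str, limit: int) -> tuple:
        return tuple(self._engine.search(query, limit))

    def search(self, query: str, limit: int = 20) -> list:
        return list(self._cached(query, limit))

    def invalidate(self) -> None:
        """Drop cached results after an index update."""
        self._cached.cache_clear()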

6.3 Storage Overhead

  • Dual-FTS: ~2x FTS index size (exact + fuzzy)
  • Trigram: ~3-5x content size (due to trigram expansion)
  • Mitigation: Optional fuzzy index, configurable per project

Part 7: Risk Assessment

| Risk | Probability | Impact | Mitigation |
| --- | --- | --- | --- |
| SQLite trigram not available | Medium | High | Fall back to extended unicode61 |
| Performance degradation | Low | Medium | Parallel search, caching |
| Migration data loss | Low | High | Back up before migration |
| Encoding detection false positives | Medium | Low | Use replace mode, log warnings |

Appendix: Reference Project Learnings

From Codanna (Rust)

  • N-gram tokenizer (3-10): Enables partial matching for code symbols
  • Compound BooleanQuery: Combines exact + fuzzy + prefix in single query
  • File hash change detection: More robust than mtime alone

From Code-Index-MCP (Python)

  • Dual-index architecture: Fast shallow index + rich deep index
  • External tool integration: Wrap ripgrep for performance
  • AST-based parsing: Single-pass symbol extraction
  • ReDoS protection: Validate regex patterns before execution