feat: Enhance navigation and cleanup for graph explorer view

- Added a cleanup function to reset the state when navigating away from the graph explorer.
- Updated navigation logic to call the cleanup function before switching views.
- Improved internationalization by adding new translations for graph-related terms.
- Adjusted icon sizes for better UI consistency in the graph explorer.
- Implemented impact analysis button functionality in the graph explorer.
- Refactored CLI tool configuration to use updated model names.
- Enhanced CLI executor to handle prompts correctly for codex commands.
- Introduced code relationship storage for better visualization in the index tree.
- Added support for parsing Markdown and plain text files in the symbol parser.
- Updated tests to reflect changes in language detection logic.
catlog22
2025-12-15 23:11:01 +08:00
parent 894b93e08d
commit 35485bbbb1
35 changed files with 3348 additions and 228 deletions

@@ -0,0 +1,711 @@
# CodexLens Hybrid Search Architecture Design
> **Version**: 1.0
> **Date**: 2025-12-15
> **Authors**: Gemini + Qwen + Claude (Collaborative Design)
> **Status**: Design Proposal
---
## Executive Summary
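This design addresses CodexLens's current pain points: poor text search quality, garbled (mojibake) text, and the lack of incremental indexing. It draws on the design ideas of **Codanna** (Tantivy N-gram + compound ranking) and **Code-Index-MCP** (dual index + AST parsing) to propose a new **Dual-FTS Hybrid Search** architecture.
### Key Improvements
| Problem | Current state | Target solution |
|------|------|----------|
| Garbled text | `errors="ignore"` drops bytes | chardet encoding detection + `errors="replace"` |
| Poor search quality | Single unicode61 tokenizer | Dual-FTS (exact + trigram fuzzy) |
| No fuzzy search | BM25 exact matching only | Compound ranking (Exact + Fuzzy + Prefix) |
| Redundant indexing | Full rebuild on every run | mtime-based incremental detection |
| Semantic disconnect | FTS and vector search run independently | RRF hybrid fusion |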
---
## Part 1: Architecture Overview
### 1.1 Target Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────────┐
│                        User Query: "auth login"                         │
└────────────────────────────────────┬────────────────────────────────────┘
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        Query Preprocessor (NEW)                         │
│  • CamelCase split: UserAuth → "UserAuth" OR "User Auth"                │
│  • snake_case split: user_auth → "user_auth" OR "user auth"             │
│  • Encoding normalization                                               │
└────────────────────────────────────┬────────────────────────────────────┘
           ┌─────────────────────────┼─────────────────────────┐
           ▼                         ▼                         ▼
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│ FTS Exact Search    │   │ FTS Fuzzy Search    │   │ Vector Search       │
│ (files_fts_exact)   │   │ (files_fts_fuzzy)   │   │ (VectorStore)       │
│ unicode61 + '_'     │   │ trigram tokenizer   │   │ Cosine similarity   │
│ BM25 scoring        │   │ Substring match     │   │ 0.0 - 1.0 range     │
└──────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
           │ Results E               │ Results F               │ Results V
           └─────────────────────────┼─────────────────────────┘
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                       Ranking Fusion Engine (NEW)                       │
│  • Reciprocal Rank Fusion (RRF): score = Σ 1/(k + rank_i)               │
│  • Score normalization (BM25 unbounded → 0-1)                           │
│  • Weighted linear fusion: w1*exact + w2*fuzzy + w3*vector              │
└────────────────────────────────────┬────────────────────────────────────┘
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                          Final Sorted Results                           │
└─────────────────────────────────────────────────────────────────────────┘
```
### 1.2 Component Architecture
```
codexlens/
├── storage/
│   ├── schema.py  # (NEW) Centralized schema definitions
│   ├── dir_index.py  # (MODIFY) Add Dual-FTS, incremental indexing
│   ├── sqlite_store.py  # (MODIFY) Add encoding detection
│   └── migrations/
│       └── migration_004_dual_fts.py  # (NEW) Schema migration
├── search/
│   ├── hybrid_search.py  # (NEW) HybridSearchEngine
│   ├── ranking.py  # (NEW) RRF and fusion algorithms
│   ├── query_parser.py  # (NEW) Query preprocessing
│   └── chain_search.py  # (MODIFY) Integrate hybrid search
├── parsers/
│   └── encoding.py  # (NEW) Encoding detection utility
└── semantic/
    └── vector_store.py  # (MODIFY) Integration with hybrid search
```
---
## Part 2: Detailed Component Design
### 2.1 Encoding Detection Module
**File**: `codexlens/parsers/encoding.py` (NEW)
```python
"""Robust encoding detection for file content."""
from pathlib import Path
from typing import Tuple

# Optional dependency: chardet or charset-normalizer
try:
    import chardet
    HAS_CHARDET = True
except ImportError:
    HAS_CHARDET = False


def detect_encoding(content: bytes, default: str = "utf-8") -> str:
    """Detect encoding of byte content with fallback."""
    if HAS_CHARDET:
        result = chardet.detect(content[:10000])  # Sample the first 10,000 bytes
        if result and result.get("confidence", 0) > 0.7:
            return result["encoding"] or default
    return default


def read_file_safe(path: Path) -> Tuple[str, str]:
    """Read file with encoding detection.

    Returns:
        Tuple of (content, detected_encoding)
    """
    raw_bytes = path.read_bytes()
    encoding = detect_encoding(raw_bytes)
    try:
        content = raw_bytes.decode(encoding, errors="replace")
    except (UnicodeDecodeError, LookupError):
        # LookupError: the detected codec name is unknown to Python
        content = raw_bytes.decode("utf-8", errors="replace")
        encoding = "utf-8"
    return content, encoding
```
**Integration Point**: `dir_index.py:add_file()`, `index_tree.py:_build_single_dir()`
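To illustrate why the design moves from `errors="ignore"` to encoding detection plus `errors="replace"`, here is a minimal standard-library sketch. The GBK sample bytes and the variable names are illustrative assumptions, not taken from the CodexLens codebase.

```python
# Illustrative only: GBK-encoded bytes read by a UTF-8 decoder.
raw = b"# " + "中文".encode("gbk") + b" ok"

ignored = raw.decode("utf-8", errors="ignore")    # undecodable bytes vanish silently
replaced = raw.decode("utf-8", errors="replace")  # undecodable bytes are marked with U+FFFD
correct = raw.decode("gbk")                       # the right codec recovers the text

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

With `ignore`, the damage is invisible in the index; with `replace`, it is at least visible and searchable, and with a correctly detected codec there is no damage at all.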
---
### 2.2 Dual-FTS Schema Design
**File**: `codexlens/storage/schema.py` (NEW)
```python
"""Centralized database schema definitions for Dual-FTS architecture."""
# Schema version for migration tracking
SCHEMA_VERSION = 4

# Standard FTS5 for exact matching (code symbols, identifiers)
FTS_EXACT_SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_exact USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="unicode61 tokenchars '_-'"
)
"""

# Trigram FTS5 for fuzzy/substring matching (requires SQLite 3.34+)
FTS_FUZZY_SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_fuzzy USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="trigram"
)
"""

# Fallback if trigram not available
FTS_FUZZY_FALLBACK = """
CREATE VIRTUAL TABLE IF NOT EXISTS files_fts_fuzzy USING fts5(
    name, full_path UNINDEXED, content,
    content='files',
    content_rowid='id',
    tokenize="unicode61 tokenchars '_-' separators '.'"
)
"""


def check_trigram_support(conn) -> bool:
    """Check if SQLite supports trigram tokenizer."""
    try:
        conn.execute("CREATE VIRTUAL TABLE _test_trigram USING fts5(x, tokenize='trigram')")
        conn.execute("DROP TABLE _test_trigram")
        return True
    except Exception:
        return False


def create_dual_fts_schema(conn) -> dict:
    """Create Dual-FTS tables with fallback.

    Returns:
        dict with 'exact_table', 'fuzzy_table', 'trigram_enabled' keys
    """
    result = {"exact_table": "files_fts_exact", "fuzzy_table": "files_fts_fuzzy"}

    # Create exact FTS (always available)
    conn.execute(FTS_EXACT_SCHEMA)

    # Create fuzzy FTS (with trigram if supported)
    if check_trigram_support(conn):
        conn.execute(FTS_FUZZY_SCHEMA)
        result["trigram_enabled"] = True
    else:
        conn.execute(FTS_FUZZY_FALLBACK)
        result["trigram_enabled"] = False

    # Create triggers for dual-table sync
    conn.execute("""
        CREATE TRIGGER IF NOT EXISTS files_ai_exact AFTER INSERT ON files BEGIN
            INSERT INTO files_fts_exact(rowid, name, full_path, content)
            VALUES (new.id, new.name, new.full_path, new.content);
        END
    """)
    conn.execute("""
        CREATE TRIGGER IF NOT EXISTS files_ai_fuzzy AFTER INSERT ON files BEGIN
            INSERT INTO files_fts_fuzzy(rowid, name, full_path, content)
            VALUES (new.id, new.name, new.full_path, new.content);
        END
    """)
    # ... similar triggers for UPDATE and DELETE
    return result
```
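The UPDATE and DELETE triggers are elided above. Because both tables are external-content FTS5 tables (`content='files'`), stale index entries are normally removed with FTS5's special `'delete'` command rather than a plain `DELETE`. The following is a minimal sketch for the exact table only, assuming a simplified `files` table with just the four columns used here; the trigger names `files_ad_exact` and `files_au_exact` and the helper `match` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, full_path TEXT, content TEXT);
CREATE VIRTUAL TABLE files_fts_exact USING fts5(
    name, full_path UNINDEXED, content,
    content='files', content_rowid='id',
    tokenize="unicode61 tokenchars '_-'"
);
CREATE TRIGGER files_ai_exact AFTER INSERT ON files BEGIN
    INSERT INTO files_fts_exact(rowid, name, full_path, content)
    VALUES (new.id, new.name, new.full_path, new.content);
END;
-- Hypothetical names: remove the old entry via the FTS5 'delete' command
CREATE TRIGGER files_ad_exact AFTER DELETE ON files BEGIN
    INSERT INTO files_fts_exact(files_fts_exact, rowid, name, full_path, content)
    VALUES ('delete', old.id, old.name, old.full_path, old.content);
END;
CREATE TRIGGER files_au_exact AFTER UPDATE ON files BEGIN
    INSERT INTO files_fts_exact(files_fts_exact, rowid, name, full_path, content)
    VALUES ('delete', old.id, old.name, old.full_path, old.content);
    INSERT INTO files_fts_exact(rowid, name, full_path, content)
    VALUES (new.id, new.name, new.full_path, new.content);
END;
""")


def match(term: str) -> list:
    """Return rowids whose indexed text matches the quoted term."""
    rows = conn.execute(
        "SELECT rowid FROM files_fts_exact WHERE files_fts_exact MATCH ?",
        (f'"{term}"',),
    ).fetchall()
    return [r[0] for r in rows]


conn.execute("INSERT INTO files VALUES (1, 'a.py', '/a.py', 'def user_auth(): pass')")
after_insert = match("user_auth")
conn.execute("UPDATE files SET content = 'def login(): pass' WHERE id = 1")
after_update_old, after_update_new = match("user_auth"), match("login")
conn.execute("DELETE FROM files WHERE id = 1")
after_delete = match("login")
```

The same pair of triggers would be repeated for `files_fts_fuzzy`. The `'delete'` command must receive the old column values exactly as they were indexed, which is why the triggers pass `old.*`.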
---
### 2.3 Hybrid Search Engine
**File**: `codexlens/search/hybrid_search.py` (NEW)
```python
"""Hybrid search engine combining FTS and semantic search with RRF fusion."""
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Optional

from codexlens.entities import SearchResult
from codexlens.search.ranking import reciprocal_rank_fusion


@dataclass
class HybridSearchConfig:
    """Configuration for hybrid search."""
    enable_exact: bool = True
    enable_fuzzy: bool = True
    enable_vector: bool = True
    exact_weight: float = 0.4
    fuzzy_weight: float = 0.3
    vector_weight: float = 0.3
    rrf_k: int = 60  # RRF constant
    max_results: int = 20


class HybridSearchEngine:
    """Multi-modal search with RRF fusion."""

    def __init__(self, dir_index_store, vector_store=None,
                 config: Optional[HybridSearchConfig] = None):
        self.store = dir_index_store
        self.vector_store = vector_store
        self.config = config or HybridSearchConfig()

    def search(self, query: str, limit: int = 20) -> List[SearchResult]:
        """Execute hybrid search with parallel retrieval and RRF fusion."""
        results_map = {}

        # Parallel retrieval: fetch 2x the limit per source to give fusion some headroom
        with ThreadPoolExecutor(max_workers=3) as executor:
            futures = {}
            if self.config.enable_exact:
                futures["exact"] = executor.submit(
                    self._search_exact, query, limit * 2
                )
            if self.config.enable_fuzzy:
                futures["fuzzy"] = executor.submit(
                    self._search_fuzzy, query, limit * 2
                )
            if self.config.enable_vector and self.vector_store:
                futures["vector"] = executor.submit(
                    self._search_vector, query, limit * 2
                )

            for name, future in futures.items():
                try:
                    results_map[name] = future.result(timeout=10)
                except Exception:
                    # A failed or timed-out source degrades to an empty ranking
                    results_map[name] = []

        # Apply RRF fusion
        fused = reciprocal_rank_fusion(
            results_map,
            weights={
                "exact": self.config.exact_weight,
                "fuzzy": self.config.fuzzy_weight,
                "vector": self.config.vector_weight,
            },
            k=self.config.rrf_k,
        )
        return fused[:limit]

    def _search_exact(self, query: str, limit: int) -> List[SearchResult]:
        """Exact FTS search with BM25."""
        return self.store.search_fts_exact(query, limit)

    def _search_fuzzy(self, query: str, limit: int) -> List[SearchResult]:
        """Fuzzy FTS search with trigram."""
        return self.store.search_fts_fuzzy(query, limit)

    def _search_vector(self, query: str, limit: int) -> List[SearchResult]:
        """Semantic vector search."""
        if not self.vector_store:
            return []
        return self.vector_store.search_similar(query, limit)
```
---
### 2.4 RRF Ranking Fusion
**File**: `codexlens/search/ranking.py` (NEW)
```python
"""Ranking fusion algorithms for hybrid search."""
from collections import defaultdict
from typing import Dict, List, Optional

from codexlens.entities import SearchResult


def reciprocal_rank_fusion(
    results_map: Dict[str, List[SearchResult]],
    weights: Optional[Dict[str, float]] = None,
    k: int = 60,
) -> List[SearchResult]:
    """Reciprocal Rank Fusion (RRF) algorithm.

    Formula: score(d) = Σ weight_i / (k + rank_i(d))

    Args:
        results_map: Dict mapping source name to ranked results
        weights: Optional weights per source (default equal)
        k: RRF constant (default 60)

    Returns:
        Fused and re-ranked results
    """
    if weights is None:
        weights = {name: 1.0 for name in results_map}

    # Normalize weights; fall back to 1.0 to avoid dividing by zero
    total_weight = sum(weights.values()) or 1.0
    weights = {name: w / total_weight for name, w in weights.items()}

    # Calculate RRF scores
    rrf_scores = defaultdict(float)
    path_to_result = {}
    for source_name, results in results_map.items():
        weight = weights.get(source_name, 1.0)
        for rank, result in enumerate(results, start=1):
            rrf_scores[result.path] += weight / (k + rank)
            # Keep the first result object seen for each path
            if result.path not in path_to_result:
                path_to_result[result.path] = result

    # Sort by RRF score
    sorted_paths = sorted(rrf_scores.keys(), key=lambda p: rrf_scores[p], reverse=True)

    # Build final results with updated scores
    fused_results = []
    for path in sorted_paths:
        result = path_to_result[path]
        fused_results.append(SearchResult(
            path=result.path,
            score=rrf_scores[path],
            excerpt=result.excerpt,
        ))
    return fused_results


def normalize_bm25_score(score: float, max_score: float = 100.0) -> float:
    """Normalize BM25 score to 0-1 range.

    BM25 scores are unbounded and typically negative in SQLite FTS5.
    This normalizes them for fusion with other score types.
    """
    if score >= 0:
        return 0.0
    # BM25 in SQLite is negative; more negative = better match
    return min(1.0, abs(score) / max_score)
```
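A small worked example may make the RRF formula concrete. This sketch ranks plain path strings instead of `SearchResult` objects; the function `rrf_scores`, the file names, and the equal 0.5 weights are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_scores(rankings: Dict[str, List[str]],
               weights: Dict[str, float],
               k: int = 60) -> Dict[str, float]:
    """score(d) = sum over sources of weight / (k + rank of d in that source)."""
    scores = defaultdict(float)
    for source, paths in rankings.items():
        for rank, path in enumerate(paths, start=1):
            scores[path] += weights[source] / (k + rank)
    return dict(scores)


rankings = {
    "exact": ["auth.py", "login.py"],   # auth.py is rank 1 in the exact index
    "fuzzy": ["login.py", "oauth.py"],  # login.py is rank 1 in the fuzzy index
}
weights = {"exact": 0.5, "fuzzy": 0.5}

scores = rrf_scores(rankings, weights)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

`login.py` wins because it appears in both rankings (0.5/62 + 0.5/61), while `auth.py` (0.5/61) narrowly beats `oauth.py` (0.5/62). Only ranks enter the formula, so the unbounded BM25 scores and the 0-1 cosine scores never need to be compared directly.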
---
### 2.5 Incremental Indexing
**File**: `codexlens/storage/dir_index.py` (MODIFY)
```python
# Add to DirIndexStore class:
def needs_reindex(self, path: Path) -> bool:
    """Check if file needs re-indexing based on mtime.

    Returns:
        True if file should be reindexed, False to skip
    """
    with self._lock:
        conn = self._get_connection()
        row = conn.execute(
            "SELECT mtime FROM files WHERE full_path = ?",
            (str(path.resolve()),)
        ).fetchone()

    if row is None:
        return True  # New file
    stored_mtime = row["mtime"]
    if stored_mtime is None:
        return True

    try:
        current_mtime = path.stat().st_mtime
        # Allow 1ms tolerance for floating point comparison
        return abs(current_mtime - stored_mtime) > 0.001
    except OSError:
        return False  # File doesn't exist anymore


def add_file_incremental(
    self,
    file_path: Path,
    content: str,
    indexed_file: IndexedFile,
) -> Optional[int]:
    """Add file to index only if changed.

    Returns:
        file_id of the newly indexed file, or the existing file_id if the
        file was skipped as unchanged; None if skipped and no row exists
    """
    if not self.needs_reindex(file_path):
        # Return existing file_id without re-indexing
        with self._lock:
            conn = self._get_connection()
            row = conn.execute(
                "SELECT id FROM files WHERE full_path = ?",
                (str(file_path.resolve()),)
            ).fetchone()
        return int(row["id"]) if row else None

    # Proceed with full indexing
    return self.add_file(file_path, content, indexed_file)
```
---
### 2.6 Query Preprocessor
**File**: `codexlens/search/query_parser.py` (NEW)
```python
"""Query preprocessing for improved search recall."""
import re
from typing import List


def split_camel_case(text: str) -> List[str]:
    """Split CamelCase into words: UserAuth -> ['User', 'Auth']"""
    return re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', text)


def split_snake_case(text: str) -> List[str]:
    """Split snake_case into words: user_auth -> ['user', 'auth']"""
    # Drop empty parts produced by leading, trailing, or doubled underscores
    return [part for part in text.split('_') if part]


def preprocess_query(query: str) -> str:
    """Preprocess query for better recall.

    Transforms:
        - UserAuth -> "UserAuth" OR "User Auth"
        - user_auth -> "user_auth" OR "user auth"
    """
    terms = []
    for word in query.split():
        # Handle CamelCase
        if re.match(r'^[A-Z][a-z]+[A-Z]', word):
            parts = split_camel_case(word)
            terms.append(f'"{word}"')  # Original
            terms.append(f'"{" ".join(parts)}"')  # Split
        # Handle snake_case
        elif '_' in word:
            parts = split_snake_case(word)
            terms.append(f'"{word}"')  # Original
            terms.append(f'"{" ".join(parts)}"')  # Split
        else:
            terms.append(word)

    # Combine with OR for recall; an empty query yields an empty string
    return " OR ".join(terms)
```
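The regular expression used by `split_camel_case` can be exercised on its own to see where it behaves well and where it does not. The sample identifiers below are illustrative; the constant name `CAMEL_RE` is an assumption introduced for this sketch.

```python
import re
from typing import List

# Same pattern as split_camel_case above, compiled once
CAMEL_RE = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')


def split_camel_case(text: str) -> List[str]:
    """Return the word-like pieces of a CamelCase identifier."""
    return CAMEL_RE.findall(text)


for word in ("UserAuth", "getHTTPResponse", "OAuth2Client"):
    print(word, "->", split_camel_case(word))
```

Acronym runs such as `HTTP` stay together, but digits match neither alternative and are dropped, so `OAuth2Client` loses its `2`. If numeric suffixes matter for recall, the pattern would need a digit alternative.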
---
## Part 3: Database Schema Changes
### 3.1 New Tables
```sql
-- Exact FTS table (code-friendly tokenizer)
CREATE VIRTUAL TABLE files_fts_exact USING fts5(
name, full_path UNINDEXED, content,
content='files',
content_rowid='id',
tokenize="unicode61 tokenchars '_-'"
);
-- Fuzzy FTS table (trigram for substring matching)
CREATE VIRTUAL TABLE files_fts_fuzzy USING fts5(
name, full_path UNINDEXED, content,
content='files',
content_rowid='id',
tokenize="trigram"
);
-- File hash for robust change detection (optional enhancement)
ALTER TABLE files ADD COLUMN content_hash TEXT;
CREATE INDEX idx_files_hash ON files(content_hash);
```
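One way the optional `content_hash` column could complement the mtime check is sketched below: the cheap mtime comparison runs first, and the file is hashed only when the mtime differs. The helper names `content_hash` and `has_changed`, the choice of SHA-256, and the reuse of the 1 ms tolerance are assumptions for illustration, not part of the existing codebase.

```python
import hashlib
from pathlib import Path
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hex digest used as the stored content_hash value (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def has_changed(path: Path,
                stored_mtime: Optional[float],
                stored_hash: Optional[str]) -> bool:
    """Return True only when the file content differs from what was indexed."""
    if stored_mtime is None or stored_hash is None:
        return True  # Never indexed, or indexed before the hash column existed
    if abs(path.stat().st_mtime - stored_mtime) <= 0.001:
        return False  # Same mtime: skip the more expensive hash
    # mtime moved (touch, checkout, copy): confirm with the content hash
    return content_hash(path.read_bytes()) != stored_hash
```

This avoids re-indexing files whose timestamps changed but whose bytes did not, for example after a fresh checkout.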
### 3.2 Migration Script
**File**: `codexlens/storage/migrations/migration_004_dual_fts.py` (NEW)
```python
"""Migration 004: Dual-FTS architecture."""
def upgrade(db_conn):
    """Upgrade to Dual-FTS schema."""
    cursor = db_conn.cursor()

    # Check current schema
    tables = cursor.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name LIKE 'files_fts%'"
    ).fetchall()
    existing = {t[0] for t in tables}

    # Drop legacy single FTS table
    # NOTE: triggers on `files` that still write to files_fts are not removed
    # by DROP TABLE and must be dropped as well, or later writes will fail.
    if "files_fts" in existing and "files_fts_exact" not in existing:
        cursor.execute("DROP TABLE IF EXISTS files_fts")

    # Create new Dual-FTS tables
    from codexlens.storage.schema import create_dual_fts_schema
    result = create_dual_fts_schema(db_conn)

    # Rebuild indexes from existing content
    cursor.execute("""
        INSERT INTO files_fts_exact(rowid, name, full_path, content)
        SELECT id, name, full_path, content FROM files
    """)
    cursor.execute("""
        INSERT INTO files_fts_fuzzy(rowid, name, full_path, content)
        SELECT id, name, full_path, content FROM files
    """)
    db_conn.commit()
    return result
```
---
## Part 4: API Contracts
### 4.1 Search API
```python
# New unified search interface
from dataclasses import dataclass
from typing import List

from codexlens.entities import SearchResult


@dataclass
class SearchOptions:
    query: str
    limit: int = 20
    offset: int = 0
    enable_exact: bool = True    # FTS exact matching
    enable_fuzzy: bool = True    # Trigram fuzzy matching
    enable_vector: bool = False  # Semantic vector search
    exact_weight: float = 0.4
    fuzzy_weight: float = 0.3
    vector_weight: float = 0.3


@dataclass
class SearchResponse:
    results: List[SearchResult]
    total: int
    search_modes: List[str]  # ["exact", "fuzzy", "vector"]
    trigram_available: bool


# API endpoint signature
def search(options: SearchOptions) -> SearchResponse:
    """Unified hybrid search."""
    raise NotImplementedError
```
### 4.2 Indexing API
```python
# Enhanced indexing with incremental support
from dataclasses import dataclass
from pathlib import Path


@dataclass
class IndexOptions:
    path: Path
    incremental: bool = True      # Skip unchanged files
    force: bool = False           # Force reindex all
    detect_encoding: bool = True  # Auto-detect file encoding


@dataclass
class IndexResult:
    total_files: int
    indexed_files: int
    skipped_files: int  # Unchanged files skipped
    encoding_errors: int


def index_directory(options: IndexOptions) -> IndexResult:
    """Index directory with incremental support."""
    raise NotImplementedError
```
---
## Part 5: Implementation Roadmap
### Phase 1: Foundation (Week 1)
- [ ] Implement encoding detection module
- [ ] Update file reading in `dir_index.py` and `index_tree.py`
- [ ] Add chardet/charset-normalizer dependency
- [ ] Write unit tests for encoding detection
### Phase 2: Dual-FTS (Week 2)
- [ ] Create `schema.py` with Dual-FTS definitions
- [ ] Implement trigram compatibility check
- [ ] Write migration script
- [ ] Update `DirIndexStore` with dual search methods
- [ ] Test FTS5 trigram on target platforms
### Phase 3: Hybrid Search (Week 3)
- [ ] Implement `HybridSearchEngine`
- [ ] Implement `ranking.py` with RRF
- [ ] Create `query_parser.py`
- [ ] Integrate with `ChainSearchEngine`
- [ ] Write integration tests
### Phase 4: Incremental Indexing (Week 4)
- [ ] Add `needs_reindex()` method
- [ ] Implement `add_file_incremental()`
- [ ] Update `IndexTreeBuilder` to use incremental API
- [ ] Add optional content hash column
- [ ] Performance benchmarking
### Phase 5: Vector Integration (Week 5)
- [ ] Update `VectorStore` for hybrid integration
- [ ] Implement vector search in `HybridSearchEngine`
- [ ] Tune RRF weights for optimal results
- [ ] End-to-end testing
---
## Part 6: Performance Considerations
### 6.1 Indexing Performance
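- **Incremental indexing**: Skips unchanged files on re-index; the expected skip rate is roughly 90% (an estimate, to be confirmed by the Phase 4 benchmarking)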
- **Parallel file processing**: ThreadPoolExecutor for parsing
- **Batch commits**: Commit every 100 files to reduce I/O
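The batch-commit idea can be sketched with the standard `sqlite3` module. The function name `insert_in_batches` and the two-column `files` table are simplifying assumptions for illustration; the real `files` table has more columns.

```python
import sqlite3
from typing import Iterable, Tuple


def insert_in_batches(conn: sqlite3.Connection,
                      rows: Iterable[Tuple[str, str]],
                      batch_size: int = 100) -> int:
    """Insert rows, committing once per batch. Returns the number of commits."""
    commits = 0
    pending = 0
    for name, content in rows:
        conn.execute("INSERT INTO files(name, content) VALUES (?, ?)", (name, content))
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # Flush the final partial batch
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, content TEXT)")
rows = [(f"file_{i}.py", "pass") for i in range(250)]
commit_count = insert_in_batches(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

With 250 files and a batch size of 100, this performs three commits instead of 250. The trade-off is that a crash loses at most one uncommitted batch, which the next incremental run would pick up again.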
### 6.2 Search Performance
- **Parallel retrieval**: Execute FTS + Vector searches concurrently
- **Early termination**: Stop after finding enough high-confidence matches
- **Result caching**: LRU cache for frequent queries
### 6.3 Storage Overhead
- **Dual-FTS**: Roughly 2x the FTS index size (exact + fuzzy), as an estimate
- **Trigram**: An estimated 3-5x the content size, because every three-character window of the text is indexed
- **Mitigation**: Optional fuzzy index, configurable per project
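The expansion behind the trigram estimate can be seen by listing the overlapping three-character windows of a short identifier. This is a pure-Python illustration of the idea, not the FTS5 tokenizer itself; the helper name `trigrams` is an assumption.

```python
from typing import List


def trigrams(text: str) -> List[str]:
    """Return every overlapping three-character window of the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


sample = "user_auth"
grams = trigrams(sample)
print(grams)
print(f"{len(sample)} characters -> {len(grams)} trigram tokens")
```

A string of n characters produces n - 2 tokens, whereas a word tokenizer would index `user_auth` as a single token. That per-character token count is what drives the larger fuzzy index.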
---
## Part 7: Risk Assessment
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| SQLite trigram not available | Medium | High | Fallback to extended unicode61 |
| Performance degradation | Low | Medium | Parallel search, caching |
| Migration data loss | Low | High | Backup before migration |
| Encoding detection false positives | Medium | Low | Use replace mode, log warnings |
---
## Appendix: Reference Project Learnings
### From Codanna (Rust)
- **N-gram tokenizer (3-10)**: Enables partial matching for code symbols
- **Compound BooleanQuery**: Combines exact + fuzzy + prefix in single query
- **File hash change detection**: More robust than mtime alone
### From Code-Index-MCP (Python)
- **Dual-index architecture**: Fast shallow index + rich deep index
- **External tool integration**: Wrap ripgrep for performance
- **AST-based parsing**: Single-pass symbol extraction
- **ReDoS protection**: Validate regex patterns before execution