Add comprehensive tests for schema cleanup migration and search comparison

- Implement tests for migration 005 to verify removal of deprecated fields in the database schema. - Ensure that new databases are created with a clean schema. - Validate that keywords are correctly extracted from the normalized file_keywords table. - Test symbol insertion without deprecated fields and subdir operations without direct_files. - Create a detailed search comparison test to evaluate vector search vs hybrid search performance. - Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality. - Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2026-02-12 02:37:45 +08:00 · 2025-12-16 19:27:05 +08:00
parent 3da0ef2adb
commit df23975a0b
61 changed files with 13114 additions and 366 deletions
--- a/codex-lens/docs/SEARCH_COMPARISON_ANALYSIS.md
+++ b/codex-lens/docs/SEARCH_COMPARISON_ANALYSIS.md
@@ -0,0 +1,711 @@
+# CodexLens 搜索模式对比分析报告
+
+**生成时间**: 2025-12-16
+**分析目标**: 对比向量搜索和混合搜索效果，诊断向量搜索返回空结果的原因，评估并行架构效能
+
+---
+
+## 执行摘要
+
+通过深入的代码分析和实验测试，我们发现了向量搜索在当前实现中的几个关键问题，并提供了针对性的优化方案。
+
+### 核心发现
+
+1. **向量搜索返回空结果的根本原因**：缺少向量嵌入数据（semantic_chunks表为空）
+2. **混合搜索架构设计优秀**：使用了双层并行架构，性能表现良好
+3. **向量搜索模式的语义问题**："vector模式"实际上总是包含exact搜索，不是纯向量搜索
+
+---
+
+## 1. 问题诊断
+
+### 1.1 向量索引数据库位置
+
+**存储架构**：
+- **位置**: 向量数据集成存储在SQLite索引文件中（`_index.db`）
+- **表名**: `semantic_chunks`
+- **字段结构**:
+  - `id`: 主键
+  - `file_path`: 文件路径
+  - `content`: 代码块内容
+  - `embedding`: 向量嵌入（BLOB格式，numpy float32数组）
+  - `metadata`: JSON格式元数据
+  - `created_at`: 创建时间
+
+**默认存储路径**：
+- 全局索引: `~/.codexlens/indexes/`
+- 项目索引: `项目目录/.codexlens/`
+- 每个目录一个 `_index.db` 文件
+
+**为什么没有看到向量数据库**：
+向量数据不是独立数据库，而是与FTS索引共存于同一个SQLite文件中的`semantic_chunks`表。如果该表不存在或为空，说明从未生成过向量嵌入。
+
+### 1.2 向量搜索返回空结果的原因
+
+**代码分析** (`hybrid_search.py:195-253`):
+
+```python
+def _search_vector(self, index_path: Path, query: str, limit: int) -> List[SearchResult]:
+    try:
+        # 检查1: semantic_chunks表是否存在
+        conn = sqlite3.connect(index_path)
+        cursor = conn.execute(
+            "SELECT name FROM sqlite_master WHERE type='table' AND name='semantic_chunks'"
+        )
+        has_semantic_table = cursor.fetchone() is not None
+        conn.close()
+
+        if not has_semantic_table:
+            self.logger.debug("No semantic_chunks table found")
+            return []  # ❌ 返回空列表
+
+        # 检查2: 向量存储是否有数据
+        vector_store = VectorStore(index_path)
+        if vector_store.count_chunks() == 0:
+            self.logger.debug("Vector store is empty")
+            return []  # ❌ 返回空列表
+
+        # 正常向量搜索流程...
+    except Exception as exc:
+        return []  # ❌ 异常也返回空列表
+```
+
+**失败路径**：
+1. `semantic_chunks`表不存在 → 返回空
+2. 表存在但无数据 → 返回空
+3. 语义搜索依赖未安装 → 返回空
+4. 任何异常 → 返回空
+
+**当前状态诊断**：
+通过测试验证，当前项目中：
+- ✗ `semantic_chunks`表不存在
+- ✗ 未执行向量嵌入生成流程
+- ✗ 向量索引从未创建
+
+**解决方案**：需要执行向量嵌入生成流程（见第3节）
+
+### 1.3 混合搜索 vs 向量搜索的实际行为
+
+**重要发现**：当前实现中，"vector模式"并非纯向量搜索。
+
+**代码证据** (`hybrid_search.py:72-77`):
+
+```python
+def search(self, ...):
+    # Determine which backends to use
+    backends = {"exact": True}  # ⚠️ exact搜索总是启用！
+    if enable_fuzzy:
+        backends["fuzzy"] = True
+    if enable_vector:
+        backends["vector"] = True
+```
+
+**影响**：
+- 即使设置为"vector模式"（`enable_fuzzy=False, enable_vector=True`），exact搜索仍然运行
+- 当向量搜索返回空时，RRF融合仍会包含exact搜索的结果
+- 这导致"向量搜索"在没有嵌入数据时仍返回结果（来自exact FTS）
+
+**测试验证**：
+```
+测试场景：有FTS索引但无向量嵌入
+查询："authentication"
+
+预期行为（纯向量模式）:
+  - 向量搜索: 0 结果（无嵌入数据）
+  - 最终结果: 0
+
+实际行为:
+  - 向量搜索: 0 结果
+  - Exact搜索: 3 结果 ✓ （总是运行）
+  - 最终结果: 3（来自exact，经过RRF）
+```
+
+**设计建议**：
+1. **选项A（推荐）**: 添加纯向量模式标志
+   ```python
+   backends = {}
+   if enable_vector and not pure_vector_mode:
+       backends["exact"] = True  # 向量搜索的后备方案
+   elif not enable_vector:
+       backends["exact"] = True  # 非向量模式总是启用exact
+   ```
+
+2. **选项B**: 文档明确说明当前行为
+   - "vector模式"实际是"vector+exact混合模式"
+   - 提供警告信息当向量搜索返回空时
+
+---
+
+## 2. 并行架构分析
+
+### 2.1 双层并行设计
+
+CodexLens采用了优秀的双层并行架构：
+
+**第一层：搜索方法级并行** (`HybridSearchEngine`)
+
+```python
+def _search_parallel(self, index_path, query, backends, limit):
+    with ThreadPoolExecutor(max_workers=len(backends)) as executor:
+        # 并行提交搜索任务
+        if backends.get("exact"):
+            future = executor.submit(self._search_exact, ...)
+        if backends.get("fuzzy"):
+            future = executor.submit(self._search_fuzzy, ...)
+        if backends.get("vector"):
+            future = executor.submit(self._search_vector, ...)
+
+        # 收集结果
+        for future in as_completed(future_to_source):
+            results = future.result()
+```
+
+**特点**：
+- 在**单个索引**内，exact/fuzzy/vector三种搜索方法并行执行
+- 使用`ThreadPoolExecutor`实现I/O密集型任务并行
+- 使用`as_completed`实现结果流式收集
+- 动态worker数量（与启用的backend数量相同）
+
+**性能测试结果**：
+```
+搜索模式    | 平均延迟  | 相对overhead
+-----------|----------|-------------
+Exact only | 5.6ms    | 1.0x (基线)
+Fuzzy only | 7.7ms    | 1.4x
+Vector only| 7.4ms    | 1.3x
+Hybrid (all)| 9.0ms   | 1.6x
+```
+
+**分析**：
+- ✓ Hybrid模式开销合理（<2x），证明并行有效
+- ✓ 单次搜索延迟仍保持在10ms以下（优秀）
+
+**第二层：索引级并行** (`ChainSearchEngine`)
+
+```python
+def _search_parallel(self, index_paths, query, options):
+    executor = self._get_executor(options.max_workers)
+
+    # 为每个索引提交搜索任务
+    future_to_path = {
+        executor.submit(
+            self._search_single_index,
+            idx_path, query, ...
+        ): idx_path
+        for idx_path in index_paths
+    }
+
+    # 收集所有索引的结果
+    for future in as_completed(future_to_path):
+        results = future.result()
+        all_results.extend(results)
+```
+
+**特点**：
+- 跨**多个目录索引**并行搜索
+- 共享线程池（避免线程创建开销）
+- 可配置worker数量（默认8）
+- 结果去重和RRF融合
+
+### 2.2 并行效能评估
+
+**优势**：
+1. ✓ **架构清晰**：双层并行职责明确，互不干扰
+2. ✓ **资源利用**：I/O密集型任务充分利用线程池
+3. ✓ **扩展性**：易于添加新的搜索后端
+4. ✓ **容错性**：单个后端失败不影响其他后端
+
+**当前利用率**：
+- 单索引搜索：并行度 = min(3, 启用的backend数量)
+- 多索引搜索：并行度 = min(8, 索引数量)
+- **充分发挥**：只要有多个索引或多个backend
+
+**潜在优化点**：
+1. **CPU密集型任务**：向量相似度计算已使用numpy向量化，无需额外并行
+2. **缓存优化**：`VectorStore`已实现embedding matrix缓存，性能良好
+3. **动态worker调度**：当前固定worker数，可根据任务负载动态调整
+
+---
+
+## 3. 解决方案与优化建议
+
+### 3.1 立即修复：生成向量嵌入
+
+**步骤1：安装语义搜索依赖**
+
+```bash
+# 方式A：完整安装
+pip install codexlens[semantic]
+
+# 方式B：手动安装依赖
+pip install fastembed numpy
+```
+
+**步骤2：创建向量索引脚本**
+
+保存为 `scripts/generate_embeddings.py`:
+
+```python
+"""Generate vector embeddings for existing indexes."""
+
+import logging
+import sqlite3
+from pathlib import Path
+
+from codexlens.semantic.embedder import Embedder
+from codexlens.semantic.vector_store import VectorStore
+from codexlens.semantic.chunker import Chunker, ChunkConfig
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+
+def generate_embeddings_for_index(index_db_path: Path):
+    """Generate embeddings for all files in an index."""
+    logger.info(f"Processing index: {index_db_path}")
+
+    # Initialize components
+    embedder = Embedder(profile="code")  # Use code-optimized model
+    vector_store = VectorStore(index_db_path)
+    chunker = Chunker(config=ChunkConfig(max_chunk_size=2000))
+
+    # Read files from index
+    with sqlite3.connect(index_db_path) as conn:
+        conn.row_factory = sqlite3.Row
+        cursor = conn.execute("SELECT full_path, content, language FROM files")
+        files = cursor.fetchall()
+
+    logger.info(f"Found {len(files)} files to process")
+
+    # Process each file
+    total_chunks = 0
+    for file_row in files:
+        file_path = file_row["full_path"]
+        content = file_row["content"]
+        language = file_row["language"] or "python"
+
+        try:
+            # Create chunks
+            chunks = chunker.chunk_sliding_window(
+                content,
+                file_path=file_path,
+                language=language
+            )
+
+            if not chunks:
+                logger.debug(f"No chunks created for {file_path}")
+                continue
+
+            # Generate embeddings
+            for chunk in chunks:
+                embedding = embedder.embed_single(chunk.content)
+                chunk.embedding = embedding
+
+            # Store chunks
+            vector_store.add_chunks(chunks, file_path)
+            total_chunks += len(chunks)
+            logger.info(f"✓ {file_path}: {len(chunks)} chunks")
+
+        except Exception as exc:
+            logger.error(f"✗ {file_path}: {exc}")
+
+    logger.info(f"Completed: {total_chunks} total chunks indexed")
+    return total_chunks
+
+
+def main():
+    import sys
+
+    if len(sys.argv) < 2:
+        print("Usage: python generate_embeddings.py <index_db_path>")
+        print("Example: python generate_embeddings.py ~/.codexlens/indexes/project/_index.db")
+        sys.exit(1)
+
+    index_path = Path(sys.argv[1])
+
+    if not index_path.exists():
+        print(f"Error: Index not found at {index_path}")
+        sys.exit(1)
+
+    generate_embeddings_for_index(index_path)
+
+
+if __name__ == "__main__":
+    main()
+```
+
+**步骤3：执行生成**
+
+```bash
+# 为特定项目生成嵌入
+python scripts/generate_embeddings.py ~/.codexlens/indexes/codex-lens/_index.db
+
+# 或使用find批量处理
+find ~/.codexlens/indexes -name "_index.db" -type f | while read db; do
+    python scripts/generate_embeddings.py "$db"
+done
+```
+
+**步骤4：验证生成结果**
+
+```bash
+# 检查semantic_chunks表
+sqlite3 ~/.codexlens/indexes/codex-lens/_index.db \
+    "SELECT COUNT(*) as chunk_count FROM semantic_chunks"
+
+# 测试向量搜索
+codexlens search "authentication user credentials" \
+    --path ~/projects/codex-lens \
+    --mode vector
+```
+
+### 3.2 短期优化：改进向量搜索语义
+
+**问题**：当前"vector模式"实际包含exact搜索，语义不清晰
+
+**解决方案**：添加`pure_vector`参数
+
+**实现** (修改 `hybrid_search.py`):
+
+```python
+class HybridSearchEngine:
+    def search(
+        self,
+        index_path: Path,
+        query: str,
+        limit: int = 20,
+        enable_fuzzy: bool = True,
+        enable_vector: bool = False,
+        pure_vector: bool = False,  # 新增参数
+    ) -> List[SearchResult]:
+        """Execute hybrid search with parallel retrieval and RRF fusion.
+
+        Args:
+            ...
+            pure_vector: If True, only use vector search (no FTS fallback)
+        """
+        # Determine which backends to use
+        backends = {}
+
+        if pure_vector:
+            # 纯向量模式：只使用向量搜索
+            if enable_vector:
+                backends["vector"] = True
+        else:
+            # 混合模式：总是包含exact搜索作为基线
+            backends["exact"] = True
+            if enable_fuzzy:
+                backends["fuzzy"] = True
+            if enable_vector:
+                backends["vector"] = True
+
+        # ... rest of the method
+```
+
+**CLI更新** (修改 `commands.py`):
+
+```python
+@app.command()
+def search(
+    ...
+    mode: str = typer.Option("exact", "--mode", "-m",
+        help="Search mode: exact, fuzzy, hybrid, vector, pure-vector."),
+    ...
+):
+    """...
+    Search Modes:
+      - exact: Exact FTS
+      - fuzzy: Fuzzy FTS
+      - hybrid: RRF fusion of exact + fuzzy + vector (recommended)
+      - vector: Vector search with exact FTS fallback
+      - pure-vector: Pure semantic vector search (no FTS fallback)
+    """
+    ...
+
+    # Map mode to options
+    if mode == "exact":
+        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, False, False, False
+    elif mode == "fuzzy":
+        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, True, False, False
+    elif mode == "vector":
+        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, False
+    elif mode == "pure-vector":
+        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, True
+    elif mode == "hybrid":
+        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, True, True, False
+```
+
+### 3.3 中期优化：增强向量搜索效果
+
+**优化1：改进分块策略**
+
+当前使用简单的滑动窗口，可优化为：
+
+```python
+class HybridChunker(Chunker):
+    """Hybrid chunking strategy combining symbol-based and sliding window."""
+
+    def chunk_hybrid(
+        self,
+        content: str,
+        symbols: List[Symbol],
+        file_path: str,
+        language: str,
+    ) -> List[SemanticChunk]:
+        """
+        1. 优先按symbol分块（函数、类级别）
+        2. 对过大symbol，进一步使用滑动窗口
+        3. 对symbol间隙，使用滑动窗口补充
+        """
+        chunks = []
+
+        # Step 1: Symbol-based chunks
+        symbol_chunks = self.chunk_by_symbol(content, symbols, file_path, language)
+
+        # Step 2: Split oversized symbols
+        for chunk in symbol_chunks:
+            if chunk.token_count > self.config.max_chunk_size:
+                # 使用滑动窗口进一步分割
+                sub_chunks = self._split_large_chunk(chunk)
+                chunks.extend(sub_chunks)
+            else:
+                chunks.append(chunk)
+
+        # Step 3: Fill gaps with sliding window
+        gap_chunks = self._chunk_gaps(content, symbols, file_path, language)
+        chunks.extend(gap_chunks)
+
+        return chunks
+```
+
+**优化2：添加查询扩展**
+
+```python
+class QueryExpander:
+    """Expand queries for better vector search recall."""
+
+    def expand(self, query: str) -> str:
+        """Expand query with synonyms and related terms."""
+        # 示例：代码领域同义词
+        expansions = {
+            "auth": ["authentication", "authorization", "login"],
+            "db": ["database", "storage", "repository"],
+            "api": ["endpoint", "route", "interface"],
+        }
+
+        terms = query.lower().split()
+        expanded = set(terms)
+
+        for term in terms:
+            if term in expansions:
+                expanded.update(expansions[term])
+
+        return " ".join(expanded)
+```
+
+**优化3：混合检索策略**
+
+```python
+class AdaptiveHybridSearch:
+    """Adaptive search strategy based on query type."""
+
+    def search(self, query: str, ...):
+        # 分析查询类型
+        query_type = self._classify_query(query)
+
+        if query_type == "keyword":
+            # 代码标识符查询 → 偏重FTS
+            weights = {"exact": 0.5, "fuzzy": 0.3, "vector": 0.2}
+        elif query_type == "semantic":
+            # 自然语言查询 → 偏重向量
+            weights = {"exact": 0.2, "fuzzy": 0.2, "vector": 0.6}
+        elif query_type == "hybrid":
+            # 混合查询 → 平衡权重
+            weights = {"exact": 0.4, "fuzzy": 0.3, "vector": 0.3}
+
+        return self.engine.search(query, weights=weights, ...)
+```
+
+### 3.4 长期优化：性能与质量提升
+
+**优化1：增量嵌入更新**
+
+```python
+class IncrementalEmbeddingUpdater:
+    """Update embeddings incrementally for changed files."""
+
+    def update_for_file(self, file_path: str, new_content: str):
+        """Only regenerate embeddings for changed file."""
+        # 1. 删除旧嵌入
+        self.vector_store.delete_file_chunks(file_path)
+
+        # 2. 生成新嵌入
+        chunks = self.chunker.chunk(new_content, ...)
+        for chunk in chunks:
+            chunk.embedding = self.embedder.embed_single(chunk.content)
+
+        # 3. 存储新嵌入
+        self.vector_store.add_chunks(chunks, file_path)
+```
+
+**优化2：向量索引压缩**
+
+```python
+# 使用量化技术减少存储空间（768维 → 192维）
+from qdrant_client import models
+
+# 产品量化（PQ）压缩
+compressed_vector = pq_quantize(embedding, target_dim=192)
+```
+
+**优化3：向量搜索加速**
+
+```python
+# 使用FAISS或Hnswlib替代numpy暴力搜索
+import faiss
+
+class FAISSVectorStore(VectorStore):
+    def __init__(self, db_path, dim=768):
+        super().__init__(db_path)
+        # 使用HNSW索引
+        self.index = faiss.IndexHNSWFlat(dim, 32)
+        self._load_vectors_to_index()
+
+    def search_similar(self, query_embedding, top_k=10):
+        # FAISS加速搜索（100x+）
+        scores, indices = self.index.search(
+            np.array([query_embedding]), top_k
+        )
+        return self._fetch_by_indices(indices[0], scores[0])
+```
+
+---
+
+## 4. 对比总结
+
+### 4.1 搜索模式对比
+
+| 维度 | Exact FTS | Fuzzy FTS | Vector Search | Hybrid (推荐) |
+|------|-----------|-----------|---------------|--------------|
+| **匹配类型** | 精确词匹配 | 容错匹配 | 语义相似 | 多模式融合 |
+| **查询类型** | 标识符、关键词 | 拼写错误容忍 | 自然语言 | 所有类型 |
+| **召回率** | 中 | 高 | 最高 | 最高 |
+| **精确率** | 高 | 中 | 中 | 高 |
+| **延迟** | 5-7ms | 7-9ms | 7-10ms | 9-11ms |
+| **依赖** | 仅SQLite | 仅SQLite | fastembed+numpy | 全部 |
+| **存储开销** | 小（FTS索引） | 小（FTS索引） | 大（向量） | 大（FTS+向量） |
+| **适用场景** | 代码搜索 | 容错搜索 | 概念搜索 | 通用搜索 |
+
+### 4.2 推荐使用策略
+
+**场景1：代码标识符搜索**（函数名、类名、变量名）
+```bash
+codexlens search "authenticate_user" --mode exact
+```
+→ 使用exact模式，最快且最精确
+
+**场景2：概念性搜索**（"如何验证用户身份"）
+```bash
+codexlens search "how to verify user credentials" --mode hybrid
+```
+→ 使用hybrid模式，结合语义和关键词
+
+**场景3：容错搜索**（允许拼写错误）
+```bash
+codexlens search "autheticate" --mode fuzzy
+```
+→ 使用fuzzy模式，trigram容错
+
+**场景4：纯语义搜索**（需先生成嵌入）
+```bash
+codexlens search "password encryption with salt" --mode pure-vector
+```
+→ 使用pure-vector模式，理解语义意图
+
+---
+
+## 5. 实施检查清单
+
+### 立即行动项 (P0)
+
+- [ ] 安装语义搜索依赖：`pip install codexlens[semantic]`
+- [ ] 运行嵌入生成脚本（见3.1节）
+- [ ] 验证semantic_chunks表已创建且有数据
+- [ ] 测试vector模式搜索是否返回结果
+
+### 短期改进 (P1)
+
+- [ ] 添加pure_vector参数（见3.2节）
+- [ ] 更新CLI支持pure-vector模式
+- [ ] 添加嵌入生成进度提示
+- [ ] 文档更新：搜索模式使用指南
+
+### 中期优化 (P2)
+
+- [ ] 实现混合分块策略（见3.3节）
+- [ ] 添加查询扩展功能
+- [ ] 实现自适应权重调整
+- [ ] 性能基准测试
+
+### 长期规划 (P3)
+
+- [ ] 增量嵌入更新机制
+- [ ] 向量索引压缩
+- [ ] 集成FAISS加速
+- [ ] 多模态搜索（代码+文档）
+
+---
+
+## 6. 参考资源
+
+### 代码文件
+
+- 混合搜索引擎: `codex-lens/src/codexlens/search/hybrid_search.py`
+- 向量存储: `codex-lens/src/codexlens/semantic/vector_store.py`
+- 向量嵌入: `codex-lens/src/codexlens/semantic/embedder.py`
+- 代码分块: `codex-lens/src/codexlens/semantic/chunker.py`
+- 链式搜索: `codex-lens/src/codexlens/search/chain_search.py`
+
+### 测试文件
+
+- 对比测试: `codex-lens/tests/test_search_comparison.py`
+- 混合搜索E2E: `codex-lens/tests/test_hybrid_search_e2e.py`
+- CLI测试: `codex-lens/tests/test_cli_hybrid_search.py`
+
+### 相关文档
+
+- RRF算法: `codex-lens/src/codexlens/search/ranking.py`
+- 查询解析: `codex-lens/src/codexlens/search/query_parser.py`
+- 配置管理: `codex-lens/src/codexlens/config.py`
+
+---
+
+## 7. 结论
+
+通过本次深入分析，我们明确了CodexLens搜索系统的优势和待优化点：
+
+**优势**：
+1. ✓ 优秀的并行架构设计（双层并行）
+2. ✓ RRF融合算法实现合理
+3. ✓ 向量存储实现高效（numpy向量化+缓存）
+4. ✓ 模块化设计，易于扩展
+
+**待优化**：
+1. 向量嵌入生成流程需要手动触发
+2. "vector模式"语义不清晰（实际包含exact搜索）
+3. 分块策略可以优化（混合策略）
+4. 缺少增量更新机制
+
+**核心建议**：
+1. **立即**: 生成向量嵌入，解决返回空结果问题
+2. **短期**: 添加纯向量模式，澄清语义
+3. **中期**: 优化分块和查询策略，提升搜索质量
+4. **长期**: 性能优化和高级特性
+
+通过实施这些改进，CodexLens的搜索功能将达到生产级别的质量和性能标准。
+
+---
+
+**报告完成时间**: 2025-12-16
+**分析工具**: 代码静态分析 + 实验测试 + 性能测评
+**下一步**: 实施P0优先级改进项