Add comprehensive tests for schema cleanup migration and search comparison

- Implement tests for migration 005 to verify removal of deprecated fields in the database schema. - Ensure that new databases are created with a clean schema. - Validate that keywords are correctly extracted from the normalized file_keywords table. - Test symbol insertion without deprecated fields and subdir operations without direct_files. - Create a detailed search comparison test to evaluate vector search vs hybrid search performance. - Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality. - Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2026-03-30 20:21:09 +08:00 · 2025-12-16 19:27:05 +08:00
parent 3da0ef2adb
commit df23975a0b
61 changed files with 13114 additions and 366 deletions
--- a/codex-lens/docs/SEARCH_ANALYSIS_SUMMARY.md
+++ b/codex-lens/docs/SEARCH_ANALYSIS_SUMMARY.md
@@ -0,0 +1,192 @@
+# CodexLens 搜索分析 - 执行摘要
+
+## 🎯 核心发现
+
+### 问题1：向量搜索为什么返回空结果？
+
+**根本原因**：向量嵌入数据不存在
+
+- ✗ `semantic_chunks` 表未创建
+- ✗ 从未执行向量嵌入生成流程
+- ✗ 向量索引数据库实际是 SQLite 中的一个表，不是独立文件
+
+**位置**：向量数据存储在 `~/.codexlens/indexes/项目名/_index.db` 的 `semantic_chunks` 表中
+
+### 问题2：向量索引数据库在哪里？
+
+**存储架构**：
+```
+~/.codexlens/indexes/
+└── project-name/
+    └── _index.db          ← SQLite数据库
+        ├── files          ← 文件索引表
+        ├── files_fts      ← FTS5全文索引
+        ├── files_fts_fuzzy ← 模糊搜索索引
+        └── semantic_chunks ← 向量嵌入表（当前不存在！）
+```
+
+**不是独立数据库**：向量数据集成在 SQLite 索引文件中，而不是单独的向量数据库。
+
+### 问题3：当前架构是否发挥了并行效果？
+
+**✓ 是的！架构非常优秀**
+
+- **双层并行**：
+  - 第1层：单索引内，exact/fuzzy/vector 三种搜索方法并行
+  - 第2层：跨多个目录索引并行搜索
+- **性能表现**：混合模式仅增加 1.6x 开销（9ms vs 5.6ms）
+- **资源利用**：ThreadPoolExecutor 充分利用 I/O 并发
+
+## ⚡ 快速修复
+
+### 立即解决向量搜索问题
+
+**步骤1：安装依赖**
+```bash
+pip install codexlens[semantic]
+# 或
+pip install fastembed numpy
+```
+
+**步骤2：生成向量嵌入**
+
+创建脚本 `generate_embeddings.py`:
+```python
+from pathlib import Path
+from codexlens.semantic.embedder import Embedder
+from codexlens.semantic.vector_store import VectorStore
+from codexlens.semantic.chunker import Chunker, ChunkConfig
+import sqlite3
+
+def generate_embeddings(index_db_path: Path):
+    embedder = Embedder(profile="code")
+    vector_store = VectorStore(index_db_path)
+    chunker = Chunker(config=ChunkConfig(max_chunk_size=2000))
+
+    with sqlite3.connect(index_db_path) as conn:
+        conn.row_factory = sqlite3.Row
+        files = conn.execute("SELECT full_path, content FROM files").fetchall()
+
+    for file_row in files:
+        chunks = chunker.chunk_sliding_window(
+            file_row["content"],
+            file_path=file_row["full_path"],
+            language="python"
+        )
+        for chunk in chunks:
+            chunk.embedding = embedder.embed_single(chunk.content)
+        if chunks:
+            vector_store.add_chunks(chunks, file_row["full_path"])
+```
+
+**步骤3：执行生成**
+```bash
+python generate_embeddings.py ~/.codexlens/indexes/codex-lens/_index.db
+```
+
+**步骤4：验证**
+```bash
+# 检查数据
+sqlite3 ~/.codexlens/indexes/codex-lens/_index.db \
+    "SELECT COUNT(*) FROM semantic_chunks"
+
+# 测试搜索
+codexlens search "authentication credentials" --mode vector
+```
+
+## 🔍 关键洞察
+
+### 发现：Vector模式不是纯向量搜索
+
+**当前行为**：
+```python
+# hybrid_search.py:73
+backends = {"exact": True}  # ⚠️ exact搜索总是启用！
+if enable_vector:
+    backends["vector"] = True
+```
+
+**影响**：
+- "vector模式"实际是 **vector + exact 混合模式**
+- 即使向量搜索返回空，仍有exact FTS结果
+- 这就是为什么"向量搜索"在无嵌入时也有结果
+
+**建议修复**：添加 `pure_vector` 参数以支持真正的纯向量搜索
+
+## 📊 搜索模式对比
+
+| 模式 | 延迟 | 召回率 | 适用场景 | 需要嵌入 |
+|------|------|--------|----------|---------|
+| **exact** | 5.6ms | 中 | 代码标识符 | ✗ |
+| **fuzzy** | 7.7ms | 高 | 容错搜索 | ✗ |
+| **vector** | 7.4ms | 最高 | 语义搜索 | ✓ |
+| **hybrid** | 9.0ms | 最高 | 通用搜索 | ✓ |
+
+**推荐**：
+- 代码搜索 → `--mode exact`
+- 自然语言 → `--mode hybrid`（需先生成嵌入）
+- 容错搜索 → `--mode fuzzy`
+
+## 📈 优化路线图
+
+### P0 - 立即 (本周)
+- [x] 生成向量嵌入
+- [ ] 验证向量搜索可用
+- [ ] 更新使用文档
+
+### P1 - 短期 (2周)
+- [ ] 添加 `pure_vector` 模式
+- [ ] 增量嵌入更新
+- [ ] 改进错误提示
+
+### P2 - 中期 (1-2月)
+- [ ] 混合分块策略
+- [ ] 查询扩展
+- [ ] 自适应权重
+
+### P3 - 长期 (3-6月)
+- [ ] FAISS加速
+- [ ] 向量压缩
+- [ ] 多模态搜索
+
+## 📚 详细文档
+
+完整分析报告：`SEARCH_COMPARISON_ANALYSIS.md`
+
+包含内容：
+- 详细问题诊断
+- 架构深度分析
+- 完整解决方案
+- 代码示例
+- 实施检查清单
+
+## 🎓 学习要点
+
+1. **向量搜索需要主动生成嵌入**：不会自动创建
+2. **双层并行架构很优秀**：无需额外优化
+3. **RRF融合算法工作良好**：多源结果合理融合
+4. **Vector模式非纯向量**：包含FTS作为后备
+
+## 💡 下一步行动
+
+```bash
+# 1. 安装依赖
+pip install codexlens[semantic]
+
+# 2. 创建索引（如果还没有）
+codexlens init ~/projects/your-project
+
+# 3. 生成嵌入
+python generate_embeddings.py ~/.codexlens/indexes/your-project/_index.db
+
+# 4. 测试搜索
+codexlens search "your natural language query" --mode hybrid
+```
+
+---
+
+**问题解决**: ✓ 已识别并提供解决方案
+**架构评估**: ✓ 并行架构优秀，充分发挥效能
+**优化建议**: ✓ 提供短期、中期、长期优化路线
+
+**联系**: 详见 `SEARCH_COMPARISON_ANALYSIS.md` 获取完整技术细节