Claude-Code-Workflow/codex-lens/docs/SEARCH_COMPARISON_ANALYSIS.md
catlog22 df23975a0b Add comprehensive tests for schema cleanup migration and search comparison
- Implement tests for migration 005 to verify removal of deprecated fields in the database schema.
- Ensure that new databases are created with a clean schema.
- Validate that keywords are correctly extracted from the normalized file_keywords table.
- Test symbol insertion without deprecated fields and subdir operations without direct_files.
- Create a detailed search comparison test to evaluate vector search vs hybrid search performance.
- Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality.
- Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2025-12-16 19:27:05 +08:00

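# CodexLens Search Mode Comparison Report
**Generated**: 2025-12-16
**Goal**: Compare vector search with hybrid search, diagnose why vector search returns empty results, and evaluate the effectiveness of the parallel architecture
---
## Executive Summary
Through code analysis and experimental testing, we identified several key problems with vector search in the current implementation and propose targeted optimizations.
### Key Findings
1. **Root cause of empty vector search results**: no embedding data exists (the `semantic_chunks` table is missing or empty)
2. **The hybrid search architecture is well designed**: it uses a two-layer parallel architecture and performs well
3. **Semantic problem with the vector search mode**: "vector mode" always includes exact search, so it is not a pure vector search
---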
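## 1. Problem Diagnosis
### 1.1 Location of the Vector Index Database
**Storage architecture**:
- **Location**: vector data is stored inside the SQLite index file (`_index.db`)
- **Table**: `semantic_chunks`
- **Columns**:
  - `id`: primary key
  - `file_path`: file path
  - `content`: code chunk content
  - `embedding`: vector embedding (BLOB, numpy float32 array)
  - `metadata`: metadata in JSON format
  - `created_at`: creation time
**Default storage paths**:
- Global indexes: `~/.codexlens/indexes/`
- Project indexes: `<project directory>/.codexlens/`
- One `_index.db` file per directory
**Why no vector database is visible**:
Vector data does not live in a separate database. It is the `semantic_chunks` table inside the same SQLite file as the FTS index. If that table is missing or empty, no embeddings have ever been generated.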
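To make the storage format concrete, the following sketch round-trips a float32 vector through a BLOB column using only the standard library. It is an illustration under assumptions: the column types, the sample row, and the helpers `to_blob` and `from_blob` are hypothetical, and the byte layout (native-endian float32) is assumed to match what a numpy float32 array would serialize to.

```python
import sqlite3
from array import array


def to_blob(vector):
    """Pack a sequence of floats as native-endian float32 bytes."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Unpack float32 bytes back into a list of Python floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


# In-memory stand-in for an `_index.db` file; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE semantic_chunks ("
    "id INTEGER PRIMARY KEY, file_path TEXT, content TEXT, "
    "embedding BLOB, metadata TEXT, created_at REAL)"
)
conn.execute(
    "INSERT INTO semantic_chunks (file_path, content, embedding) VALUES (?, ?, ?)",
    ("auth.py", "def login(): ...", to_blob([0.5, -1.25, 2.0])),
)

# Read the BLOB back and decode it into floats.
blob = conn.execute("SELECT embedding FROM semantic_chunks").fetchone()[0]
restored = from_blob(blob)
conn.close()
```

Each float32 value occupies four bytes, so the size of the BLOB grows linearly with the embedding dimension.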
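### 1.2 Why Vector Search Returns Empty Results
**Code analysis** (`hybrid_search.py:195-253`):
```python
def _search_vector(self, index_path: Path, query: str, limit: int) -> List[SearchResult]:
    try:
        # Check 1: does the semantic_chunks table exist?
        conn = sqlite3.connect(index_path)
        cursor = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name='semantic_chunks'"
        )
        has_semantic_table = cursor.fetchone() is not None
        conn.close()
        if not has_semantic_table:
            self.logger.debug("No semantic_chunks table found")
            return []  # ❌ returns an empty list
        # Check 2: does the vector store contain any data?
        vector_store = VectorStore(index_path)
        if vector_store.count_chunks() == 0:
            self.logger.debug("Vector store is empty")
            return []  # ❌ returns an empty list
        # Normal vector search flow...
    except Exception as exc:
        return []  # ❌ exceptions also return an empty list
```
**Failure paths**:
1. The `semantic_chunks` table does not exist → returns empty
2. The table exists but has no rows → returns empty
3. The semantic search dependencies are not installed → returns empty
4. Any exception → returns empty
**Current state**:
Testing confirms that in the current project:
- ✗ The `semantic_chunks` table does not exist
- ✗ The embedding generation workflow has never been run
- ✗ The vector index has never been created
**Solution**: run the embedding generation workflow (see Section 3).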
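Because all four failure paths produce the same empty list, it helps to check them one by one when diagnosing an index. The sketch below is one possible way to do that with the standard library. The function `diagnose_vector_search` and its reason strings are hypothetical and not part of CodexLens, and the default module list simply mirrors the dependencies named in this report.

```python
import importlib.util
import sqlite3
from pathlib import Path


def diagnose_vector_search(index_db_path: Path,
                           required_modules=("fastembed", "numpy")) -> str:
    """Return a short reason explaining why vector search would come back empty."""
    # Failure path 3: optional semantic dependencies are not installed.
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            return f"missing dependency: {name}"
    try:
        conn = sqlite3.connect(index_db_path)
        try:
            # Failure path 1: the table was never created.
            row = conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type='table' AND name='semantic_chunks'"
            ).fetchone()
            if row is None:
                return "semantic_chunks table missing"
            # Failure path 2: the table exists but holds no rows.
            count = conn.execute(
                "SELECT COUNT(*) FROM semantic_chunks"
            ).fetchone()[0]
        finally:
            conn.close()
    except sqlite3.Error as exc:
        # Failure path 4: any database error.
        return f"database error: {exc}"
    return "semantic_chunks table empty" if count == 0 else "ok"
```

Returning a reason instead of an empty list keeps the four cases distinguishable, which the current `_search_vector` only does at debug log level.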
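### 1.3 Hybrid Search vs. Vector Search: Actual Behavior
**Key finding**: in the current implementation, "vector mode" is not a pure vector search.
**Code evidence** (`hybrid_search.py:72-77`):
```python
def search(self, ...):
    # Determine which backends to use
    backends = {"exact": True}  # ⚠️ exact search is always enabled
    if enable_fuzzy:
        backends["fuzzy"] = True
    if enable_vector:
        backends["vector"] = True
```
**Impact**:
- Even in "vector mode" (`enable_fuzzy=False, enable_vector=True`), exact search still runs
- When vector search returns nothing, RRF fusion still includes the exact search results
- As a result, "vector search" still returns results (from exact FTS) when no embeddings exist
**Test verification**:
```
Scenario: FTS index present, no vector embeddings
Query: "authentication"
Expected behavior (pure vector mode):
  - Vector search: 0 results (no embeddings)
  - Final results: 0
Actual behavior:
  - Vector search: 0 results
  - Exact search: 3 results ✓ (always runs)
  - Final results: 3 (from exact, after RRF)
```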
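The behavior above follows directly from how Reciprocal Rank Fusion works: each backend contributes `1 / (k + rank)` per document, so a backend with an empty list contributes nothing and the other lists pass through in order. The sketch below is a generic RRF, not the implementation in `ranking.py`. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the file names are assumptions.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


exact_hits = ["auth.py", "login.py", "session.py"]  # hypothetical FTS hits
vector_hits = []  # no embeddings were generated

fused = rrf_fuse({"exact": exact_hits, "vector": vector_hits})
```

With an empty vector list, `fused` contains exactly the three exact hits in their original order, matching the "3 results from exact" outcome in the test scenario.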
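**Design recommendations**:
1. **Option A (recommended)**: add a pure-vector mode flag
   ```python
   backends = {}
   if not (enable_vector and pure_vector_mode):
       backends["exact"] = True  # exact FTS stays on as the baseline and fallback
   if enable_vector:
       backends["vector"] = True
   ```
2. **Option B**: document the current behavior explicitly
   - "vector mode" is really a "vector + exact hybrid mode"
   - Emit a warning when vector search returns no results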
---
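## 2. Parallel Architecture Analysis
### 2.1 Two-Layer Parallel Design
CodexLens uses a well-structured two-layer parallel architecture:
**Layer 1: parallelism across search methods** (`HybridSearchEngine`)
```python
def _search_parallel(self, index_path, query, backends, limit):
    with ThreadPoolExecutor(max_workers=len(backends)) as executor:
        # Submit the search tasks in parallel, remembering each future's source
        future_to_source = {}
        if backends.get("exact"):
            future_to_source[executor.submit(self._search_exact, ...)] = "exact"
        if backends.get("fuzzy"):
            future_to_source[executor.submit(self._search_fuzzy, ...)] = "fuzzy"
        if backends.get("vector"):
            future_to_source[executor.submit(self._search_vector, ...)] = "vector"
        # Collect results as they complete
        for future in as_completed(future_to_source):
            results = future.result()
```
**Characteristics**:
- Within a **single index**, the exact, fuzzy, and vector search methods run in parallel
- Uses `ThreadPoolExecutor` to parallelize the I/O-bound tasks
- Uses `as_completed` to collect results as they finish
- Dynamic worker count: equal to the number of enabled backends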
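The pattern described above can be tried in isolation. The following is a minimal, self-contained sketch rather than the `HybridSearchEngine` code: `run_backends` and the two toy backends are hypothetical, and the per-future `try/except` is one way to make a failing backend degrade to an empty result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_backends(backends, query):
    """Run every backend callable in parallel and collect results per source."""
    results = {}
    # One worker per enabled backend, as in the layer-1 design.
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as executor:
        future_to_source = {
            executor.submit(fn, query): name for name, fn in backends.items()
        }
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(future_to_source):
            source = future_to_source[future]
            try:
                results[source] = future.result()
            except Exception:
                results[source] = []  # a failing backend must not sink the others
    return results


def exact_backend(query):
    return [f"exact:{query}"]


def vector_backend(query):
    raise RuntimeError("no embeddings available")


out = run_backends({"exact": exact_backend, "vector": vector_backend},
                   "authentication")
```

Here `out` maps `"exact"` to its single hit and `"vector"` to an empty list, so the exact result survives the vector failure.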
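**Performance test results**:
```
Search mode  | Avg latency | Relative overhead
-------------|-------------|------------------
Exact only   | 5.6ms       | 1.0x (baseline)
Fuzzy only   | 7.7ms       | 1.4x
Vector only  | 7.4ms       | 1.3x
Hybrid (all) | 9.0ms       | 1.6x
```
**Analysis**:
- ✓ Hybrid mode overhead is reasonable (<2x), which indicates the parallelism is effective
- ✓ Single-query latency stays below 10 ms (excellent)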
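The relative overhead column is each latency divided by the exact-only baseline. The short calculation below reproduces it from the measured figures and adds a rough sequential estimate for comparison. The estimate is an assumption for illustration only: it simply sums the three single-mode latencies, each of which includes fixed per-query overhead, so it should be read as an upper bound.

```python
# Measured average latencies from the table above, in milliseconds.
latency_ms = {"exact": 5.6, "fuzzy": 7.7, "vector": 7.4, "hybrid": 9.0}

# Relative overhead = latency / exact-only baseline.
baseline = latency_ms["exact"]
overhead = {mode: round(ms / baseline, 1) for mode, ms in latency_ms.items()}

# If the three backends ran one after another, hybrid would cost roughly the sum.
sequential_estimate = round(
    latency_ms["exact"] + latency_ms["fuzzy"] + latency_ms["vector"], 1
)

# How much faster the measured hybrid run is than that sequential estimate.
speedup_vs_sequential = round(sequential_estimate / latency_ms["hybrid"], 1)
```

The measured 9.0 ms for hybrid is well below the summed estimate of about 20.7 ms, which is consistent with the three backends overlapping in time.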
**Layer 2: parallelism across indexes** (`ChainSearchEngine`)
```python
def _search_parallel(self, index_paths, query, options):
    executor = self._get_executor(options.max_workers)
    # Submit one search task per index
    future_to_path = {
        executor.submit(
            self._search_single_index,
            idx_path, query, ...
        ): idx_path
        for idx_path in index_paths
    }
    # Collect the results from every index
    all_results = []
    for future in as_completed(future_to_path):
        results = future.result()
        all_results.extend(results)
```
**Characteristics**:
- Searches **multiple directory indexes** in parallel
- Shared thread pool (avoids thread creation overhead)
- Configurable worker count (default 8)
- Result deduplication and RRF fusion
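The deduplication step can be pictured as keeping the best-scoring hit per file path once all indexes have reported. The sketch below is a simplified illustration under assumptions: the `Hit` type and `merge_hits` function are hypothetical, and the real `ChainSearchEngine` may key and rank results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit: a file path and a relevance score."""
    path: str
    score: float


def merge_hits(per_index_hits, limit=20):
    """Merge hits from several indexes, keeping the best score per path."""
    best = {}
    for hits in per_index_hits:
        for hit in hits:
            current = best.get(hit.path)
            if current is None or hit.score > current.score:
                best[hit.path] = hit
    # Highest score first; ties are broken by path for a stable order.
    ranked = sorted(best.values(), key=lambda h: (-h.score, h.path))
    return ranked[:limit]


merged = merge_hits([
    [Hit("src/auth.py", 0.9), Hit("src/db.py", 0.4)],
    [Hit("src/auth.py", 0.7), Hit("src/api.py", 0.6)],
])
```

In this example `src/auth.py` appears in two indexes but only once in `merged`, with its higher score of 0.9.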
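### 2.2 Parallel Efficiency Assessment
**Strengths**:
1. ✓ **Clear architecture**: the two layers have distinct responsibilities and do not interfere with each other
2. ✓ **Resource utilization**: I/O-bound tasks make full use of the thread pool
3. ✓ **Extensibility**: new search backends are easy to add
4. ✓ **Fault tolerance**: a failure in one backend does not affect the others
**Current utilization**:
- Single-index search: parallelism = min(3, number of enabled backends)
- Multi-index search: parallelism = min(8, number of indexes)
- **Fully utilized**: whenever there are multiple indexes or multiple backends
**Potential optimizations**:
1. **CPU-bound work**: vector similarity is already vectorized with numpy, so no extra parallelism is needed
2. **Caching**: `VectorStore` already caches the embedding matrix and performs well
3. **Dynamic worker scheduling**: the worker count is currently fixed and could be adjusted dynamically to the task load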
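For readers unfamiliar with the similarity computation mentioned in the first point, the sketch below shows brute-force cosine top-k search. It deliberately uses plain Python instead of numpy so that it runs without extra dependencies; the real `VectorStore` is described as doing the same comparison as a vectorized matrix operation. The function name `cosine_top_k` and the toy vectors are assumptions.

```python
import math


def cosine_top_k(query, vectors, top_k=10):
    """Return (score, index) pairs for the top_k vectors most similar to query."""

    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    query_norm = norm(query)
    scored = []
    for idx, vec in enumerate(vectors):
        denom = query_norm * norm(vec)
        dot = sum(a * b for a, b in zip(query, vec))
        # Zero-length vectors get a similarity of 0 instead of dividing by zero.
        scored.append((dot / denom if denom else 0.0, idx))
    # Highest similarity first; ties are broken by index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


top = cosine_top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], top_k=3)
```

Every stored vector is compared with the query, so the cost grows linearly with the number of chunks.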
---
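## 3. Solutions and Optimization Recommendations
### 3.1 Immediate Fix: Generate Vector Embeddings
**Step 1: Install the semantic search dependencies**
```bash
# Option A: full install (quoted so the shell does not expand the brackets)
pip install "codexlens[semantic]"
# Option B: install the dependencies manually
pip install fastembed numpy
```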
**Step 2: Create the embedding generation script**
Save as `scripts/generate_embeddings.py`:
```python
"""Generate vector embeddings for existing indexes."""
import logging
import sqlite3
import sys
from contextlib import closing
from pathlib import Path

from codexlens.semantic.embedder import Embedder
from codexlens.semantic.vector_store import VectorStore
from codexlens.semantic.chunker import Chunker, ChunkConfig

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def generate_embeddings_for_index(index_db_path: Path):
    """Generate embeddings for all files in an index."""
    logger.info(f"Processing index: {index_db_path}")
    # Initialize components
    embedder = Embedder(profile="code")  # Use code-optimized model
    vector_store = VectorStore(index_db_path)
    chunker = Chunker(config=ChunkConfig(max_chunk_size=2000))
    # Read files from index (closing() makes sure the connection is released)
    with closing(sqlite3.connect(index_db_path)) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.execute("SELECT full_path, content, language FROM files")
        files = cursor.fetchall()
    logger.info(f"Found {len(files)} files to process")
    # Process each file
    total_chunks = 0
    for file_row in files:
        file_path = file_row["full_path"]
        content = file_row["content"]
        language = file_row["language"] or "python"
        try:
            # Create chunks
            chunks = chunker.chunk_sliding_window(
                content,
                file_path=file_path,
                language=language
            )
            if not chunks:
                logger.debug(f"No chunks created for {file_path}")
                continue
            # Generate embeddings
            for chunk in chunks:
                embedding = embedder.embed_single(chunk.content)
                chunk.embedding = embedding
            # Store chunks
            vector_store.add_chunks(chunks, file_path)
            total_chunks += len(chunks)
            logger.info(f"✓ {file_path}: {len(chunks)} chunks")
        except Exception as exc:
            # Keep going: one bad file should not abort the whole run
            logger.error(f"✗ {file_path}: {exc}")
    logger.info(f"Completed: {total_chunks} total chunks indexed")
    return total_chunks


def main():
    if len(sys.argv) < 2:
        print("Usage: python generate_embeddings.py <index_db_path>")
        print("Example: python generate_embeddings.py ~/.codexlens/indexes/project/_index.db")
        sys.exit(1)
    index_path = Path(sys.argv[1])
    if not index_path.exists():
        print(f"Error: Index not found at {index_path}")
        sys.exit(1)
    generate_embeddings_for_index(index_path)


if __name__ == "__main__":
    main()
```
**Step 3: Run the generation**
```bash
# Generate embeddings for a specific project
python scripts/generate_embeddings.py ~/.codexlens/indexes/codex-lens/_index.db
# Or process every index in bulk with find
find ~/.codexlens/indexes -name "_index.db" -type f | while IFS= read -r db; do
  python scripts/generate_embeddings.py "$db"
done
```
**Step 4: Verify the result**
```bash
# Check the semantic_chunks table
sqlite3 ~/.codexlens/indexes/codex-lens/_index.db \
  "SELECT COUNT(*) as chunk_count FROM semantic_chunks"
# Test vector search
codexlens search "authentication user credentials" \
  --path ~/projects/codex-lens \
  --mode vector
```
### 3.2 Short-Term Optimization: Clarify Vector Search Semantics
**Problem**: the current "vector mode" actually includes exact search, so its semantics are unclear
**Solution**: add a `pure_vector` parameter
**Implementation** (modify `hybrid_search.py`):
```python
class HybridSearchEngine:
    def search(
        self,
        index_path: Path,
        query: str,
        limit: int = 20,
        enable_fuzzy: bool = True,
        enable_vector: bool = False,
        pure_vector: bool = False,  # new parameter
    ) -> List[SearchResult]:
        """Execute hybrid search with parallel retrieval and RRF fusion.

        Args:
            ...
            pure_vector: If True, only use vector search (no FTS fallback)
        """
        # Determine which backends to use
        backends = {}
        if pure_vector:
            # Pure vector mode: use vector search only
            if enable_vector:
                backends["vector"] = True
        else:
            # Hybrid mode: always include exact search as the baseline
            backends["exact"] = True
            if enable_fuzzy:
                backends["fuzzy"] = True
            if enable_vector:
                backends["vector"] = True
        # ... rest of the method
```
**CLI update** (modify `commands.py`):
```python
@app.command()
def search(
    ...
    mode: str = typer.Option("exact", "--mode", "-m",
        help="Search mode: exact, fuzzy, hybrid, vector, pure-vector."),
    ...
):
    """...
    Search Modes:
    - exact: Exact FTS
    - fuzzy: Fuzzy FTS
    - hybrid: RRF fusion of exact + fuzzy + vector (recommended)
    - vector: Vector search with exact FTS fallback
    - pure-vector: Pure semantic vector search (no FTS fallback)
    """
    ...
    # Map mode to options
    if mode == "exact":
        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, False, False, False
    elif mode == "fuzzy":
        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, True, False, False
    elif mode == "vector":
        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, False
    elif mode == "pure-vector":
        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, True
    elif mode == "hybrid":
        hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, True, True, False
    else:
        # Reject unknown modes instead of leaving the flags undefined
        raise typer.BadParameter(f"Unknown search mode: {mode}")
```
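The mode-to-flags mapping can also be expressed as a lookup table, which keeps the five modes in one place and makes an unknown mode an explicit error. This is an alternative sketch, not the code in `commands.py`: `ModeOptions`, `MODE_OPTIONS`, and `resolve_mode` are hypothetical names, and the flag values are copied from the mapping above.

```python
from typing import NamedTuple


class ModeOptions(NamedTuple):
    """Flags derived from the --mode value."""
    hybrid_mode: bool
    enable_fuzzy: bool
    enable_vector: bool
    pure_vector: bool


MODE_OPTIONS = {
    "exact": ModeOptions(False, False, False, False),
    "fuzzy": ModeOptions(False, True, False, False),
    "vector": ModeOptions(True, False, True, False),
    "pure-vector": ModeOptions(True, False, True, True),
    "hybrid": ModeOptions(True, True, True, False),
}


def resolve_mode(mode: str) -> ModeOptions:
    """Translate a mode name into its flags, rejecting unknown names."""
    try:
        return MODE_OPTIONS[mode]
    except KeyError:
        valid = ", ".join(sorted(MODE_OPTIONS))
        raise ValueError(
            f"Unknown search mode {mode!r}; expected one of: {valid}"
        ) from None


options = resolve_mode("pure-vector")
```

A table like this also lets the help text and the validation share a single source of truth for the list of modes.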
### 3.3 Medium-Term Optimization: Improve Vector Search Quality
**Optimization 1: improve the chunking strategy**
The current implementation uses a simple sliding window. It could be improved as follows:
```python
class HybridChunker(Chunker):
    """Hybrid chunking strategy combining symbol-based and sliding window."""

    def chunk_hybrid(
        self,
        content: str,
        symbols: List[Symbol],
        file_path: str,
        language: str,
    ) -> List[SemanticChunk]:
        """
        1. Chunk by symbol first (function and class level)
        2. Split oversized symbols further with a sliding window
        3. Cover the gaps between symbols with a sliding window
        """
        chunks = []
        # Step 1: Symbol-based chunks
        symbol_chunks = self.chunk_by_symbol(content, symbols, file_path, language)
        # Step 2: Split oversized symbols
        for chunk in symbol_chunks:
            if chunk.token_count > self.config.max_chunk_size:
                # Split further with a sliding window
                sub_chunks = self._split_large_chunk(chunk)
                chunks.extend(sub_chunks)
            else:
                chunks.append(chunk)
        # Step 3: Fill gaps with sliding window
        gap_chunks = self._chunk_gaps(content, symbols, file_path, language)
        chunks.extend(gap_chunks)
        return chunks
```
**Optimization 2: add query expansion**
```python
class QueryExpander:
    """Expand queries for better vector search recall."""

    def expand(self, query: str) -> str:
        """Expand query with synonyms and related terms."""
        # Example: synonyms from the code domain
        expansions = {
            "auth": ["authentication", "authorization", "login"],
            "db": ["database", "storage", "repository"],
            "api": ["endpoint", "route", "interface"],
        }
        terms = query.lower().split()
        # A dict keeps insertion order, so the output is de-duplicated and stable
        expanded = dict.fromkeys(terms)
        for term in terms:
            expanded.update(dict.fromkeys(expansions.get(term, [])))
        return " ".join(expanded)
```
**Optimization 3: adaptive hybrid retrieval**
```python
class AdaptiveHybridSearch:
    """Adaptive search strategy based on query type."""

    def search(self, query: str, ...):
        # Classify the query
        query_type = self._classify_query(query)
        if query_type == "keyword":
            # Code identifier query → favor FTS
            weights = {"exact": 0.5, "fuzzy": 0.3, "vector": 0.2}
        elif query_type == "semantic":
            # Natural language query → favor vectors
            weights = {"exact": 0.2, "fuzzy": 0.2, "vector": 0.6}
        else:
            # Mixed query ("hybrid") → balanced weights
            weights = {"exact": 0.4, "fuzzy": 0.3, "vector": 0.3}
        return self.engine.search(query, weights=weights, ...)
```
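The strategy above depends on a query classifier that the report does not spell out. The sketch below shows one simple heuristic that could fill that role: tokens that look like snake_case or camelCase identifiers count as code-like, and longer queries without any code-like token count as natural language. The function `classify_query`, the regular expression, and the three-token threshold are all assumptions; only the weight values come from the strategy above.

```python
import re

# A token that is a valid identifier in most programming languages.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Weights taken from the adaptive strategy above.
WEIGHTS = {
    "keyword": {"exact": 0.5, "fuzzy": 0.3, "vector": 0.2},
    "semantic": {"exact": 0.2, "fuzzy": 0.2, "vector": 0.6},
    "hybrid": {"exact": 0.4, "fuzzy": 0.3, "vector": 0.3},
}


def looks_like_code(token):
    """True for snake_case or camelCase style identifiers."""
    if not IDENTIFIER.match(token):
        return False
    return "_" in token or any(ch.isupper() for ch in token[1:])


def classify_query(query):
    """Classify a query as 'keyword', 'semantic', or 'hybrid'."""
    tokens = query.split()
    if not tokens:
        return "hybrid"
    code_like = [token for token in tokens if looks_like_code(token)]
    if len(code_like) == len(tokens):
        return "keyword"  # every token is an identifier
    if len(tokens) >= 3 and not code_like:
        return "semantic"  # a phrase without identifiers
    return "hybrid"


weights = WEIGHTS[classify_query("how to verify user credentials")]
```

A heuristic like this is cheap to run on every query, but it would need tuning against real queries before the weights could be trusted.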
### 3.4 Long-Term Optimization: Performance and Quality
**Optimization 1: incremental embedding updates**
```python
class IncrementalEmbeddingUpdater:
    """Update embeddings incrementally for changed files."""

    def update_for_file(self, file_path: str, new_content: str):
        """Only regenerate embeddings for changed file."""
        # 1. Delete the old embeddings
        self.vector_store.delete_file_chunks(file_path)
        # 2. Generate new embeddings
        chunks = self.chunker.chunk(new_content, ...)
        for chunk in chunks:
            chunk.embedding = self.embedder.embed_single(chunk.content)
        # 3. Store the new embeddings
        self.vector_store.add_chunks(chunks, file_path)
```
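An incremental updater also needs a way to decide which files have changed. One common approach, sketched below, is to compare a content hash against the hash recorded at the last embedding run. This is an illustration under assumptions: the report does not say how changes would be detected, and `content_hash`, `files_needing_update`, and the stored-hash mapping are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_needing_update(current_files, stored_hashes):
    """Compare current content with previously stored hashes.

    current_files: mapping of path -> current content
    stored_hashes: mapping of path -> hash recorded at the last embedding run
    Returns (changed_or_new_paths, removed_paths), both sorted.
    """
    changed = sorted(
        path
        for path, content in current_files.items()
        if stored_hashes.get(path) != content_hash(content)
    )
    removed = sorted(path for path in stored_hashes if path not in current_files)
    return changed, removed


stored = {
    "a.py": content_hash("print('a')"),
    "b.py": content_hash("print('b')"),
    "old.py": content_hash("pass"),
}
current = {
    "a.py": "print('a')",        # unchanged
    "b.py": "print('b2')",       # edited
    "c.py": "print('c')",        # new file
}
changed, removed = files_needing_update(current, stored)
```

Only the paths in `changed` would need re-embedding, and the chunks of paths in `removed` could be deleted.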
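**Optimization 2: vector index compression**
```python
# Use quantization to reduce storage: 768 float32 values → 192 one-byte codes per vector
# Sketch with FAISS product quantization (PQ); `embeddings` is assumed to be a
# float32 numpy array of shape (n, 768) holding the stored vectors
import faiss
pq = faiss.ProductQuantizer(768, 192, 8)  # dimension, sub-quantizers, bits per code
pq.train(embeddings)  # learns the codebooks with k-means
codes = pq.compute_codes(embeddings)  # uint8 array of shape (n, 192)
```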
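**Optimization 3: faster vector search**
```python
# Use FAISS or Hnswlib instead of numpy brute-force search
import faiss
import numpy as np


class FAISSVectorStore(VectorStore):
    def __init__(self, db_path, dim=768):
        super().__init__(db_path)
        # Use an HNSW index (32 neighbors per node)
        self.index = faiss.IndexHNSWFlat(dim, 32)
        self._load_vectors_to_index()  # helper not shown

    def search_similar(self, query_embedding, top_k=10):
        # FAISS-accelerated approximate search (estimated 100x+ speedup)
        # FAISS expects a 2-D float32 array; IndexHNSWFlat returns L2 distances by default
        query = np.asarray([query_embedding], dtype=np.float32)
        distances, indices = self.index.search(query, top_k)
        return self._fetch_by_indices(indices[0], distances[0])  # helper not shown
```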
---
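## 4. Comparison Summary
### 4.1 Search Mode Comparison
| Dimension | Exact FTS | Fuzzy FTS | Vector Search | Hybrid (recommended) |
|------|-----------|-----------|---------------|--------------|
| **Match type** | Exact term match | Typo-tolerant match | Semantic similarity | Multi-mode fusion |
| **Query type** | Identifiers, keywords | Misspelled terms | Natural language | All types |
| **Recall** | Medium | High | Highest | Highest |
| **Precision** | High | Medium | Medium | High |
| **Latency** | 5-7 ms | 7-9 ms | 7-10 ms | 9-11 ms |
| **Dependencies** | SQLite only | SQLite only | fastembed + numpy | All |
| **Storage overhead** | Small (FTS index) | Small (FTS index) | Large (vectors) | Large (FTS + vectors) |
| **Best for** | Code search | Typo-tolerant search | Concept search | General search |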
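### 4.2 Recommended Usage
**Scenario 1: code identifier search** (function, class, and variable names)
```bash
codexlens search "authenticate_user" --mode exact
```
→ Use exact mode: fastest and most precise
**Scenario 2: conceptual search** ("how to verify a user's identity")
```bash
codexlens search "how to verify user credentials" --mode hybrid
```
→ Use hybrid mode: combines semantic and keyword matching
**Scenario 3: typo-tolerant search** (misspellings allowed)
```bash
codexlens search "autheticate" --mode fuzzy
```
→ Use fuzzy mode: trigram-based typo tolerance
**Scenario 4: pure semantic search** (requires generated embeddings and the `pure-vector` mode proposed in Section 3.2)
```bash
codexlens search "password encryption with salt" --mode pure-vector
```
→ Use pure-vector mode: matches on semantic intent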
---
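## 5. Implementation Checklist
### Immediate Actions (P0)
- [ ] Install the semantic search dependencies: `pip install "codexlens[semantic]"`
- [ ] Run the embedding generation script (see Section 3.1)
- [ ] Verify that the `semantic_chunks` table exists and contains data
- [ ] Test that vector mode search returns results
### Short-Term Improvements (P1)
- [ ] Add the `pure_vector` parameter (see Section 3.2)
- [ ] Update the CLI to support the pure-vector mode
- [ ] Add progress reporting to embedding generation
- [ ] Update the documentation: search mode usage guide
### Medium-Term Optimizations (P2)
- [ ] Implement the hybrid chunking strategy (see Section 3.3)
- [ ] Add query expansion
- [ ] Implement adaptive weight adjustment
- [ ] Run performance benchmarks
### Long-Term Plans (P3)
- [ ] Incremental embedding updates
- [ ] Vector index compression
- [ ] FAISS integration for faster search
- [ ] Multi-modal search (code + documentation)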
---
## 6. References
### Source Files
- Hybrid search engine: `codex-lens/src/codexlens/search/hybrid_search.py`
- Vector store: `codex-lens/src/codexlens/semantic/vector_store.py`
- Embedder: `codex-lens/src/codexlens/semantic/embedder.py`
- Code chunker: `codex-lens/src/codexlens/semantic/chunker.py`
- Chain search: `codex-lens/src/codexlens/search/chain_search.py`
### Test Files
- Comparison test: `codex-lens/tests/test_search_comparison.py`
- Hybrid search E2E: `codex-lens/tests/test_hybrid_search_e2e.py`
- CLI test: `codex-lens/tests/test_cli_hybrid_search.py`
### Related Modules
- RRF algorithm: `codex-lens/src/codexlens/search/ranking.py`
- Query parsing: `codex-lens/src/codexlens/search/query_parser.py`
- Configuration: `codex-lens/src/codexlens/config.py`
---
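## 7. Conclusion
This analysis identified the strengths of the CodexLens search system and the areas that need work:
**Strengths**:
1. ✓ Well-designed parallel architecture (two layers)
2. ✓ Sound implementation of the RRF fusion algorithm
3. ✓ Efficient vector store (numpy vectorization + caching)
4. ✓ Modular design that is easy to extend
**Areas for improvement**:
1. Embedding generation has to be triggered manually
2. The semantics of "vector mode" are unclear (it actually includes exact search)
3. The chunking strategy can be improved (hybrid strategy)
4. There is no incremental update mechanism
**Key recommendations**:
1. **Immediate**: generate vector embeddings to fix the empty-result problem
2. **Short term**: add a pure vector mode to clarify the semantics
3. **Medium term**: improve the chunking and query strategies to raise search quality
4. **Long term**: performance optimization and advanced features
With these improvements in place, CodexLens search should reach production-grade quality and performance.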
---
**Report completed**: 2025-12-16
**Analysis methods**: static code analysis + experimental testing + performance measurement
**Next step**: implement the P0 priority items