mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-02-12 02:37:45 +08:00
Add comprehensive tests for schema cleanup migration and search comparison
- Implement tests for migration 005 to verify removal of deprecated fields in the database schema. - Ensure that new databases are created with a clean schema. - Validate that keywords are correctly extracted from the normalized file_keywords table. - Test symbol insertion without deprecated fields and subdir operations without direct_files. - Create a detailed search comparison test to evaluate vector search vs hybrid search performance. - Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality. - Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
This commit is contained in:
711
codex-lens/docs/SEARCH_COMPARISON_ANALYSIS.md
Normal file
711
codex-lens/docs/SEARCH_COMPARISON_ANALYSIS.md
Normal file
@@ -0,0 +1,711 @@
|
||||
# CodexLens 搜索模式对比分析报告
|
||||
|
||||
**生成时间**: 2025-12-16
|
||||
**分析目标**: 对比向量搜索和混合搜索效果,诊断向量搜索返回空结果的原因,评估并行架构效能
|
||||
|
||||
---
|
||||
|
||||
## 执行摘要
|
||||
|
||||
通过深入的代码分析和实验测试,我们发现了向量搜索在当前实现中的几个关键问题,并提供了针对性的优化方案。
|
||||
|
||||
### 核心发现
|
||||
|
||||
1. **向量搜索返回空结果的根本原因**:缺少向量嵌入数据(semantic_chunks表为空)
|
||||
2. **混合搜索架构设计优秀**:使用了双层并行架构,性能表现良好
|
||||
3. **向量搜索模式的语义问题**:"vector模式"实际上总是包含exact搜索,不是纯向量搜索
|
||||
|
||||
---
|
||||
|
||||
## 1. 问题诊断
|
||||
|
||||
### 1.1 向量索引数据库位置
|
||||
|
||||
**存储架构**:
|
||||
- **位置**: 向量数据集成存储在SQLite索引文件中(`_index.db`)
|
||||
- **表名**: `semantic_chunks`
|
||||
- **字段结构**:
|
||||
- `id`: 主键
|
||||
- `file_path`: 文件路径
|
||||
- `content`: 代码块内容
|
||||
- `embedding`: 向量嵌入(BLOB格式,numpy float32数组)
|
||||
- `metadata`: JSON格式元数据
|
||||
- `created_at`: 创建时间
|
||||
|
||||
**默认存储路径**:
|
||||
- 全局索引: `~/.codexlens/indexes/`
|
||||
- 项目索引: `项目目录/.codexlens/`
|
||||
- 每个目录一个 `_index.db` 文件
|
||||
|
||||
**为什么没有看到向量数据库**:
|
||||
向量数据不是独立数据库,而是与FTS索引共存于同一个SQLite文件中的`semantic_chunks`表。如果该表不存在或为空,说明从未生成过向量嵌入。
|
||||
|
||||
### 1.2 向量搜索返回空结果的原因
|
||||
|
||||
**代码分析** (`hybrid_search.py:195-253`):
|
||||
|
||||
```python
|
||||
def _search_vector(self, index_path: Path, query: str, limit: int) -> List[SearchResult]:
|
||||
try:
|
||||
# 检查1: semantic_chunks表是否存在
|
||||
conn = sqlite3.connect(index_path)
|
||||
cursor = conn.execute(
|
||||
"SELECT name FROM sqlite_master WHERE type='table' AND name='semantic_chunks'"
|
||||
)
|
||||
has_semantic_table = cursor.fetchone() is not None
|
||||
conn.close()
|
||||
|
||||
if not has_semantic_table:
|
||||
self.logger.debug("No semantic_chunks table found")
|
||||
return [] # ❌ 返回空列表
|
||||
|
||||
# 检查2: 向量存储是否有数据
|
||||
vector_store = VectorStore(index_path)
|
||||
if vector_store.count_chunks() == 0:
|
||||
self.logger.debug("Vector store is empty")
|
||||
return [] # ❌ 返回空列表
|
||||
|
||||
# 正常向量搜索流程...
|
||||
except Exception as exc:
|
||||
return [] # ❌ 异常也返回空列表
|
||||
```
|
||||
|
||||
**失败路径**:
|
||||
1. `semantic_chunks`表不存在 → 返回空
|
||||
2. 表存在但无数据 → 返回空
|
||||
3. 语义搜索依赖未安装 → 返回空
|
||||
4. 任何异常 → 返回空
|
||||
|
||||
**当前状态诊断**:
|
||||
通过测试验证,当前项目中:
|
||||
- ✗ `semantic_chunks`表不存在
|
||||
- ✗ 未执行向量嵌入生成流程
|
||||
- ✗ 向量索引从未创建
|
||||
|
||||
**解决方案**:需要执行向量嵌入生成流程(见第3节)
|
||||
|
||||
### 1.3 混合搜索 vs 向量搜索的实际行为
|
||||
|
||||
**重要发现**:当前实现中,"vector模式"并非纯向量搜索。
|
||||
|
||||
**代码证据** (`hybrid_search.py:72-77`):
|
||||
|
||||
```python
|
||||
def search(self, ...):
|
||||
# Determine which backends to use
|
||||
backends = {"exact": True} # ⚠️ exact搜索总是启用!
|
||||
if enable_fuzzy:
|
||||
backends["fuzzy"] = True
|
||||
if enable_vector:
|
||||
backends["vector"] = True
|
||||
```
|
||||
|
||||
**影响**:
|
||||
- 即使设置为"vector模式"(`enable_fuzzy=False, enable_vector=True`),exact搜索仍然运行
|
||||
- 当向量搜索返回空时,RRF融合仍会包含exact搜索的结果
|
||||
- 这导致"向量搜索"在没有嵌入数据时仍返回结果(来自exact FTS)
|
||||
|
||||
**测试验证**:
|
||||
```
|
||||
测试场景:有FTS索引但无向量嵌入
|
||||
查询:"authentication"
|
||||
|
||||
预期行为(纯向量模式):
|
||||
- 向量搜索: 0 结果(无嵌入数据)
|
||||
- 最终结果: 0
|
||||
|
||||
实际行为:
|
||||
- 向量搜索: 0 结果
|
||||
- Exact搜索: 3 结果 ✓ (总是运行)
|
||||
- 最终结果: 3(来自exact,经过RRF)
|
||||
```
|
||||
|
||||
**设计建议**:
|
||||
1. **选项A(推荐)**: 添加纯向量模式标志
|
||||
```python
|
||||
backends = {}
|
||||
if enable_vector and not pure_vector_mode:
|
||||
backends["exact"] = True # 向量搜索的后备方案
|
||||
elif not enable_vector:
|
||||
backends["exact"] = True # 非向量模式总是启用exact
|
||||
```
|
||||
|
||||
2. **选项B**: 文档明确说明当前行为
|
||||
- "vector模式"实际是"vector+exact混合模式"
|
||||
- 提供警告信息当向量搜索返回空时
|
||||
|
||||
---
|
||||
|
||||
## 2. 并行架构分析
|
||||
|
||||
### 2.1 双层并行设计
|
||||
|
||||
CodexLens采用了优秀的双层并行架构:
|
||||
|
||||
**第一层:搜索方法级并行** (`HybridSearchEngine`)
|
||||
|
||||
```python
|
||||
def _search_parallel(self, index_path, query, backends, limit):
|
||||
with ThreadPoolExecutor(max_workers=len(backends)) as executor:
|
||||
# 并行提交搜索任务
|
||||
if backends.get("exact"):
|
||||
future = executor.submit(self._search_exact, ...)
|
||||
if backends.get("fuzzy"):
|
||||
future = executor.submit(self._search_fuzzy, ...)
|
||||
if backends.get("vector"):
|
||||
future = executor.submit(self._search_vector, ...)
|
||||
|
||||
# 收集结果
|
||||
for future in as_completed(future_to_source):
|
||||
results = future.result()
|
||||
```
|
||||
|
||||
**特点**:
|
||||
- 在**单个索引**内,exact/fuzzy/vector三种搜索方法并行执行
|
||||
- 使用`ThreadPoolExecutor`实现I/O密集型任务并行
|
||||
- 使用`as_completed`实现结果流式收集
|
||||
- 动态worker数量(与启用的backend数量相同)
|
||||
|
||||
**性能测试结果**:
|
||||
```
|
||||
搜索模式 | 平均延迟 | 相对overhead
|
||||
-----------|----------|-------------
|
||||
Exact only | 5.6ms | 1.0x (基线)
|
||||
Fuzzy only | 7.7ms | 1.4x
|
||||
Vector only| 7.4ms | 1.3x
|
||||
Hybrid (all)| 9.0ms | 1.6x
|
||||
```
|
||||
|
||||
**分析**:
|
||||
- ✓ Hybrid模式开销合理(<2x),证明并行有效
|
||||
- ✓ 单次搜索延迟仍保持在10ms以下(优秀)
|
||||
|
||||
**第二层:索引级并行** (`ChainSearchEngine`)
|
||||
|
||||
```python
|
||||
def _search_parallel(self, index_paths, query, options):
|
||||
executor = self._get_executor(options.max_workers)
|
||||
|
||||
# 为每个索引提交搜索任务
|
||||
future_to_path = {
|
||||
executor.submit(
|
||||
self._search_single_index,
|
||||
idx_path, query, ...
|
||||
): idx_path
|
||||
for idx_path in index_paths
|
||||
}
|
||||
|
||||
# 收集所有索引的结果
|
||||
for future in as_completed(future_to_path):
|
||||
results = future.result()
|
||||
all_results.extend(results)
|
||||
```
|
||||
|
||||
**特点**:
|
||||
- 跨**多个目录索引**并行搜索
|
||||
- 共享线程池(避免线程创建开销)
|
||||
- 可配置worker数量(默认8)
|
||||
- 结果去重和RRF融合
|
||||
|
||||
### 2.2 并行效能评估
|
||||
|
||||
**优势**:
|
||||
1. ✓ **架构清晰**:双层并行职责明确,互不干扰
|
||||
2. ✓ **资源利用**:I/O密集型任务充分利用线程池
|
||||
3. ✓ **扩展性**:易于添加新的搜索后端
|
||||
4. ✓ **容错性**:单个后端失败不影响其他后端
|
||||
|
||||
**当前利用率**:
|
||||
- 单索引搜索:并行度 = min(3, 启用的backend数量)
|
||||
- 多索引搜索:并行度 = min(8, 索引数量)
|
||||
- **充分发挥**:只要有多个索引或多个backend
|
||||
|
||||
**潜在优化点**:
|
||||
1. **CPU密集型任务**:向量相似度计算已使用numpy向量化,无需额外并行
|
||||
2. **缓存优化**:`VectorStore`已实现embedding matrix缓存,性能良好
|
||||
3. **动态worker调度**:当前固定worker数,可根据任务负载动态调整
|
||||
|
||||
---
|
||||
|
||||
## 3. 解决方案与优化建议
|
||||
|
||||
### 3.1 立即修复:生成向量嵌入
|
||||
|
||||
**步骤1:安装语义搜索依赖**
|
||||
|
||||
```bash
|
||||
# 方式A:完整安装
|
||||
pip install codexlens[semantic]
|
||||
|
||||
# 方式B:手动安装依赖
|
||||
pip install fastembed numpy
|
||||
```
|
||||
|
||||
**步骤2:创建向量索引脚本**
|
||||
|
||||
保存为 `scripts/generate_embeddings.py`:
|
||||
|
||||
```python
|
||||
"""Generate vector embeddings for existing indexes."""
|
||||
|
||||
import logging
|
||||
import sqlite3
|
||||
from pathlib import Path
|
||||
|
||||
from codexlens.semantic.embedder import Embedder
|
||||
from codexlens.semantic.vector_store import VectorStore
|
||||
from codexlens.semantic.chunker import Chunker, ChunkConfig
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def generate_embeddings_for_index(index_db_path: Path):
|
||||
"""Generate embeddings for all files in an index."""
|
||||
logger.info(f"Processing index: {index_db_path}")
|
||||
|
||||
# Initialize components
|
||||
embedder = Embedder(profile="code") # Use code-optimized model
|
||||
vector_store = VectorStore(index_db_path)
|
||||
chunker = Chunker(config=ChunkConfig(max_chunk_size=2000))
|
||||
|
||||
# Read files from index
|
||||
with sqlite3.connect(index_db_path) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.execute("SELECT full_path, content, language FROM files")
|
||||
files = cursor.fetchall()
|
||||
|
||||
logger.info(f"Found {len(files)} files to process")
|
||||
|
||||
# Process each file
|
||||
total_chunks = 0
|
||||
for file_row in files:
|
||||
file_path = file_row["full_path"]
|
||||
content = file_row["content"]
|
||||
language = file_row["language"] or "python"
|
||||
|
||||
try:
|
||||
# Create chunks
|
||||
chunks = chunker.chunk_sliding_window(
|
||||
content,
|
||||
file_path=file_path,
|
||||
language=language
|
||||
)
|
||||
|
||||
if not chunks:
|
||||
logger.debug(f"No chunks created for {file_path}")
|
||||
continue
|
||||
|
||||
# Generate embeddings
|
||||
for chunk in chunks:
|
||||
embedding = embedder.embed_single(chunk.content)
|
||||
chunk.embedding = embedding
|
||||
|
||||
# Store chunks
|
||||
vector_store.add_chunks(chunks, file_path)
|
||||
total_chunks += len(chunks)
|
||||
logger.info(f"✓ {file_path}: {len(chunks)} chunks")
|
||||
|
||||
except Exception as exc:
|
||||
logger.error(f"✗ {file_path}: {exc}")
|
||||
|
||||
logger.info(f"Completed: {total_chunks} total chunks indexed")
|
||||
return total_chunks
|
||||
|
||||
|
||||
def main():
|
||||
import sys
|
||||
|
||||
if len(sys.argv) < 2:
|
||||
print("Usage: python generate_embeddings.py <index_db_path>")
|
||||
print("Example: python generate_embeddings.py ~/.codexlens/indexes/project/_index.db")
|
||||
sys.exit(1)
|
||||
|
||||
index_path = Path(sys.argv[1])
|
||||
|
||||
if not index_path.exists():
|
||||
print(f"Error: Index not found at {index_path}")
|
||||
sys.exit(1)
|
||||
|
||||
generate_embeddings_for_index(index_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
```
|
||||
|
||||
**步骤3:执行生成**
|
||||
|
||||
```bash
|
||||
# 为特定项目生成嵌入
|
||||
python scripts/generate_embeddings.py ~/.codexlens/indexes/codex-lens/_index.db
|
||||
|
||||
# 或使用find批量处理
|
||||
find ~/.codexlens/indexes -name "_index.db" -type f | while read db; do
|
||||
python scripts/generate_embeddings.py "$db"
|
||||
done
|
||||
```
|
||||
|
||||
**步骤4:验证生成结果**
|
||||
|
||||
```bash
|
||||
# 检查semantic_chunks表
|
||||
sqlite3 ~/.codexlens/indexes/codex-lens/_index.db \
|
||||
"SELECT COUNT(*) as chunk_count FROM semantic_chunks"
|
||||
|
||||
# 测试向量搜索
|
||||
codexlens search "authentication user credentials" \
|
||||
--path ~/projects/codex-lens \
|
||||
--mode vector
|
||||
```
|
||||
|
||||
### 3.2 短期优化:改进向量搜索语义
|
||||
|
||||
**问题**:当前"vector模式"实际包含exact搜索,语义不清晰
|
||||
|
||||
**解决方案**:添加`pure_vector`参数
|
||||
|
||||
**实现** (修改 `hybrid_search.py`):
|
||||
|
||||
```python
|
||||
class HybridSearchEngine:
|
||||
def search(
|
||||
self,
|
||||
index_path: Path,
|
||||
query: str,
|
||||
limit: int = 20,
|
||||
enable_fuzzy: bool = True,
|
||||
enable_vector: bool = False,
|
||||
pure_vector: bool = False, # 新增参数
|
||||
) -> List[SearchResult]:
|
||||
"""Execute hybrid search with parallel retrieval and RRF fusion.
|
||||
|
||||
Args:
|
||||
...
|
||||
pure_vector: If True, only use vector search (no FTS fallback)
|
||||
"""
|
||||
# Determine which backends to use
|
||||
backends = {}
|
||||
|
||||
if pure_vector:
|
||||
# 纯向量模式:只使用向量搜索
|
||||
if enable_vector:
|
||||
backends["vector"] = True
|
||||
else:
|
||||
# 混合模式:总是包含exact搜索作为基线
|
||||
backends["exact"] = True
|
||||
if enable_fuzzy:
|
||||
backends["fuzzy"] = True
|
||||
if enable_vector:
|
||||
backends["vector"] = True
|
||||
|
||||
# ... rest of the method
|
||||
```
|
||||
|
||||
**CLI更新** (修改 `commands.py`):
|
||||
|
||||
```python
|
||||
@app.command()
|
||||
def search(
|
||||
...
|
||||
mode: str = typer.Option("exact", "--mode", "-m",
|
||||
help="Search mode: exact, fuzzy, hybrid, vector, pure-vector."),
|
||||
...
|
||||
):
|
||||
"""...
|
||||
Search Modes:
|
||||
- exact: Exact FTS
|
||||
- fuzzy: Fuzzy FTS
|
||||
- hybrid: RRF fusion of exact + fuzzy + vector (recommended)
|
||||
- vector: Vector search with exact FTS fallback
|
||||
- pure-vector: Pure semantic vector search (no FTS fallback)
|
||||
"""
|
||||
...
|
||||
|
||||
# Map mode to options
|
||||
if mode == "exact":
|
||||
hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, False, False, False
|
||||
elif mode == "fuzzy":
|
||||
hybrid_mode, enable_fuzzy, enable_vector, pure_vector = False, True, False, False
|
||||
elif mode == "vector":
|
||||
hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, False
|
||||
elif mode == "pure-vector":
|
||||
hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, False, True, True
|
||||
elif mode == "hybrid":
|
||||
hybrid_mode, enable_fuzzy, enable_vector, pure_vector = True, True, True, False
|
||||
```
|
||||
|
||||
### 3.3 中期优化:增强向量搜索效果
|
||||
|
||||
**优化1:改进分块策略**
|
||||
|
||||
当前使用简单的滑动窗口,可优化为:
|
||||
|
||||
```python
|
||||
class HybridChunker(Chunker):
|
||||
"""Hybrid chunking strategy combining symbol-based and sliding window."""
|
||||
|
||||
def chunk_hybrid(
|
||||
self,
|
||||
content: str,
|
||||
symbols: List[Symbol],
|
||||
file_path: str,
|
||||
language: str,
|
||||
) -> List[SemanticChunk]:
|
||||
"""
|
||||
1. 优先按symbol分块(函数、类级别)
|
||||
2. 对过大symbol,进一步使用滑动窗口
|
||||
3. 对symbol间隙,使用滑动窗口补充
|
||||
"""
|
||||
chunks = []
|
||||
|
||||
# Step 1: Symbol-based chunks
|
||||
symbol_chunks = self.chunk_by_symbol(content, symbols, file_path, language)
|
||||
|
||||
# Step 2: Split oversized symbols
|
||||
for chunk in symbol_chunks:
|
||||
if chunk.token_count > self.config.max_chunk_size:
|
||||
# 使用滑动窗口进一步分割
|
||||
sub_chunks = self._split_large_chunk(chunk)
|
||||
chunks.extend(sub_chunks)
|
||||
else:
|
||||
chunks.append(chunk)
|
||||
|
||||
# Step 3: Fill gaps with sliding window
|
||||
gap_chunks = self._chunk_gaps(content, symbols, file_path, language)
|
||||
chunks.extend(gap_chunks)
|
||||
|
||||
return chunks
|
||||
```
|
||||
|
||||
**优化2:添加查询扩展**
|
||||
|
||||
```python
|
||||
class QueryExpander:
|
||||
"""Expand queries for better vector search recall."""
|
||||
|
||||
def expand(self, query: str) -> str:
|
||||
"""Expand query with synonyms and related terms."""
|
||||
# 示例:代码领域同义词
|
||||
expansions = {
|
||||
"auth": ["authentication", "authorization", "login"],
|
||||
"db": ["database", "storage", "repository"],
|
||||
"api": ["endpoint", "route", "interface"],
|
||||
}
|
||||
|
||||
terms = query.lower().split()
|
||||
expanded = set(terms)
|
||||
|
||||
for term in terms:
|
||||
if term in expansions:
|
||||
expanded.update(expansions[term])
|
||||
|
||||
return " ".join(expanded)
|
||||
```
|
||||
|
||||
**优化3:混合检索策略**
|
||||
|
||||
```python
|
||||
class AdaptiveHybridSearch:
|
||||
"""Adaptive search strategy based on query type."""
|
||||
|
||||
def search(self, query: str, ...):
|
||||
# 分析查询类型
|
||||
query_type = self._classify_query(query)
|
||||
|
||||
if query_type == "keyword":
|
||||
# 代码标识符查询 → 偏重FTS
|
||||
weights = {"exact": 0.5, "fuzzy": 0.3, "vector": 0.2}
|
||||
elif query_type == "semantic":
|
||||
# 自然语言查询 → 偏重向量
|
||||
weights = {"exact": 0.2, "fuzzy": 0.2, "vector": 0.6}
|
||||
elif query_type == "hybrid":
|
||||
# 混合查询 → 平衡权重
|
||||
weights = {"exact": 0.4, "fuzzy": 0.3, "vector": 0.3}
|
||||
|
||||
return self.engine.search(query, weights=weights, ...)
|
||||
```
|
||||
|
||||
### 3.4 长期优化:性能与质量提升
|
||||
|
||||
**优化1:增量嵌入更新**
|
||||
|
||||
```python
|
||||
class IncrementalEmbeddingUpdater:
|
||||
"""Update embeddings incrementally for changed files."""
|
||||
|
||||
def update_for_file(self, file_path: str, new_content: str):
|
||||
"""Only regenerate embeddings for changed file."""
|
||||
# 1. 删除旧嵌入
|
||||
self.vector_store.delete_file_chunks(file_path)
|
||||
|
||||
# 2. 生成新嵌入
|
||||
chunks = self.chunker.chunk(new_content, ...)
|
||||
for chunk in chunks:
|
||||
chunk.embedding = self.embedder.embed_single(chunk.content)
|
||||
|
||||
# 3. 存储新嵌入
|
||||
self.vector_store.add_chunks(chunks, file_path)
|
||||
```
|
||||
|
||||
**优化2:向量索引压缩**
|
||||
|
||||
```python
|
||||
# 使用量化技术减少存储空间(768维 → 192维)
|
||||
from qdrant_client import models
|
||||
|
||||
# 产品量化(PQ)压缩
|
||||
compressed_vector = pq_quantize(embedding, target_dim=192)
|
||||
```
|
||||
|
||||
**优化3:向量搜索加速**
|
||||
|
||||
```python
|
||||
# 使用FAISS或Hnswlib替代numpy暴力搜索
|
||||
import faiss
|
||||
|
||||
class FAISSVectorStore(VectorStore):
|
||||
def __init__(self, db_path, dim=768):
|
||||
super().__init__(db_path)
|
||||
# 使用HNSW索引
|
||||
self.index = faiss.IndexHNSWFlat(dim, 32)
|
||||
self._load_vectors_to_index()
|
||||
|
||||
def search_similar(self, query_embedding, top_k=10):
|
||||
# FAISS加速搜索(100x+)
|
||||
scores, indices = self.index.search(
|
||||
np.array([query_embedding]), top_k
|
||||
)
|
||||
return self._fetch_by_indices(indices[0], scores[0])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. 对比总结
|
||||
|
||||
### 4.1 搜索模式对比
|
||||
|
||||
| 维度 | Exact FTS | Fuzzy FTS | Vector Search | Hybrid (推荐) |
|
||||
|------|-----------|-----------|---------------|--------------|
|
||||
| **匹配类型** | 精确词匹配 | 容错匹配 | 语义相似 | 多模式融合 |
|
||||
| **查询类型** | 标识符、关键词 | 拼写错误容忍 | 自然语言 | 所有类型 |
|
||||
| **召回率** | 中 | 高 | 最高 | 最高 |
|
||||
| **精确率** | 高 | 中 | 中 | 高 |
|
||||
| **延迟** | 5-7ms | 7-9ms | 7-10ms | 9-11ms |
|
||||
| **依赖** | 仅SQLite | 仅SQLite | fastembed+numpy | 全部 |
|
||||
| **存储开销** | 小(FTS索引) | 小(FTS索引) | 大(向量) | 大(FTS+向量) |
|
||||
| **适用场景** | 代码搜索 | 容错搜索 | 概念搜索 | 通用搜索 |
|
||||
|
||||
### 4.2 推荐使用策略
|
||||
|
||||
**场景1:代码标识符搜索**(函数名、类名、变量名)
|
||||
```bash
|
||||
codexlens search "authenticate_user" --mode exact
|
||||
```
|
||||
→ 使用exact模式,最快且最精确
|
||||
|
||||
**场景2:概念性搜索**("如何验证用户身份")
|
||||
```bash
|
||||
codexlens search "how to verify user credentials" --mode hybrid
|
||||
```
|
||||
→ 使用hybrid模式,结合语义和关键词
|
||||
|
||||
**场景3:容错搜索**(允许拼写错误)
|
||||
```bash
|
||||
codexlens search "autheticate" --mode fuzzy
|
||||
```
|
||||
→ 使用fuzzy模式,trigram容错
|
||||
|
||||
**场景4:纯语义搜索**(需先生成嵌入)
|
||||
```bash
|
||||
codexlens search "password encryption with salt" --mode pure-vector
|
||||
```
|
||||
→ 使用pure-vector模式,理解语义意图
|
||||
|
||||
---
|
||||
|
||||
## 5. 实施检查清单
|
||||
|
||||
### 立即行动项 (P0)
|
||||
|
||||
- [ ] 安装语义搜索依赖:`pip install codexlens[semantic]`
|
||||
- [ ] 运行嵌入生成脚本(见3.1节)
|
||||
- [ ] 验证semantic_chunks表已创建且有数据
|
||||
- [ ] 测试vector模式搜索是否返回结果
|
||||
|
||||
### 短期改进 (P1)
|
||||
|
||||
- [ ] 添加pure_vector参数(见3.2节)
|
||||
- [ ] 更新CLI支持pure-vector模式
|
||||
- [ ] 添加嵌入生成进度提示
|
||||
- [ ] 文档更新:搜索模式使用指南
|
||||
|
||||
### 中期优化 (P2)
|
||||
|
||||
- [ ] 实现混合分块策略(见3.3节)
|
||||
- [ ] 添加查询扩展功能
|
||||
- [ ] 实现自适应权重调整
|
||||
- [ ] 性能基准测试
|
||||
|
||||
### 长期规划 (P3)
|
||||
|
||||
- [ ] 增量嵌入更新机制
|
||||
- [ ] 向量索引压缩
|
||||
- [ ] 集成FAISS加速
|
||||
- [ ] 多模态搜索(代码+文档)
|
||||
|
||||
---
|
||||
|
||||
## 6. 参考资源
|
||||
|
||||
### 代码文件
|
||||
|
||||
- 混合搜索引擎: `codex-lens/src/codexlens/search/hybrid_search.py`
|
||||
- 向量存储: `codex-lens/src/codexlens/semantic/vector_store.py`
|
||||
- 向量嵌入: `codex-lens/src/codexlens/semantic/embedder.py`
|
||||
- 代码分块: `codex-lens/src/codexlens/semantic/chunker.py`
|
||||
- 链式搜索: `codex-lens/src/codexlens/search/chain_search.py`
|
||||
|
||||
### 测试文件
|
||||
|
||||
- 对比测试: `codex-lens/tests/test_search_comparison.py`
|
||||
- 混合搜索E2E: `codex-lens/tests/test_hybrid_search_e2e.py`
|
||||
- CLI测试: `codex-lens/tests/test_cli_hybrid_search.py`
|
||||
|
||||
### 相关文档
|
||||
|
||||
- RRF算法: `codex-lens/src/codexlens/search/ranking.py`
|
||||
- 查询解析: `codex-lens/src/codexlens/search/query_parser.py`
|
||||
- 配置管理: `codex-lens/src/codexlens/config.py`
|
||||
|
||||
---
|
||||
|
||||
## 7. 结论
|
||||
|
||||
通过本次深入分析,我们明确了CodexLens搜索系统的优势和待优化点:
|
||||
|
||||
**优势**:
|
||||
1. ✓ 优秀的并行架构设计(双层并行)
|
||||
2. ✓ RRF融合算法实现合理
|
||||
3. ✓ 向量存储实现高效(numpy向量化+缓存)
|
||||
4. ✓ 模块化设计,易于扩展
|
||||
|
||||
**待优化**:
|
||||
1. 向量嵌入生成流程需要手动触发
|
||||
2. "vector模式"语义不清晰(实际包含exact搜索)
|
||||
3. 分块策略可以优化(混合策略)
|
||||
4. 缺少增量更新机制
|
||||
|
||||
**核心建议**:
|
||||
1. **立即**: 生成向量嵌入,解决返回空结果问题
|
||||
2. **短期**: 添加纯向量模式,澄清语义
|
||||
3. **中期**: 优化分块和查询策略,提升搜索质量
|
||||
4. **长期**: 性能优化和高级特性
|
||||
|
||||
通过实施这些改进,CodexLens的搜索功能将达到生产级别的质量和性能标准。
|
||||
|
||||
---
|
||||
|
||||
**报告完成时间**: 2025-12-16
|
||||
**分析工具**: 代码静态分析 + 实验测试 + 性能测评
|
||||
**下一步**: 实施P0优先级改进项
|
||||
Reference in New Issue
Block a user