catlog22 df23975a0b Add comprehensive tests for schema cleanup migration and search comparison

- Implement tests for migration 005 to verify removal of deprecated fields in the database schema.
- Ensure that new databases are created with a clean schema.
- Validate that keywords are correctly extracted from the normalized file_keywords table.
- Test symbol insertion without deprecated fields and subdir operations without direct_files.
- Create a detailed search comparison test to evaluate vector search vs hybrid search performance.
- Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality.
- Include a test script to check TreeSitter parser availability and relationship extraction from sample files.

2025-12-16 19:27:05 +08:00


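CodexLens Search Analysis - Executive Summary

🎯 Key Findings

Question 1: Why does vector search return empty results?

Root cause: the vector embedding data does not exist

✗ The semantic_chunks table has not been created
✗ The embedding generation pipeline has never been run
✗ The vector index "database" is actually a table inside SQLite, not a standalone file

Location: vector data is stored in the semantic_chunks table of ~/.codexlens/indexes/<project-name>/_index.db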

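Question 2: Where is the vector index database?

Storage layout:

~/.codexlens/indexes/
└── project-name/
    └── _index.db          ← SQLite database
        ├── files          ← file index table
        ├── files_fts      ← FTS5 full-text index
        ├── files_fts_fuzzy ← fuzzy search index
        └── semantic_chunks ← vector embedding table (currently missing!)

Not a separate database: vector data lives inside the SQLite index file rather than in a dedicated vector database.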
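Whether the embedding table is present can be checked programmatically. The following is a minimal sketch that assumes only the table name semantic_chunks given in this report; the helper name embedding_status is hypothetical and not part of CodexLens.

```python
import sqlite3
from pathlib import Path


def embedding_status(index_db_path: Path) -> tuple[bool, int]:
    """Return (table_exists, chunk_count) for the semantic_chunks table."""
    conn = sqlite3.connect(index_db_path)
    try:
        # sqlite_master lists every table defined in the database file.
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            ("semantic_chunks",),
        ).fetchone()
        if row is None:
            return False, 0
        count = conn.execute("SELECT COUNT(*) FROM semantic_chunks").fetchone()[0]
        return True, count
    finally:
        conn.close()
```

Note that sqlite3.connect creates an empty file if the path does not exist, so it is worth confirming the path to _index.db before calling this helper.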

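Question 3: Does the current architecture actually benefit from parallelism?

✓ Yes. The architecture is well designed for parallelism.

Two-level parallelism:
- Level 1: within a single index, the exact, fuzzy, and vector search methods run in parallel
- Level 2: searches run in parallel across multiple directory indexes
Performance: hybrid mode takes only about 1.6x as long as exact search (9ms vs 5.6ms)
Resource usage: ThreadPoolExecutor makes full use of I/O concurrency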
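The two levels can be illustrated with a short sketch built on the standard library's ThreadPoolExecutor. This is not the CodexLens implementation: the function names, the Backend signature, and the use of plain strings for indexes and hits are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical backend signature: (index_name, query) -> list of hits.
Backend = Callable[[str, str], list[str]]


def search_one_index(
    index: str, query: str, backends: dict[str, Backend]
) -> dict[str, list[str]]:
    """Level 1: run every backend for a single index concurrently."""
    with ThreadPoolExecutor(max_workers=len(backends) or 1) as pool:
        futures = {name: pool.submit(fn, index, query) for name, fn in backends.items()}
        return {name: fut.result() for name, fut in futures.items()}


def search_all_indexes(
    indexes: list[str], query: str, backends: dict[str, Backend]
) -> dict[str, dict[str, list[str]]]:
    """Level 2: fan the per-index search out across all directory indexes."""
    with ThreadPoolExecutor(max_workers=len(indexes) or 1) as pool:
        futures = {
            idx: pool.submit(search_one_index, idx, query, backends) for idx in indexes
        }
        return {idx: fut.result() for idx, fut in futures.items()}
```

Threads suit this workload because each backend spends most of its time waiting on SQLite I/O, so total latency tends toward that of the slowest backend rather than the sum of all of them.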

⚡ Quick Fix

Fix vector search right away

Step 1: Install dependencies

pip install "codexlens[semantic]"
# or
pip install fastembed numpy

Step 2: Generate vector embeddings

Create the script generate_embeddings.py:

import sqlite3
import sys
from pathlib import Path

from codexlens.semantic.chunker import Chunker, ChunkConfig
from codexlens.semantic.embedder import Embedder
from codexlens.semantic.vector_store import VectorStore


def generate_embeddings(index_db_path: Path) -> None:
    """Chunk every indexed file, embed each chunk, and store the results."""
    embedder = Embedder(profile="code")
    vector_store = VectorStore(index_db_path)
    chunker = Chunker(config=ChunkConfig(max_chunk_size=2000))

    # Load all indexed files from the existing index database.
    conn = sqlite3.connect(index_db_path)
    try:
        conn.row_factory = sqlite3.Row
        files = conn.execute("SELECT full_path, content FROM files").fetchall()
    finally:
        conn.close()

    for file_row in files:
        # Note: the language is hard-coded; adjust it for non-Python files.
        chunks = chunker.chunk_sliding_window(
            file_row["content"],
            file_path=file_row["full_path"],
            language="python",
        )
        for chunk in chunks:
            chunk.embedding = embedder.embed_single(chunk.content)
        if chunks:
            vector_store.add_chunks(chunks, file_row["full_path"])


if __name__ == "__main__":
    # Usage: python generate_embeddings.py <path to _index.db>
    generate_embeddings(Path(sys.argv[1]).expanduser())

Step 3: Run the generation

python generate_embeddings.py ~/.codexlens/indexes/codex-lens/_index.db

Step 4: Verify

# Check the data
sqlite3 ~/.codexlens/indexes/codex-lens/_index.db \
    "SELECT COUNT(*) FROM semantic_chunks"

# Test the search
codexlens search "authentication credentials" --mode vector

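🔍 Key Insight

Finding: vector mode is not pure vector search

Current behavior:

# hybrid_search.py:73
backends = {"exact": True}  # ⚠️ exact search is always enabled!
if enable_vector:
    backends["vector"] = True

Impact:

"Vector mode" is really a vector + exact hybrid mode
Even when vector search returns nothing, exact FTS results are still returned
This is why "vector search" produces results even when no embeddings exist

Suggested fix: add a pure_vector parameter to support true vector-only search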
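One possible shape for the suggested fix is sketched below. The function name select_backends and the enable_fuzzy flag are assumptions for illustration; only the backends dictionary, enable_vector, and the proposed pure_vector parameter come from this report.

```python
def select_backends(
    enable_fuzzy: bool = True,
    enable_vector: bool = False,
    pure_vector: bool = False,
) -> dict[str, bool]:
    """Choose which search backends to run for a query."""
    if pure_vector:
        # Vector-only: skip the exact FTS fallback entirely, so an empty
        # result signals missing embeddings instead of being masked.
        return {"vector": True}

    backends = {"exact": True}  # current behavior: exact is always enabled
    if enable_fuzzy:
        backends["fuzzy"] = True
    if enable_vector:
        backends["vector"] = True
    return backends
```

Keeping pure_vector off by default would preserve the current behavior for existing callers while letting diagnostics request vector results alone.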

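📊 Search Mode Comparison

Mode	Latency	Recall	Best for	Requires embeddings
exact	5.6ms	Medium	Code identifiers	✗
fuzzy	7.7ms	High	Typo-tolerant search	✗
vector	7.4ms	Highest	Semantic search	✓
hybrid	9.0ms	Highest	General-purpose search	✓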

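📈 Optimization Roadmap

P0 - Immediate (this week)

Generate vector embeddings
Verify that vector search works
Update the usage documentation

P1 - Short term (2 weeks)

Add a pure_vector mode
Incremental embedding updates
Improve error messages

P2 - Medium term (1-2 months)

Hybrid chunking strategy
Query expansion
Adaptive weights

P3 - Long term (3-6 months)

FAISS acceleration
Vector compression
Multimodal search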

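📚 Detailed Documentation

Full analysis report: SEARCH_COMPARISON_ANALYSIS.md

Contents:

Detailed problem diagnosis
In-depth architecture analysis
Complete solution
Code examples
Implementation checklist

🎓 Key Takeaways

Vector search requires generating embeddings explicitly: they are not created automatically
The two-level parallel architecture works well: no further optimization is needed
The RRF fusion algorithm works well: results from multiple sources are merged sensibly
Vector mode is not pure vector search: it includes FTS as a fallback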
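For readers unfamiliar with RRF (reciprocal rank fusion), the standard unweighted form gives each result a score of 1 / (k + rank) per source and sums the scores across sources. The sketch below illustrates that general technique; the constant k = 60 is the conventional default, and neither it nor the function name is confirmed to match the CodexLens implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked result lists from several backends into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        # Ranks are 1-based; a result missing from a source contributes nothing.
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = reciprocal_rank_fusion(
    {"exact": ["a.py", "b.py"], "vector": ["b.py", "c.py"]}
)
```

Because a source that returns nothing adds no score, an empty vector result leaves the exact ranking unchanged, which is consistent with the fallback behavior described above.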

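💡 Next Steps

# 1. Install dependencies
pip install "codexlens[semantic]"

# 2. Create the index (if it does not exist yet)
codexlens init ~/projects/your-project

# 3. Generate embeddings
python generate_embeddings.py ~/.codexlens/indexes/your-project/_index.db

# 4. Test the search
codexlens search "your natural language query" --mode hybrid

Problem resolution: ✓ Root cause identified and a fix provided
Architecture assessment: ✓ The parallel architecture is sound and uses its concurrency well
Optimization advice: ✓ Short-, medium-, and long-term roadmap provided

Reference: see SEARCH_COMPARISON_ANALYSIS.md for the full technical details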
