mirror of https://github.com/catlog22/Claude-Code-Workflow.git synced 2026-02-04 01:40:45 +08:00

Files

catlog22 74a830694c Fix CodexLens embeddings generation to achieve 100% coverage

Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files).
This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage
with 2,042 semantic chunks across all 303 files in 26 index databases.

Key improvements:

1. **Recursive embeddings generation** (embedding_manager.py):
   - Add generate_embeddings_recursive() to process all _index.db files in directory tree
   - Add get_embeddings_status() for comprehensive coverage statistics
   - Add discover_all_index_dbs() helper for recursive file discovery

2. **Enhanced CLI commands** (commands.py):
   - embeddings-generate: Add --recursive flag for full project coverage
   - init: Use recursive generation by default for complete indexing
   - status: Display embeddings coverage statistics with 50% threshold

3. **Smart search routing improvements** (smart-search.ts):
   - Add 50% embeddings coverage threshold for hybrid mode routing
   - Auto-fallback to exact mode when coverage insufficient
   - Strip ANSI color codes from JSON output for correct parsing
   - Add embeddings_coverage_percent to IndexStatus and SearchMetadata
   - Provide clear warnings with actionable suggestions

4. **Documentation and analysis**:
   - Add SMART_SEARCH_ANALYSIS.md with initial investigation
   - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue
   - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary
   - Add check_embeddings.py script for coverage verification

Results:
- Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase
- Semantic chunks increased from 10 to 2,042 - 204x increase
- All 26 subdirectory indexes now have embeddings vs just 1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 17:54:33 +08:00

5.6 KiB

Raw Permalink Blame History

Smart Search 索引分析报告

问题

分析当前 smart_search(action="init") 是否进行了向量模型索引，还是仅进行了基础索引。

分析结果

1. Init 操作的默认行为

从代码分析来看，smart_search(action="init") 的行为如下：

代码路径：ccw/src/tools/smart-search.ts → ccw/src/tools/codex-lens.ts

// smart-search.ts: executeInitAction (第 297-323 行)
async function executeInitAction(params: Params): Promise<SearchResult> {
  const { path = '.', languages } = params;
  const args = ['init', path];
  if (languages && languages.length > 0) {
    args.push('--languages', languages.join(','));
  }
  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
  // ...
}

关键发现：

smart_search(action="init") 调用 codexlens init 命令
不传递 --no-embeddings 参数
不传递 --embedding-model 参数

2. CodexLens Init 的默认行为

根据 codexlens init --help 的输出：

If semantic search dependencies are installed, automatically generates embeddings after indexing completes. Use --no-embeddings to skip this step.

结论：

✅ init 命令默认会生成 embeddings（如果安装了语义搜索依赖）
❌ 当前实现未生成所有文件的 embeddings

3. 实际测试结果

第一次 Init（未生成 embeddings）

$ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
# 结果：索引了 303 个文件，但 vector_search: false

原因分析：虽然语义搜索依赖（fastembed）已安装，但 init 过程中遇到警告：

Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.

手动生成 Embeddings 后

$ python -m codexlens embeddings-generate . --force --verbose

Processing 5 files...
- D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
- D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
- D:\Claude_dms3\ccw\README.md: 2 chunks
- D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
- D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks

Total: 10 chunks, 5 files
Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)

关键发现：

⚠️ 只为 5 个文档/配置文件生成了 embeddings
⚠️ 未为 298 个代码文件（.ts, .js 等）生成 embeddings
✅ Embeddings 状态显示 coverage_percent: 100.0（但这是针对"应该生成 embeddings 的文件"而言）

Hybrid Search 测试

$ smart_search(query="authentication and authorization patterns", mode="hybrid")
# ✅ 成功返回 5 个结果，带有相似度分数
# ✅ 证明向量搜索功能可用

4. 索引类型对比

索引类型	当前状态	支持的文件	说明
Exact FTS	✅ 启用	所有 303 个文件	基于 SQLite FTS5 的全文搜索
Fuzzy FTS	❌ 未启用	-	模糊匹配搜索
Vector Search	⚠️ 部分启用	仅 5 个文档文件	基于 fastembed 的语义搜索
Hybrid Search	⚠️ 部分启用	仅 5 个文档文件	RRF 融合（exact + fuzzy + vector）

5. 为什么只有 5 个文件有 Embeddings？

可能的原因：

文件类型过滤：CodexLens 可能只为文档文件（.md）和配置文件生成 embeddings
代码文件使用符号索引：代码文件（.ts, .js）可能依赖于符号提取而非文本 embeddings
性能考虑：生成 300+ 文件的 embeddings 需要大量时间和存储空间

6. 结论

当前 `smart_search(action="init")` 的行为：

✅ 会尝试生成向量索引（如果语义依赖已安装）
⚠️ 实际只为文档/配置文件生成 embeddings（5/303 文件）
✅ 支持 hybrid 模式搜索（对于有 embeddings 的文件）
✅ 支持 exact 模式搜索（对于所有 303 个文件）

搜索模式智能路由：

用户查询 → auto 模式 → 决策树：
  ├─ 自然语言查询 + 有 embeddings → hybrid 模式（RRF 融合）
  ├─ 简单查询 + 有索引 → exact 模式（FTS）
  └─ 无索引 → ripgrep 模式（字面匹配）

7. 建议

如果需要完整的语义搜索支持：

# 方案 1：检查是否所有代码文件都应该有 embeddings
python -m codexlens embeddings-status . --verbose

# 方案 2：明确为代码文件生成 embeddings（如果支持）
# 需要查看 CodexLens 文档确认代码文件的语义索引策略

# 方案 3：使用 hybrid 模式进行文档搜索，exact 模式进行代码搜索
smart_search(query="架构设计", mode="hybrid")  # 文档语义搜索
smart_search(query="function_name", mode="exact")  # 代码精确搜索

当前最佳实践：

// 1. 初始化索引（一次性）
smart_search(action="init", path=".")

// 2. 智能搜索（推荐使用 auto 模式）
smart_search(query="your query")  // 自动选择最佳模式

// 3. 特定模式搜索
smart_search(query="natural language query", mode="hybrid")  // 语义搜索
smart_search(query="exact_identifier", mode="exact")         // 精确匹配
smart_search(query="quick literal", mode="ripgrep")          // 快速字面搜索

8. 技术细节

Embeddings 模型

模型：jinaai/jina-embeddings-v2-base-code
维度：768
大小：~150MB
后端：fastembed (ONNX-based)

索引存储

位置：C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db
大小：122.57 MB
Schema 版本：5
文件数：303
目录数：26

生成时间：2025-12-17
CodexLens 版本：从当前安装中检测

5.6 KiB Raw Permalink Blame History Unescape Escape