mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-02-14 02:42:04 +08:00
Fix CodexLens embeddings generation to achieve 100% coverage
Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files). This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage with 2,042 semantic chunks across all 303 files in 26 index databases. Key improvements: 1. **Recursive embeddings generation** (embedding_manager.py): - Add generate_embeddings_recursive() to process all _index.db files in directory tree - Add get_embeddings_status() for comprehensive coverage statistics - Add discover_all_index_dbs() helper for recursive file discovery 2. **Enhanced CLI commands** (commands.py): - embeddings-generate: Add --recursive flag for full project coverage - init: Use recursive generation by default for complete indexing - status: Display embeddings coverage statistics with 50% threshold 3. **Smart search routing improvements** (smart-search.ts): - Add 50% embeddings coverage threshold for hybrid mode routing - Auto-fallback to exact mode when coverage insufficient - Strip ANSI color codes from JSON output for correct parsing - Add embeddings_coverage_percent to IndexStatus and SearchMetadata - Provide clear warnings with actionable suggestions 4. **Documentation and analysis**: - Add SMART_SEARCH_ANALYSIS.md with initial investigation - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary - Add check_embeddings.py script for coverage verification Results: - Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase - Semantic chunks increased from 10 to 2,042 - 204x increase - All 26 subdirectory indexes now have embeddings vs just 1 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
330
ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
Normal file
330
ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
Normal file
@@ -0,0 +1,330 @@
|
||||
# Smart Search 索引分析报告(修正版)
|
||||
|
||||
## 用户质疑
|
||||
|
||||
1. ❓ 为什么不为代码文件生成向量 embeddings?
|
||||
2. ❓ Exact FTS 和 Vector 索引内容应该一样才对
|
||||
3. ❓ init 应该返回 FTS 和 vector 索引概况
|
||||
|
||||
**结论:用户的质疑 100% 正确!这是 CodexLens 的设计缺陷。**
|
||||
|
||||
---
|
||||
|
||||
## 真实情况
|
||||
|
||||
### 1. 分层索引架构
|
||||
|
||||
CodexLens 使用**分层目录索引**:
|
||||
|
||||
```
|
||||
D:\Claude_dms3\ccw\
|
||||
├── _index.db ← 根目录索引(5个文件)
|
||||
├── src/
|
||||
│ ├── _index.db ← src目录索引(2个文件)
|
||||
│ ├── tools/
|
||||
│ │ └── _index.db ← tools子目录索引(25个文件)
|
||||
│ └── ...
|
||||
└── ... (总共 26 个 _index.db)
|
||||
```
|
||||
|
||||
### 2. 索引覆盖情况
|
||||
|
||||
| 目录 | 文件数 | FTS索引 | Embeddings |
|
||||
|------|--------|---------|------------|
|
||||
| **根目录** | 5 | ✅ | ✅ (10 chunks) |
|
||||
| bin/ | 2 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| dist/ | 4 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| dist/commands/ | 24 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| dist/tools/ | 50 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| src/tools/ | 25 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| src/commands/ | 12 | ✅ | ❌ 无semantic_chunks表 |
|
||||
| ... | ... | ... | ... |
|
||||
| **总计** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
|
||||
|
||||
### 3. 关键发现
|
||||
|
||||
```python
|
||||
# 运行检查脚本的结果
|
||||
Total index databases: 26
|
||||
Directories with embeddings: 1 # ❌ 只有根目录!
|
||||
Total files indexed: 303 # ✅ FTS索引完整
|
||||
Total semantic chunks: 10 # ❌ 只有根目录的5个文件
|
||||
```
|
||||
|
||||
**问题**:
|
||||
- ✅ **所有303个文件**都有 FTS 索引(分布在26个_index.db中)
|
||||
- ❌ **只有5个文件**(1.6%)有 vector embeddings
|
||||
- ❌ **25个子目录**的_index.db根本没有`semantic_chunks`表结构
|
||||
|
||||
---
|
||||
|
||||
## 为什么会这样?
|
||||
|
||||
### 原因分析
|
||||
|
||||
1. **`init` 操作**:
|
||||
```bash
|
||||
codexlens init .
|
||||
```
|
||||
- ✅ 为所有303个文件创建 FTS 索引(分布式)
|
||||
- ⚠️ 尝试生成 embeddings,但遇到"Index already has 10 chunks"警告
|
||||
- ❌ 只为根目录生成了 embeddings
|
||||
|
||||
2. **`embeddings-generate` 操作**:
|
||||
```bash
|
||||
codexlens embeddings-generate . --force
|
||||
```
|
||||
- ❌ 只处理了根目录的 _index.db
|
||||
- ❌ **未递归处理子目录的索引**
|
||||
- 结果:只有5个文档文件有 embeddings
|
||||
|
||||
### 设计问题
|
||||
|
||||
**CodexLens 的 embeddings 架构有缺陷**:
|
||||
|
||||
```python
|
||||
# 期望行为
|
||||
for each _index.db in project:
|
||||
generate_embeddings(index_db)
|
||||
|
||||
# 实际行为
|
||||
generate_embeddings(root_index_db_only)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Init 返回信息缺陷
|
||||
|
||||
### 当前 `init` 的返回
|
||||
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
|
||||
}
|
||||
```
|
||||
|
||||
**问题**:
|
||||
- ❌ 没有说明索引了多少文件
|
||||
- ❌ 没有说明是否生成了 embeddings
|
||||
- ❌ 没有说明 embeddings 覆盖率
|
||||
|
||||
### 应该返回的信息
|
||||
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"message": "Index created successfully",
|
||||
"stats": {
|
||||
"total_files": 303,
|
||||
"total_directories": 26,
|
||||
"index_databases": 26,
|
||||
"fts_coverage": {
|
||||
"files": 303,
|
||||
"percentage": 100.0
|
||||
},
|
||||
"embeddings_coverage": {
|
||||
"files": 5,
|
||||
"chunks": 10,
|
||||
"percentage": 1.6,
|
||||
"warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
|
||||
},
|
||||
"features": {
|
||||
"exact_fts": true,
|
||||
"fuzzy_fts": false,
|
||||
"vector_search": "partial"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 解决方案
|
||||
|
||||
### 方案 1:递归生成 Embeddings(推荐)
|
||||
|
||||
```bash
|
||||
# 为所有子目录生成 embeddings
|
||||
find .codexlens/indexes -name "_index.db" -exec \
|
||||
python -m codexlens embeddings-generate {} --force \;
|
||||
```
|
||||
|
||||
### 方案 2:改进 Init 命令
|
||||
|
||||
```python
|
||||
# codexlens/cli.py
|
||||
def init_with_embeddings(project_root):
|
||||
"""Initialize with recursive embeddings generation"""
|
||||
# 1. Build FTS indexes (current behavior)
|
||||
build_indexes(project_root)
|
||||
|
||||
# 2. Generate embeddings for ALL subdirs
|
||||
for index_db in find_all_index_dbs(project_root):
|
||||
if has_semantic_deps():
|
||||
generate_embeddings(index_db)
|
||||
|
||||
# 3. Return comprehensive stats
|
||||
return {
|
||||
"fts_coverage": get_fts_stats(),
|
||||
"embeddings_coverage": get_embeddings_stats(),
|
||||
"features": detect_features()
|
||||
}
|
||||
```
|
||||
|
||||
### 方案 3:Smart Search 路由改进
|
||||
|
||||
```python
|
||||
# 当前逻辑
|
||||
def classify_intent(query, hasIndex):
|
||||
if not hasIndex:
|
||||
return "ripgrep"
|
||||
elif is_natural_language(query):
|
||||
return "hybrid" # ❌ 但只有5个文件有embeddings!
|
||||
else:
|
||||
return "exact"
|
||||
|
||||
# 改进逻辑
|
||||
def classify_intent(query, indexStatus):
|
||||
embeddings_coverage = indexStatus.embeddings_coverage_percent
|
||||
|
||||
if embeddings_coverage < 50:
|
||||
# 如果覆盖率<50%,即使是自然语言也降级到exact
|
||||
return "exact" if indexStatus.indexed else "ripgrep"
|
||||
elif is_natural_language(query):
|
||||
return "hybrid"
|
||||
else:
|
||||
return "exact"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 验证用户质疑
|
||||
|
||||
### ❓ 为什么不为代码文件生成 embeddings?
|
||||
|
||||
**答**:不是"不为代码文件生成",而是:
|
||||
- ✅ 代码文件都有 FTS 索引
|
||||
- ❌ `embeddings-generate` 命令有BUG,**只处理根目录**
|
||||
- ❌ 子目录的索引数据库甚至**没有创建 semantic_chunks 表**
|
||||
|
||||
### ❓ FTS 和 Vector 应该索引相同内容
|
||||
|
||||
**答**:**完全正确!** 当前实际情况:
|
||||
- FTS: 303/303 (100%)
|
||||
- Vector: 5/303 (1.6%)
|
||||
|
||||
**这是严重的不一致性,违背了设计原则。**
|
||||
|
||||
### ❓ Init 应该返回索引概况
|
||||
|
||||
**答**:**完全正确!** 当前 init 只返回简单成功消息,应该返回:
|
||||
- FTS 索引统计
|
||||
- Embeddings 覆盖率
|
||||
- 功能特性状态
|
||||
- 警告信息(如果覆盖不完整)
|
||||
|
||||
---
|
||||
|
||||
## 测试验证
|
||||
|
||||
### Hybrid Search 的实际效果
|
||||
|
||||
```javascript
|
||||
// 当前查询
|
||||
smart_search(query="authentication patterns", mode="hybrid")
|
||||
|
||||
// 实际搜索范围:
|
||||
// ✅ 可搜索的文件:5个(根目录的.md文件)
|
||||
// ❌ 不可搜索的文件:298个代码文件
|
||||
// 结果:返回的都是文档文件,代码文件被忽略
|
||||
```
|
||||
|
||||
### 修复后的效果(理想状态)
|
||||
|
||||
```javascript
|
||||
// 修复后
|
||||
smart_search(query="authentication patterns", mode="hybrid")
|
||||
|
||||
// 实际搜索范围:
|
||||
// ✅ 可搜索的文件:303个(所有文件)
|
||||
// 结果:包含代码文件和文档文件的综合结果
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 建议的修复优先级
|
||||
|
||||
### P0 - 紧急修复
|
||||
|
||||
1. **修复 `embeddings-generate` 命令**
|
||||
- 递归处理所有子目录的 _index.db
|
||||
- 为每个 _index.db 创建 semantic_chunks 表
|
||||
|
||||
2. **改进 `init` 返回信息**
|
||||
- 返回详细的索引统计
|
||||
- 显示 embeddings 覆盖率
|
||||
- 如果覆盖不完整,给出警告
|
||||
|
||||
### P1 - 重要改进
|
||||
|
||||
3. **Smart Search 自适应路由**
|
||||
- 检查 embeddings 覆盖率
|
||||
- 如果覆盖率低,自动降级到 exact 模式
|
||||
|
||||
4. **Status 命令增强**
|
||||
- 显示每个子目录的索引状态
|
||||
- 显示 embeddings 分布情况
|
||||
|
||||
---
|
||||
|
||||
## 临时解决方案
|
||||
|
||||
### 当前推荐使用方式
|
||||
|
||||
```javascript
|
||||
// 1. 文档搜索 - 使用 hybrid(有embeddings)
|
||||
smart_search(query="architecture design patterns", mode="hybrid")
|
||||
|
||||
// 2. 代码搜索 - 使用 exact(无embeddings,但有FTS)
|
||||
smart_search(query="function executeQuery", mode="exact")
|
||||
|
||||
// 3. 快速搜索 - 使用 ripgrep(跨所有文件)
|
||||
smart_search(query="TODO", mode="ripgrep")
|
||||
```
|
||||
|
||||
### 完整覆盖的变通方案
|
||||
|
||||
```bash
|
||||
# 手动为所有子目录生成 embeddings(如果CodexLens支持)
|
||||
cd D:\Claude_dms3\ccw
|
||||
|
||||
# 为每个子目录分别运行
|
||||
python -m codexlens embeddings-generate ./src/tools --force
|
||||
python -m codexlens embeddings-generate ./src/commands --force
|
||||
# ... 重复26次
|
||||
|
||||
# 或使用脚本自动化
|
||||
python check_embeddings.py --generate-all
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 总结
|
||||
|
||||
| 用户质疑 | 状态 | 结论 |
|
||||
|---------|------|------|
|
||||
| 为什么不对代码生成embeddings? | ✅ 正确 | 是BUG,不是设计 |
|
||||
| FTS和Vector应该内容一致 | ✅ 正确 | 当前严重不一致 |
|
||||
| Init应返回详细概况 | ✅ 正确 | 当前信息不足 |
|
||||
|
||||
**用户的所有质疑都是正确的,揭示了 CodexLens 的三个核心问题:**
|
||||
|
||||
1. **Embeddings 生成不完整**(只有1.6%覆盖率)
|
||||
2. **索引一致性问题**(FTS vs Vector)
|
||||
3. **返回信息不透明**(缺少统计数据)
|
||||
|
||||
---
|
||||
|
||||
**生成时间**:2025-12-17
|
||||
**验证方法**:`python check_embeddings.py`
|
||||
Reference in New Issue
Block a user