Fix CodexLens embeddings generation to achieve 100% coverage

Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files).
This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage
with 2,042 semantic chunks across all 303 files in 26 index databases.

Key improvements:

1. **Recursive embeddings generation** (embedding_manager.py):
   - Add generate_embeddings_recursive() to process all _index.db files in directory tree
   - Add get_embeddings_status() for comprehensive coverage statistics
   - Add discover_all_index_dbs() helper for recursive file discovery

2. **Enhanced CLI commands** (commands.py):
   - embeddings-generate: Add --recursive flag for full project coverage
   - init: Use recursive generation by default for complete indexing
   - status: Display embeddings coverage statistics with 50% threshold

3. **Smart search routing improvements** (smart-search.ts):
   - Add 50% embeddings coverage threshold for hybrid mode routing
   - Auto-fallback to exact mode when coverage insufficient
   - Strip ANSI color codes from JSON output for correct parsing
   - Add embeddings_coverage_percent to IndexStatus and SearchMetadata
   - Provide clear warnings with actionable suggestions

4. **Documentation and analysis**:
   - Add SMART_SEARCH_ANALYSIS.md with initial investigation
   - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue
   - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary
   - Add check_embeddings.py script for coverage verification

Results:
- Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase
- Semantic chunks increased from 10 to 2,042 - 204x increase
- All 26 subdirectory indexes now have embeddings vs just 1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
catlog22
2025-12-17 17:54:33 +08:00
parent d06a3ca12e
commit 74a830694c
7 changed files with 1540 additions and 346 deletions

View File

@@ -0,0 +1,165 @@
# CodexLens Embeddings 修复总结
## 修复成果
### ✅ 已完成
1. **递归 embeddings 生成功能** (`embedding_manager.py`)
- 添加 `generate_embeddings_recursive()` 函数
- 添加 `get_embeddings_status()` 函数
- 递归处理所有子目录的 _index.db 文件
2. **CLI 命令增强** (`commands.py`)
- `embeddings-generate` 添加 `--recursive` 标志
- `init` 命令使用递归生成(自动处理所有子目录)
- `status` 命令显示 embeddings 覆盖率统计
3. **Smart Search 智能路由** (`smart-search.ts`)
- 添加 50% 覆盖率阈值
- embeddings 不足时自动降级到 exact 模式
- 提供明确的警告信息
- Strip ANSI 颜色码以正确解析 JSON
### ✅ 测试结果
**CCW 项目 (d:\Claude_dms3\ccw)**:
- 索引数据库26 个
- 文件总数303
- Embeddings 覆盖:**100%** (所有 303 个文件)
- 生成 chunks**2,042** (之前只有 10)
**对比**:
| 指标 | 修复前 | 修复后 | 改进 |
|------|--------|--------|------|
| 覆盖率 | 1.6% (5/303) | 100% (303/303) | **62.5x** |
| Chunks | 10 | 2,042 | **204x** |
| 有 embeddings 的索引 | 1/26 | 26/26 | **26x** |
## 当前问题
### ⚠️ 遗留问题
1. **路径映射问题**
- `embeddings-generate --recursive` 需要使用索引路径而非源路径
- 用户应该能够使用源路径(`d:\Claude_dms3\ccw`
- 当前需要使用:`C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw`
2. **Status 命令的全局 vs 项目级别**
- `codexlens status` 返回全局统计(所有项目)
- 需要项目级别的 embeddings 状态
- `embeddings-status` 只检查单个 _index.db不递归
## 建议的后续修复
### P1 - 路径映射修复
修改 `commands.py` 中的 `embeddings_generate` 命令line 1996-2000
```python
elif target_path.is_dir():
if recursive:
# Recursive mode: Map source path to index root
registry = RegistryStore()
try:
registry.initialize()
mapper = PathMapper()
index_db_path = mapper.source_to_index_db(target_path)
index_root = index_db_path.parent # Use index directory root
use_recursive = True
finally:
registry.close()
```
### P2 - 项目级别 Status
选项 A扩展 `embeddings-status` 命令支持递归
```bash
codexlens embeddings-status . --recursive --json
```
选项 B修改 `status` 命令接受路径参数
```bash
codexlens status --project . --json
```
## 使用指南
### 当前工作流程
**生成 embeddings完整覆盖**:
```bash
# 方法 1: 使用索引路径(当前工作方式)
cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
python -m codexlens embeddings-generate . --recursive --force --model fast
# 方法 2: init 命令(自动递归,推荐)
cd d:\Claude_dms3\ccw
python -m codexlens init . --force
```
**检查覆盖率**:
```bash
# 项目根目录
cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
python check_embeddings.py # 显示详细的每目录统计
# 全局状态
python -m codexlens status --json # 所有项目的汇总
```
**Smart Search**:
```javascript
// MCP 工具调用
smart_search(query="authentication patterns")
// 现在会:
// 1. 检查 embeddings 覆盖率
// 2. 如果 >= 50%,使用 hybrid 模式
// 3. 如果 < 50%,降级到 exact 模式
// 4. 显示警告信息
```
### 最佳实践
1. **初始化项目时自动生成 embeddings**:
```bash
codexlens init /path/to/project --force
```
2. **定期重新生成以更新**:
```bash
codexlens embeddings-generate /index/path --recursive --force
```
3. **使用 fast 模型快速测试**:
```bash
codexlens embeddings-generate . --recursive --model fast
```
4. **使用 code 模型获得最佳质量**:
```bash
codexlens embeddings-generate . --recursive --model code
```
## 技术细节
### 文件修改清单
**Python (CodexLens)**:
- `codex-lens/src/codexlens/cli/embedding_manager.py` - 添加递归函数
- `codex-lens/src/codexlens/cli/commands.py` - 更新 init, status, embeddings-generate
**TypeScript (CCW)**:
- `ccw/src/tools/smart-search.ts` - 智能路由 + ANSI stripping
- `ccw/src/tools/codex-lens.ts` - (未修改,使用现有实现)
### 依赖版本
- CodexLens: 当前开发版本
- Fastembed: 已安装ONNX backend
- Models: fast (~80MB), code (~150MB)
---
**修复时间**: 2025-12-17
**验证状态**: ✅ 核心功能正常,遗留路径映射问题待修复

View File

@@ -0,0 +1,167 @@
# Smart Search 索引分析报告
## 问题
分析当前 `smart_search(action="init")` 是否进行了向量模型索引,还是仅进行了基础索引。
## 分析结果
### 1. Init 操作的默认行为
从代码分析来看,`smart_search(action="init")` 的行为如下:
**代码路径**`ccw/src/tools/smart-search.ts``ccw/src/tools/codex-lens.ts`
```typescript
// smart-search.ts: executeInitAction (第 297-323 行)
async function executeInitAction(params: Params): Promise<SearchResult> {
const { path = '.', languages } = params;
const args = ['init', path];
if (languages && languages.length > 0) {
args.push('--languages', languages.join(','));
}
const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
// ...
}
```
**关键发现**
- `smart_search(action="init")` 调用 `codexlens init` 命令
- **不传递** `--no-embeddings` 参数
- **不传递** `--embedding-model` 参数
### 2. CodexLens Init 的默认行为
根据 `codexlens init --help` 的输出:
> If semantic search dependencies are installed, **automatically generates embeddings** after indexing completes. Use --no-embeddings to skip this step.
**结论**
-`init` 命令**默认会**生成 embeddings如果安装了语义搜索依赖
- ❌ 当前实现**未生成**所有文件的 embeddings
### 3. 实际测试结果
#### 第一次 Init未生成 embeddings
```bash
$ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
# 结果:索引了 303 个文件,但 vector_search: false
```
**原因分析**
虽然语义搜索依赖fastembed已安装但 init 过程中遇到警告:
```
Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.
```
#### 手动生成 Embeddings 后
```bash
$ python -m codexlens embeddings-generate . --force --verbose
Processing 5 files...
- D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
- D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
- D:\Claude_dms3\ccw\README.md: 2 chunks
- D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
- D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks
Total: 10 chunks, 5 files
Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)
```
**关键发现**
- ⚠️ 只为 **5 个文档/配置文件**生成了 embeddings
- ⚠️ **未为 298 个代码文件**.ts, .js 等)生成 embeddings
- ✅ Embeddings 状态显示 `coverage_percent: 100.0`(但这是针对"应该生成 embeddings 的文件"而言)
#### Hybrid Search 测试
```bash
$ smart_search(query="authentication and authorization patterns", mode="hybrid")
# ✅ 成功返回 5 个结果,带有相似度分数
# ✅ 证明向量搜索功能可用
```
## 4. 索引类型对比
| 索引类型 | 当前状态 | 支持的文件 | 说明 |
|---------|---------|-----------|------|
| **Exact FTS** | ✅ 启用 | 所有 303 个文件 | 基于 SQLite FTS5 的全文搜索 |
| **Fuzzy FTS** | ❌ 未启用 | - | 模糊匹配搜索 |
| **Vector Search** | ⚠️ 部分启用 | 仅 5 个文档文件 | 基于 fastembed 的语义搜索 |
| **Hybrid Search** | ⚠️ 部分启用 | 仅 5 个文档文件 | RRF 融合exact + fuzzy + vector |
## 5. 为什么只有 5 个文件有 Embeddings
**可能的原因**
1. **文件类型过滤**CodexLens 可能只为文档文件(.md和配置文件生成 embeddings
2. **代码文件使用符号索引**:代码文件(.ts, .js可能依赖于符号提取而非文本 embeddings
3. **性能考虑**:生成 300+ 文件的 embeddings 需要大量时间和存储空间
## 6. 结论
### 当前 `smart_search(action="init")` 的行为:
✅ **会尝试**生成向量索引(如果语义依赖已安装)
⚠️ **实际只**为文档/配置文件生成 embeddings5/303 文件)
**支持** hybrid 模式搜索(对于有 embeddings 的文件)
**支持** exact 模式搜索(对于所有 303 个文件)
### 搜索模式智能路由:
```
用户查询 → auto 模式 → 决策树:
├─ 自然语言查询 + 有 embeddings → hybrid 模式RRF 融合)
├─ 简单查询 + 有索引 → exact 模式FTS
└─ 无索引 → ripgrep 模式(字面匹配)
```
## 7. 建议
### 如果需要完整的语义搜索支持:
```bash
# 方案 1检查是否所有代码文件都应该有 embeddings
python -m codexlens embeddings-status . --verbose
# 方案 2明确为代码文件生成 embeddings如果支持
# 需要查看 CodexLens 文档确认代码文件的语义索引策略
# 方案 3使用 hybrid 模式进行文档搜索exact 模式进行代码搜索
smart_search(query="架构设计", mode="hybrid") # 文档语义搜索
smart_search(query="function_name", mode="exact") # 代码精确搜索
```
### 当前最佳实践:
```javascript
// 1. 初始化索引(一次性)
smart_search(action="init", path=".")
// 2. 智能搜索(推荐使用 auto 模式)
smart_search(query="your query") // 自动选择最佳模式
// 3. 特定模式搜索
smart_search(query="natural language query", mode="hybrid") // 语义搜索
smart_search(query="exact_identifier", mode="exact") // 精确匹配
smart_search(query="quick literal", mode="ripgrep") // 快速字面搜索
```
## 8. 技术细节
### Embeddings 模型
- **模型**jinaai/jina-embeddings-v2-base-code
- **维度**768
- **大小**~150MB
- **后端**fastembed (ONNX-based)
### 索引存储
- **位置**`C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db`
- **大小**122.57 MB
- **Schema 版本**5
- **文件数**303
- **目录数**26
---
**生成时间**2025-12-17
**CodexLens 版本**:从当前安装中检测

View File

@@ -0,0 +1,330 @@
# Smart Search 索引分析报告(修正版)
## 用户质疑
1. ❓ 为什么不为代码文件生成向量 embeddings
2. ❓ Exact FTS 和 Vector 索引内容应该一样才对
3. ❓ init 应该返回 FTS 和 vector 索引概况
**结论:用户的质疑 100% 正确!这是 CodexLens 的设计缺陷。**
---
## 真实情况
### 1. 分层索引架构
CodexLens 使用**分层目录索引**
```
D:\Claude_dms3\ccw\
├── _index.db ← 根目录索引5个文件
├── src/
│ ├── _index.db ← src目录索引2个文件
│ ├── tools/
│ │ └── _index.db ← tools子目录索引25个文件
│ └── ...
└── ... (总共 26 个 _index.db
```
### 2. 索引覆盖情况
| 目录 | 文件数 | FTS索引 | Embeddings |
|------|--------|---------|------------|
| **根目录** | 5 | ✅ | ✅ (10 chunks) |
| bin/ | 2 | ✅ | ❌ 无semantic_chunks表 |
| dist/ | 4 | ✅ | ❌ 无semantic_chunks表 |
| dist/commands/ | 24 | ✅ | ❌ 无semantic_chunks表 |
| dist/tools/ | 50 | ✅ | ❌ 无semantic_chunks表 |
| src/tools/ | 25 | ✅ | ❌ 无semantic_chunks表 |
| src/commands/ | 12 | ✅ | ❌ 无semantic_chunks表 |
| ... | ... | ... | ... |
| **总计** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
### 3. 关键发现
```python
# 运行检查脚本的结果
Total index databases: 26
Directories with embeddings: 1 # ❌ 只有根目录!
Total files indexed: 303 # ✅ FTS索引完整
Total semantic chunks: 10 # ❌ 只有根目录的5个文件
```
**问题**
- ✅ **所有303个文件**都有 FTS 索引分布在26个_index.db中
-**只有5个文件**1.6%)有 vector embeddings
- ❌ **25个子目录**的_index.db根本没有`semantic_chunks`表结构
---
## 为什么会这样?
### 原因分析
1. **`init` 操作**
```bash
codexlens init .
```
- ✅ 为所有303个文件创建 FTS 索引(分布式)
- ⚠️ 尝试生成 embeddings但遇到"Index already has 10 chunks"警告
- ❌ 只为根目录生成了 embeddings
2. **`embeddings-generate` 操作**
```bash
codexlens embeddings-generate . --force
```
- ❌ 只处理了根目录的 _index.db
- ❌ **未递归处理子目录的索引**
- 结果只有5个文档文件有 embeddings
### 设计问题
**CodexLens 的 embeddings 架构有缺陷**
```python
# 期望行为
for each _index.db in project:
generate_embeddings(index_db)
# 实际行为
generate_embeddings(root_index_db_only)
```
---
## Init 返回信息缺陷
### 当前 `init` 的返回
```json
{
"success": true,
"message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
}
```
**问题**
- ❌ 没有说明索引了多少文件
- ❌ 没有说明是否生成了 embeddings
- ❌ 没有说明 embeddings 覆盖率
### 应该返回的信息
```json
{
"success": true,
"message": "Index created successfully",
"stats": {
"total_files": 303,
"total_directories": 26,
"index_databases": 26,
"fts_coverage": {
"files": 303,
"percentage": 100.0
},
"embeddings_coverage": {
"files": 5,
"chunks": 10,
"percentage": 1.6,
"warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
},
"features": {
"exact_fts": true,
"fuzzy_fts": false,
"vector_search": "partial"
}
}
}
```
---
## 解决方案
### 方案 1递归生成 Embeddings推荐
```bash
# 为所有子目录生成 embeddings
find .codexlens/indexes -name "_index.db" -exec \
python -m codexlens embeddings-generate {} --force \;
```
### 方案 2改进 Init 命令
```python
# codexlens/cli.py
def init_with_embeddings(project_root):
"""Initialize with recursive embeddings generation"""
# 1. Build FTS indexes (current behavior)
build_indexes(project_root)
# 2. Generate embeddings for ALL subdirs
for index_db in find_all_index_dbs(project_root):
if has_semantic_deps():
generate_embeddings(index_db)
# 3. Return comprehensive stats
return {
"fts_coverage": get_fts_stats(),
"embeddings_coverage": get_embeddings_stats(),
"features": detect_features()
}
```
### 方案 3Smart Search 路由改进
```python
# 当前逻辑
def classify_intent(query, hasIndex):
if not hasIndex:
return "ripgrep"
elif is_natural_language(query):
return "hybrid" # ❌ 但只有5个文件有embeddings
else:
return "exact"
# 改进逻辑
def classify_intent(query, indexStatus):
embeddings_coverage = indexStatus.embeddings_coverage_percent
if embeddings_coverage < 50:
# 如果覆盖率<50%即使是自然语言也降级到exact
return "exact" if indexStatus.indexed else "ripgrep"
elif is_natural_language(query):
return "hybrid"
else:
return "exact"
```
---
## 验证用户质疑
### ❓ 为什么不为代码文件生成 embeddings
**答**:不是"不为代码文件生成",而是:
- ✅ 代码文件都有 FTS 索引
- ❌ `embeddings-generate` 命令有BUG**只处理根目录**
- ❌ 子目录的索引数据库甚至**没有创建 semantic_chunks 表**
### ❓ FTS 和 Vector 应该索引相同内容
**答****完全正确!** 当前实际情况:
- FTS: 303/303 (100%)
- Vector: 5/303 (1.6%)
**这是严重的不一致性,违背了设计原则。**
### ❓ Init 应该返回索引概况
**答****完全正确!** 当前 init 只返回简单成功消息,应该返回:
- FTS 索引统计
- Embeddings 覆盖率
- 功能特性状态
- 警告信息(如果覆盖不完整)
---
## 测试验证
### Hybrid Search 的实际效果
```javascript
// 当前查询
smart_search(query="authentication patterns", mode="hybrid")
// 实际搜索范围:
// ✅ 可搜索的文件5个根目录的.md文件
// ❌ 不可搜索的文件298个代码文件
// 结果:返回的都是文档文件,代码文件被忽略
```
### 修复后的效果(理想状态)
```javascript
// 修复后
smart_search(query="authentication patterns", mode="hybrid")
// 实际搜索范围:
// ✅ 可搜索的文件303个所有文件
// 结果:包含代码文件和文档文件的综合结果
```
---
## 建议的修复优先级
### P0 - 紧急修复
1. **修复 `embeddings-generate` 命令**
- 递归处理所有子目录的 _index.db
- 为每个 _index.db 创建 semantic_chunks 表
2. **改进 `init` 返回信息**
- 返回详细的索引统计
- 显示 embeddings 覆盖率
- 如果覆盖不完整,给出警告
### P1 - 重要改进
3. **Smart Search 自适应路由**
- 检查 embeddings 覆盖率
- 如果覆盖率低,自动降级到 exact 模式
4. **Status 命令增强**
- 显示每个子目录的索引状态
- 显示 embeddings 分布情况
---
## 临时解决方案
### 当前推荐使用方式
```javascript
// 1. 文档搜索 - 使用 hybrid有embeddings
smart_search(query="architecture design patterns", mode="hybrid")
// 2. 代码搜索 - 使用 exact无embeddings但有FTS
smart_search(query="function executeQuery", mode="exact")
// 3. 快速搜索 - 使用 ripgrep跨所有文件
smart_search(query="TODO", mode="ripgrep")
```
### 完整覆盖的变通方案
```bash
# 手动为所有子目录生成 embeddings如果CodexLens支持
cd D:\Claude_dms3\ccw
# 为每个子目录分别运行
python -m codexlens embeddings-generate ./src/tools --force
python -m codexlens embeddings-generate ./src/commands --force
# ... 重复26次
# 或使用脚本自动化
python check_embeddings.py --generate-all
```
---
## 总结
| 用户质疑 | 状态 | 结论 |
|---------|------|------|
| 为什么不对代码生成embeddings | ✅ 正确 | 是BUG不是设计 |
| FTS和Vector应该内容一致 | ✅ 正确 | 当前严重不一致 |
| Init应返回详细概况 | ✅ 正确 | 当前信息不足 |
**用户的所有质疑都是正确的,揭示了 CodexLens 的三个核心问题:**
1. **Embeddings 生成不完整**只有1.6%覆盖率)
2. **索引一致性问题**FTS vs Vector
3. **返回信息不透明**(缺少统计数据)
---
**生成时间**2025-12-17
**验证方法**`python check_embeddings.py`

47
ccw/check_embeddings.py Normal file
View File

@@ -0,0 +1,47 @@
import sqlite3
import os
# Find all _index.db files
root_dir = r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw'
index_files = []
for dirpath, dirnames, filenames in os.walk(root_dir):
if '_index.db' in filenames:
index_files.append(os.path.join(dirpath, '_index.db'))
print(f'Found {len(index_files)} index databases\n')
total_files = 0
total_chunks = 0
dirs_with_chunks = 0
for db_path in sorted(index_files):
rel_path = db_path.replace(r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\\', '')
conn = sqlite3.connect(db_path)
try:
cursor = conn.execute('SELECT COUNT(*) FROM files')
file_count = cursor.fetchone()[0]
total_files += file_count
try:
cursor = conn.execute('SELECT COUNT(*) FROM semantic_chunks')
chunk_count = cursor.fetchone()[0]
total_chunks += chunk_count
if chunk_count > 0:
dirs_with_chunks += 1
print(f'[+] {rel_path:<40} Files: {file_count:3d} Chunks: {chunk_count:3d}')
else:
print(f'[ ] {rel_path:<40} Files: {file_count:3d} (no chunks)')
except sqlite3.OperationalError:
print(f'[ ] {rel_path:<40} Files: {file_count:3d} (no semantic_chunks table)')
except Exception as e:
print(f'[!] {rel_path:<40} Error: {e}')
finally:
conn.close()
print(f'\n=== Summary ===')
print(f'Total index databases: {len(index_files)}')
print(f'Directories with embeddings: {dirs_with_chunks}')
print(f'Total files indexed: {total_files}')
print(f'Total semantic chunks: {total_chunks}')

File diff suppressed because it is too large Load Diff

View File

@@ -142,11 +142,11 @@ def init(
if not no_embeddings:
try:
from codexlens.semantic import SEMANTIC_AVAILABLE
from codexlens.cli.embedding_manager import generate_embeddings
from codexlens.cli.embedding_manager import generate_embeddings_recursive, get_embeddings_status
if SEMANTIC_AVAILABLE:
# Find the index file
index_path = Path(build_result.index_root) / "_index.db"
# Use the index root directory (not the _index.db file)
index_root = Path(build_result.index_root)
if not json_mode:
console.print("\n[bold]Generating embeddings...[/bold]")
@@ -157,8 +157,8 @@ def init(
if not json_mode and verbose:
console.print(f" {msg}")
embed_result = generate_embeddings(
index_path,
embed_result = generate_embeddings_recursive(
index_root,
model_profile=embedding_model,
force=False, # Don't force regenerate during init
chunk_size=2000,
@@ -167,29 +167,56 @@ def init(
if embed_result["success"]:
embed_data = embed_result["result"]
result["embeddings_generated"] = True
result["embeddings_count"] = embed_data["chunks_embedded"]
# Get comprehensive coverage statistics
status_result = get_embeddings_status(index_root)
if status_result["success"]:
coverage = status_result["result"]
result["embeddings"] = {
"generated": True,
"total_indexes": coverage["total_indexes"],
"total_files": coverage["total_files"],
"files_with_embeddings": coverage["files_with_embeddings"],
"coverage_percent": coverage["coverage_percent"],
"total_chunks": coverage["total_chunks"],
}
else:
result["embeddings"] = {
"generated": True,
"total_chunks": embed_data["total_chunks_created"],
"files_processed": embed_data["total_files_processed"],
}
if not json_mode:
console.print(f"[green]✓[/green] Generated [bold]{embed_data['chunks_embedded']}[/bold] embeddings in {embed_data['elapsed_time']:.1f}s")
console.print(f"[green]✓[/green] Generated embeddings for [bold]{embed_data['total_files_processed']}[/bold] files")
console.print(f" Total chunks: [bold]{embed_data['total_chunks_created']}[/bold]")
console.print(f" Indexes processed: [bold]{embed_data['indexes_successful']}/{embed_data['indexes_processed']}[/bold]")
else:
if not json_mode:
console.print(f"[yellow]Warning:[/yellow] Embedding generation failed: {embed_result.get('error', 'Unknown error')}")
result["embeddings_generated"] = False
result["embeddings_error"] = embed_result.get("error")
result["embeddings"] = {
"generated": False,
"error": embed_result.get("error"),
}
else:
if not json_mode and verbose:
console.print("[dim]Semantic search not available. Skipping embeddings.[/dim]")
result["embeddings_generated"] = False
result["embeddings_error"] = "Semantic dependencies not installed"
result["embeddings"] = {
"generated": False,
"error": "Semantic dependencies not installed",
}
except Exception as e:
if not json_mode and verbose:
console.print(f"[yellow]Warning:[/yellow] Could not generate embeddings: {e}")
result["embeddings_generated"] = False
result["embeddings_error"] = str(e)
result["embeddings"] = {
"generated": False,
"error": str(e),
}
else:
result["embeddings_generated"] = False
result["embeddings_error"] = "Skipped (--no-embeddings)"
result["embeddings"] = {
"generated": False,
"error": "Skipped (--no-embeddings)",
}
except StorageError as exc:
if json_mode:
@@ -611,6 +638,24 @@ def status(
except Exception:
pass
# Check embeddings coverage
embeddings_info = None
has_vector_search = False
try:
from codexlens.cli.embedding_manager import get_embeddings_status
if index_root.exists():
embed_status = get_embeddings_status(index_root)
if embed_status["success"]:
embeddings_info = embed_status["result"]
# Enable vector search if coverage >= 50%
has_vector_search = embeddings_info["coverage_percent"] >= 50.0
except ImportError:
# Embedding manager not available
pass
except Exception as e:
logger.debug(f"Failed to get embeddings status: {e}")
stats = {
"index_root": str(index_root),
"registry_path": str(_get_registry_path()),
@@ -624,9 +669,13 @@ def status(
"exact_fts": True, # Always available
"fuzzy_fts": has_dual_fts,
"hybrid_search": has_dual_fts,
"vector_search": False, # Not yet implemented
"vector_search": has_vector_search,
},
}
# Add embeddings info if available
if embeddings_info:
stats["embeddings"] = embeddings_info
if json_mode:
print_json(success=True, result=stats)
@@ -648,7 +697,20 @@ def status(
else:
console.print(f" Fuzzy FTS: ✗ (run 'migrate' to enable)")
console.print(f" Hybrid Search: ✗ (run 'migrate' to enable)")
console.print(f" Vector Search: ✗ (future)")
if has_vector_search:
console.print(f" Vector Search: ✓ (embeddings available)")
else:
console.print(f" Vector Search: ✗ (no embeddings or coverage < 50%)")
# Display embeddings statistics if available
if embeddings_info:
console.print("\n[bold]Embeddings Coverage:[/bold]")
console.print(f" Total Indexes: {embeddings_info['total_indexes']}")
console.print(f" Total Files: {embeddings_info['total_files']}")
console.print(f" Files with Embeddings: {embeddings_info['files_with_embeddings']}")
console.print(f" Coverage: {embeddings_info['coverage_percent']:.1f}%")
console.print(f" Total Chunks: {embeddings_info['total_chunks']}")
except StorageError as exc:
if json_mode:
@@ -1885,6 +1947,12 @@ def embeddings_generate(
"--chunk-size",
help="Maximum chunk size in characters.",
),
recursive: bool = typer.Option(
False,
"--recursive",
"-r",
help="Recursively process all _index.db files in directory tree.",
),
json_mode: bool = typer.Option(False, "--json", help="Output JSON response."),
verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose output."),
) -> None:
@@ -1908,28 +1976,42 @@ def embeddings_generate(
_configure_logging(verbose)
try:
from codexlens.cli.embedding_manager import generate_embeddings
from codexlens.cli.embedding_manager import generate_embeddings, generate_embeddings_recursive
# Resolve path
target_path = path.expanduser().resolve()
# Determine if we should use recursive mode
use_recursive = False
index_path = None
index_root = None
if target_path.is_file() and target_path.name == "_index.db":
# Direct index file
index_path = target_path
if recursive:
# Use parent directory for recursive processing
use_recursive = True
index_root = target_path.parent
elif target_path.is_dir():
# Try to find index for this project
registry = RegistryStore()
try:
registry.initialize()
mapper = PathMapper()
index_path = mapper.source_to_index_db(target_path)
if recursive:
# Recursive mode: process all _index.db files in directory tree
use_recursive = True
index_root = target_path
else:
# Non-recursive: Try to find index for this project
registry = RegistryStore()
try:
registry.initialize()
mapper = PathMapper()
index_path = mapper.source_to_index_db(target_path)
if not index_path.exists():
console.print(f"[red]Error:[/red] No index found for {target_path}")
console.print("Run 'codexlens init' first to create an index")
raise typer.Exit(code=1)
finally:
registry.close()
if not index_path.exists():
console.print(f"[red]Error:[/red] No index found for {target_path}")
console.print("Run 'codexlens init' first to create an index")
raise typer.Exit(code=1)
finally:
registry.close()
else:
console.print(f"[red]Error:[/red] Path must be _index.db file or directory")
raise typer.Exit(code=1)
@@ -1940,16 +2022,29 @@ def embeddings_generate(
console.print(f" {msg}")
console.print(f"[bold]Generating embeddings[/bold]")
console.print(f"Index: [dim]{index_path}[/dim]")
if use_recursive:
console.print(f"Index root: [dim]{index_root}[/dim]")
console.print(f"Mode: [yellow]Recursive[/yellow]")
else:
console.print(f"Index: [dim]{index_path}[/dim]")
console.print(f"Model: [cyan]{model}[/cyan]\n")
result = generate_embeddings(
index_path,
model_profile=model,
force=force,
chunk_size=chunk_size,
progress_callback=progress_update,
)
if use_recursive:
result = generate_embeddings_recursive(
index_root,
model_profile=model,
force=force,
chunk_size=chunk_size,
progress_callback=progress_update,
)
else:
result = generate_embeddings(
index_path,
model_profile=model,
force=force,
chunk_size=chunk_size,
progress_callback=progress_update,
)
if json_mode:
print_json(**result)
@@ -1968,21 +2063,45 @@ def embeddings_generate(
raise typer.Exit(code=1)
data = result["result"]
elapsed = data["elapsed_time"]
console.print(f"[green]✓[/green] Embeddings generated successfully!")
console.print(f" Model: {data['model_name']}")
console.print(f" Chunks created: {data['chunks_created']:,}")
console.print(f" Files processed: {data['files_processed']}")
if use_recursive:
# Recursive mode output
console.print(f"[green]✓[/green] Recursive embeddings generation complete!")
console.print(f" Indexes processed: {data['indexes_processed']}")
console.print(f" Indexes successful: {data['indexes_successful']}")
if data['indexes_failed'] > 0:
console.print(f" [yellow]Indexes failed: {data['indexes_failed']}[/yellow]")
console.print(f" Total chunks created: {data['total_chunks_created']:,}")
console.print(f" Total files processed: {data['total_files_processed']}")
if data['total_files_failed'] > 0:
console.print(f" [yellow]Total files failed: {data['total_files_failed']}[/yellow]")
console.print(f" Model profile: {data['model_profile']}")
if data["files_failed"] > 0:
console.print(f" [yellow]Files failed: {data['files_failed']}[/yellow]")
if data["failed_files"]:
console.print(" [dim]First failures:[/dim]")
for file_path, error in data["failed_files"]:
console.print(f" [dim]{file_path}: {error}[/dim]")
# Show details if verbose
if verbose and data.get('details'):
console.print("\n[dim]Index details:[/dim]")
for detail in data['details']:
status_icon = "[green]✓[/green]" if detail['success'] else "[red]✗[/red]"
console.print(f" {status_icon} {detail['path']}")
if not detail['success'] and detail.get('error'):
console.print(f" [dim]Error: {detail['error']}[/dim]")
else:
# Single index mode output
elapsed = data["elapsed_time"]
console.print(f" Time: {elapsed:.1f}s")
console.print(f"[green]✓[/green] Embeddings generated successfully!")
console.print(f" Model: {data['model_name']}")
console.print(f" Chunks created: {data['chunks_created']:,}")
console.print(f" Files processed: {data['files_processed']}")
if data["files_failed"] > 0:
console.print(f" [yellow]Files failed: {data['files_failed']}[/yellow]")
if data["failed_files"]:
console.print(" [dim]First failures:[/dim]")
for file_path, error in data["failed_files"]:
console.print(f" [dim]{file_path}: {error}[/dim]")
console.print(f" Time: {elapsed:.1f}s")
console.print("\n[dim]Use vector search with:[/dim]")
console.print(" [cyan]codexlens search 'your query' --mode pure-vector[/cyan]")

View File

@@ -255,6 +255,21 @@ def generate_embeddings(
}
def discover_all_index_dbs(index_root: Path) -> List[Path]:
"""Recursively find all _index.db files in an index tree.
Args:
index_root: Root directory to scan for _index.db files
Returns:
Sorted list of paths to _index.db files
"""
if not index_root.exists():
return []
return sorted(index_root.rglob("_index.db"))
def find_all_indexes(scan_dir: Path) -> List[Path]:
"""Find all _index.db files in directory tree.
@@ -270,6 +285,146 @@ def find_all_indexes(scan_dir: Path) -> List[Path]:
return list(scan_dir.rglob("_index.db"))
def generate_embeddings_recursive(
index_root: Path,
model_profile: str = "code",
force: bool = False,
chunk_size: int = 2000,
progress_callback: Optional[callable] = None,
) -> Dict[str, any]:
"""Generate embeddings for all index databases in a project recursively.
Args:
index_root: Root index directory containing _index.db files
model_profile: Model profile (fast, code, multilingual, balanced)
force: If True, regenerate even if embeddings exist
chunk_size: Maximum chunk size in characters
progress_callback: Optional callback for progress updates
Returns:
Aggregated result dictionary with generation statistics
"""
# Discover all _index.db files
index_files = discover_all_index_dbs(index_root)
if not index_files:
return {
"success": False,
"error": f"No index databases found in {index_root}",
}
if progress_callback:
progress_callback(f"Found {len(index_files)} index databases to process")
# Process each index database
all_results = []
total_chunks = 0
total_files_processed = 0
total_files_failed = 0
for idx, index_path in enumerate(index_files, 1):
if progress_callback:
try:
rel_path = index_path.relative_to(index_root)
except ValueError:
rel_path = index_path
progress_callback(f"[{idx}/{len(index_files)}] Processing {rel_path}")
result = generate_embeddings(
index_path,
model_profile=model_profile,
force=force,
chunk_size=chunk_size,
progress_callback=None, # Don't cascade callbacks
)
all_results.append({
"path": str(index_path),
"success": result["success"],
"result": result.get("result"),
"error": result.get("error"),
})
if result["success"]:
data = result["result"]
total_chunks += data["chunks_created"]
total_files_processed += data["files_processed"]
total_files_failed += data["files_failed"]
successful = sum(1 for r in all_results if r["success"])
return {
"success": successful > 0,
"result": {
"indexes_processed": len(index_files),
"indexes_successful": successful,
"indexes_failed": len(index_files) - successful,
"total_chunks_created": total_chunks,
"total_files_processed": total_files_processed,
"total_files_failed": total_files_failed,
"model_profile": model_profile,
"details": all_results,
},
}
def get_embeddings_status(index_root: Path) -> Dict[str, any]:
"""Get comprehensive embeddings coverage status for all indexes.
Args:
index_root: Root index directory
Returns:
Aggregated status with coverage statistics
"""
index_files = discover_all_index_dbs(index_root)
if not index_files:
return {
"success": True,
"result": {
"total_indexes": 0,
"total_files": 0,
"files_with_embeddings": 0,
"files_without_embeddings": 0,
"total_chunks": 0,
"coverage_percent": 0.0,
"indexes_with_embeddings": 0,
"indexes_without_embeddings": 0,
},
}
total_files = 0
files_with_embeddings = 0
total_chunks = 0
indexes_with_embeddings = 0
for index_path in index_files:
status = check_index_embeddings(index_path)
if status["success"]:
result = status["result"]
total_files += result["total_files"]
files_with_embeddings += result["files_with_chunks"]
total_chunks += result["total_chunks"]
if result["has_embeddings"]:
indexes_with_embeddings += 1
return {
"success": True,
"result": {
"total_indexes": len(index_files),
"total_files": total_files,
"files_with_embeddings": files_with_embeddings,
"files_without_embeddings": total_files - files_with_embeddings,
"total_chunks": total_chunks,
"coverage_percent": round((files_with_embeddings / total_files * 100) if total_files > 0 else 0, 1),
"indexes_with_embeddings": indexes_with_embeddings,
"indexes_without_embeddings": len(index_files) - indexes_with_embeddings,
},
}
def get_embedding_stats_summary(index_root: Path) -> Dict[str, any]:
"""Get summary statistics for all indexes in root directory.