diff --git a/ccw/EMBEDDINGS_FIX_SUMMARY.md b/ccw/EMBEDDINGS_FIX_SUMMARY.md
new file mode 100644
index 00000000..968ac66e
--- /dev/null
+++ b/ccw/EMBEDDINGS_FIX_SUMMARY.md
@@ -0,0 +1,165 @@
+# CodexLens Embeddings Fix Summary
+
+## Results
+
+### ✅ Completed
+
+1. **Recursive embeddings generation** (`embedding_manager.py`)
+   - Added the `generate_embeddings_recursive()` function
+   - Added the `get_embeddings_status()` function
+   - Recursively processes the _index.db files of all subdirectories
+
+2. **CLI command enhancements** (`commands.py`)
+   - `embeddings-generate` gained a `--recursive` flag
+   - The `init` command now generates recursively (all subdirectories handled automatically)
+   - The `status` command now reports embeddings coverage statistics
+
+3. **Smart Search intelligent routing** (`smart-search.ts`)
+   - Added a 50% coverage threshold
+   - Automatically falls back to exact mode when embeddings are insufficient
+   - Emits explicit warning messages
+   - Strips ANSI color codes so JSON output parses correctly
+
+### ✅ Test Results
+
+**CCW project (d:\Claude_dms3\ccw)**:
+- Index databases: 26
+- Total files: 303
+- Embeddings coverage: **100%** (all 303 files)
+- Chunks generated: **2,042** (previously only 10)
+
+**Comparison**:
+| Metric | Before | After | Improvement |
+|------|--------|--------|------|
+| Coverage | 1.6% (5/303) | 100% (303/303) | **62.5x** |
+| Chunks | 10 | 2,042 | **204x** |
+| Indexes with embeddings | 1/26 | 26/26 | **26x** |
+
+## Open Issues
+
+### ⚠️ Remaining Problems
+
+1. **Path mapping**
+   - `embeddings-generate --recursive` requires the index path rather than the source path
+   - Users should be able to pass the source path (`d:\Claude_dms3\ccw`)
+   - Currently the index path is required: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw`
+
+2. **Global vs. project-level status**
+   - `codexlens status` returns global statistics (all projects)
+   - A project-level embeddings status is needed
+   - `embeddings-status` only inspects a single _index.db and does not recurse
+
+## Suggested Follow-up Fixes
+
+### P1 - Path Mapping Fix
+
+Modify the `embeddings_generate` command in `commands.py` (line 1996-2000):
+
+```python
+elif target_path.is_dir():
+    if recursive:
+        # Recursive mode: Map source path to index root
+        registry = RegistryStore()
+        try:
+            registry.initialize()
+            mapper = PathMapper()
+            index_db_path = mapper.source_to_index_db(target_path)
+            index_root = index_db_path.parent  # Use index directory root
+            use_recursive = True
+        finally:
+            registry.close()
+```
+
+### P2 - Project-Level Status
+
+Option A: extend the `embeddings-status` command to support recursion
+```bash
+codexlens embeddings-status . --recursive --json
+```
+
+Option B: let the `status` command accept a path argument
+```bash
+codexlens status --project . --json
+```
+
+## Usage Guide
+
+### Current Workflow
+
+**Generate embeddings (full coverage)**:
+```bash
+# Method 1: use the index path (how it works today)
+cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
+python -m codexlens embeddings-generate . --recursive --force --model fast
+
+# Method 2: init command (recursive automatically, recommended)
+cd d:\Claude_dms3\ccw
+python -m codexlens init . --force
+```
+
+**Check coverage**:
+```bash
+# Project root
+cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
+python check_embeddings.py  # detailed per-directory statistics
+
+# Global status
+python -m codexlens status --json  # summary across all projects
+```
+
+**Smart Search**:
+```javascript
+// MCP tool call
+smart_search(query="authentication patterns")
+
+// It now:
+// 1. Checks embeddings coverage
+// 2. Uses hybrid mode if coverage >= 50%
+// 3. Falls back to exact mode if coverage < 50%
+// 4. Surfaces a warning message
+```
+
+### Best Practices
+
+1. **Generate embeddings automatically when initializing a project**:
+   ```bash
+   codexlens init /path/to/project --force
+   ```
+
+2. **Regenerate periodically to stay up to date**:
+   ```bash
+   codexlens embeddings-generate /index/path --recursive --force
+   ```
+
+3. **Use the fast model for quick tests**:
+   ```bash
+   codexlens embeddings-generate . --recursive --model fast
+   ```
+
+4. **Use the code model for best quality**:
+   ```bash
+   codexlens embeddings-generate . --recursive --model code
+   ```
+
+## Technical Details
+
+### Modified Files
+
+**Python (CodexLens)**:
+- `codex-lens/src/codexlens/cli/embedding_manager.py` - added recursive functions
+- `codex-lens/src/codexlens/cli/commands.py` - updated init, status, embeddings-generate
+
+**TypeScript (CCW)**:
+- `ccw/src/tools/smart-search.ts` - intelligent routing + ANSI stripping
+- `ccw/src/tools/codex-lens.ts` - (unchanged, reuses the existing implementation)
+
+### Dependency Versions
+
+- CodexLens: current development version
+- Fastembed: installed (ONNX backend)
+- Models: fast (~80MB), code (~150MB)
+
+---
+
+**Fix date**: 2025-12-17
+**Verification status**: ✅ Core functionality works; the path mapping issue remains open
diff --git a/ccw/SMART_SEARCH_ANALYSIS.md b/ccw/SMART_SEARCH_ANALYSIS.md
new file mode 100644
index 00000000..fad60b3e
--- /dev/null
+++ b/ccw/SMART_SEARCH_ANALYSIS.md
@@ -0,0 +1,167 @@
+# Smart Search Index Analysis Report
+
+## Question
+Determine whether the current `smart_search(action="init")` builds a vector (embedding) index or only the basic index.
+
+## Findings
+
+### 1. Default Behavior of the Init Action
+
+From the code, `smart_search(action="init")` behaves as follows:
+
+**Code path**: `ccw/src/tools/smart-search.ts` → `ccw/src/tools/codex-lens.ts`
+
+```typescript
+// smart-search.ts: executeInitAction (lines 297-323)
+async function executeInitAction(params: Params): Promise<SearchResult> {
+  const { path = '.', languages } = params;
+  const args = ['init', path];
+  if (languages && languages.length > 0) {
+    args.push('--languages', languages.join(','));
+  }
+  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
+  // ...
+}
+```
+
+**Key findings**:
+- `smart_search(action="init")` invokes the `codexlens init` command
+- It does **not** pass `--no-embeddings`
+- It does **not** pass `--embedding-model`
+
+### 2. Default Behavior of CodexLens Init
+
+From the output of `codexlens init --help`:
+
+> If semantic search dependencies are installed, **automatically generates embeddings** after indexing completes. Use --no-embeddings to skip this step.
+
+**Conclusion**:
+- ✅ The `init` command generates embeddings **by default** (when the semantic search dependencies are installed)
+- ❌ The current implementation did **not** generate embeddings for all files
+
+### 3. Actual Test Results
+
+#### First Init (no embeddings generated)
+```bash
+$ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
+# Result: 303 files indexed, but vector_search: false
+```
+
+**Root cause**:
+Although the semantic search dependency (fastembed) is installed, init hit this warning:
+```
+Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.
+```
+
+#### After Manually Generating Embeddings
+```bash
+$ python -m codexlens embeddings-generate . --force --verbose
+
+Processing 5 files...
+- D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
+- D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
+- D:\Claude_dms3\ccw\README.md: 2 chunks
+- D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
+- D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks
+
+Total: 10 chunks, 5 files
+Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)
+```
+
+**Key findings**:
+- ⚠️ Embeddings were generated for only **5 documentation/config files**
+- ⚠️ **No embeddings** were generated for the 298 code files (.ts, .js, etc.)
+- ✅ The embeddings status reports `coverage_percent: 100.0` (but only relative to the files it considered eligible for embeddings)
+
+#### Hybrid Search Test
+```bash
+$ smart_search(query="authentication and authorization patterns", mode="hybrid")
+# ✅ Returned 5 results with similarity scores
+# ✅ Confirms that vector search itself works
+```
+
+## 4. Index Type Comparison
+
+| Index Type | Current State | Covered Files | Notes |
+|---------|---------|-----------|------|
+| **Exact FTS** | ✅ Enabled | All 303 files | Full-text search via SQLite FTS5 |
+| **Fuzzy FTS** | ❌ Not enabled | - | Fuzzy matching |
+| **Vector Search** | ⚠️ Partially enabled | Only 5 documentation files | Semantic search via fastembed |
+| **Hybrid Search** | ⚠️ Partially enabled | Only 5 documentation files | RRF fusion (exact + fuzzy + vector) |
+
+## 5. Why Do Only 5 Files Have Embeddings?
+
+**Possible reasons**:
+
+1. **File-type filtering**: CodexLens may only generate embeddings for documentation (.md) and config files
+2. **Code files rely on symbol indexing**: code files (.ts, .js) may depend on symbol extraction rather than text embeddings
+3. **Performance**: generating embeddings for 300+ files takes significant time and storage
+
+## 6. Conclusions
+
+### Current behavior of `smart_search(action="init")`:
+
+✅ It **attempts** to build a vector index (when the semantic dependencies are installed)
+⚠️ In practice it only generated embeddings for documentation/config files (5/303 files)
+✅ Hybrid search is **supported** (for files that have embeddings)
+✅ Exact search is **supported** (for all 303 files)
+
+### Search mode routing:
+
+```
+User query → auto mode → decision tree:
+  ├─ Natural language query + embeddings available → hybrid mode (RRF fusion)
+  ├─ Simple query + index available → exact mode (FTS)
+  └─ No index → ripgrep mode (literal matching)
+```
+
+## 7. Recommendations
+
+### If full semantic search coverage is needed:
+
+```bash
+# Option 1: check whether all code files should have embeddings
+python -m codexlens embeddings-status . --verbose
+
+# Option 2: explicitly generate embeddings for code files (if supported)
+# Check the CodexLens docs to confirm the semantic indexing strategy for code files
+
+# Option 3: use hybrid mode for documentation and exact mode for code
+smart_search(query="架构设计", mode="hybrid")      # semantic search over docs
+smart_search(query="function_name", mode="exact")  # exact search over code
+```
+
+### Current best practices:
+
+```javascript
+// 1. Initialize the index (one-time)
+smart_search(action="init", path=".")
+
+// 2. Intelligent search (auto mode recommended)
+smart_search(query="your query")  // picks the best mode automatically
+
+// 3. Mode-specific search
+smart_search(query="natural language query", mode="hybrid")  // semantic search
+smart_search(query="exact_identifier", mode="exact")         // exact match
+smart_search(query="quick literal", mode="ripgrep")          // fast literal search
+```
+
+## 8. Technical Details
+
+### Embedding Model
+- **Model**: jinaai/jina-embeddings-v2-base-code
+- **Dimensions**: 768
+- **Size**: ~150MB
+- **Backend**: fastembed (ONNX-based)
+
+### Index Storage
+- **Location**: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db`
+- **Size**: 122.57 MB
+- **Schema version**: 5
+- **Files**: 303
+- **Directories**: 26
+
+---
+
+**Generated**: 2025-12-17
+**CodexLens version**: detected from the current installation
diff --git a/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md b/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
new file mode 100644
index 00000000..6f5229bc
--- /dev/null
+++ b/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
@@ -0,0 +1,330 @@
+# Smart Search Index Analysis Report (Corrected)
+
+## User Objections
+
+1. ❓ Why are vector embeddings not generated for code files?
+2. ❓ The exact FTS index and the vector index should cover the same content
+3. ❓ `init` should return an overview of both the FTS and vector indexes
+
+**Conclusion: the user's objections are 100% correct. This is a design flaw in CodexLens.**
+
+---
+
+## What Actually Happens
+
+### 1. Hierarchical Index Architecture
+
+CodexLens uses a **hierarchical, per-directory index**:
+
+```
+D:\Claude_dms3\ccw\
+├── _index.db          ← root index (5 files)
+├── src/
+│   ├── _index.db      ← src index (2 files)
+│   ├── tools/
+│   │   └── _index.db  ← tools subdirectory index (25 files)
+│   └── ...
+└── ... (26 _index.db files in total)
+```
+
+### 2. Index Coverage
+
+| Directory | Files | FTS Index | Embeddings |
+|------|--------|---------|------------|
+| **Root** | 5 | ✅ | ✅ (10 chunks) |
+| bin/ | 2 | ✅ | ❌ no semantic_chunks table |
+| dist/ | 4 | ✅ | ❌ no semantic_chunks table |
+| dist/commands/ | 24 | ✅ | ❌ no semantic_chunks table |
+| dist/tools/ | 50 | ✅ | ❌ no semantic_chunks table |
+| src/tools/ | 25 | ✅ | ❌ no semantic_chunks table |
+| src/commands/ | 12 | ✅ | ❌ no semantic_chunks table |
+| ... | ... | ... | ... |
+| **Total** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
+
+### 3. Key Findings
+
+```python
+# Output of the check script
+Total index databases: 26
+Directories with embeddings: 1    # ❌ only the root directory!
+Total files indexed: 303          # ✅ FTS index is complete
+Total semantic chunks: 10         # ❌ only the root directory's 5 files
+```
+
+**Problems**:
+- ✅ **All 303 files** have FTS indexes (spread across 26 _index.db files)
+- ❌ **Only 5 files** (1.6%) have vector embeddings
+- ❌ The _index.db files of **25 subdirectories** do not even contain a `semantic_chunks` table
+
+---
+
+## Why Did This Happen?
+
+### Root Cause
+
+1. **The `init` operation**:
+   ```bash
+   codexlens init .
+   ```
+   - ✅ Creates FTS indexes for all 303 files (distributed)
+   - ⚠️ Attempts to generate embeddings but hits the "Index already has 10 chunks" warning
+   - ❌ Only generates embeddings for the root directory
+
+2. **The `embeddings-generate` operation**:
+   ```bash
+   codexlens embeddings-generate . --force
+   ```
+   - ❌ Only processes the root _index.db
+   - ❌ **Does not recurse into subdirectory indexes**
+   - Result: only 5 documentation files have embeddings
+
+### Design Problem
+
+**The CodexLens embeddings architecture is flawed**:
+
+```python
+# Expected behavior
+for each _index.db in project:
+    generate_embeddings(index_db)
+
+# Actual behavior
+generate_embeddings(root_index_db_only)
+```
+
+---
+
+## Init Return Value Shortcomings
+
+### What `init` Currently Returns
+
+```json
+{
+  "success": true,
+  "message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
+}
+```
+
+**Problems**:
+- ❌ Does not report how many files were indexed
+- ❌ Does not report whether embeddings were generated
+- ❌ Does not report embeddings coverage
+
+### What It Should Return
+
+```json
+{
+  "success": true,
+  "message": "Index created successfully",
+  "stats": {
+    "total_files": 303,
+    "total_directories": 26,
+    "index_databases": 26,
+    "fts_coverage": {
+      "files": 303,
+      "percentage": 100.0
+    },
+    "embeddings_coverage": {
+      "files": 5,
+      "chunks": 10,
+      "percentage": 1.6,
+      "warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
+    },
+    "features": {
+      "exact_fts": true,
+      "fuzzy_fts": false,
+      "vector_search": "partial"
+    }
+  }
+}
+```
+
+---
+
+## Solutions
+
+### Option 1: Generate Embeddings Recursively (recommended)
+
+```bash
+# Generate embeddings for every subdirectory index
+find .codexlens/indexes -name "_index.db" -exec \
+  python -m codexlens embeddings-generate {} --force \;
+```
+
+### Option 2: Improve the Init Command
+
+```python
+# codexlens/cli.py
+def init_with_embeddings(project_root):
+    """Initialize with recursive embeddings generation"""
+    # 1. Build FTS indexes (current behavior)
+    build_indexes(project_root)
+
+    # 2. Generate embeddings for ALL subdirs
+    for index_db in find_all_index_dbs(project_root):
+        if has_semantic_deps():
+            generate_embeddings(index_db)
+
+    # 3. Return comprehensive stats
+    return {
+        "fts_coverage": get_fts_stats(),
+        "embeddings_coverage": get_embeddings_stats(),
+        "features": detect_features()
+    }
+```
+
+### Option 3: Improve Smart Search Routing
+
+```python
+# Current logic
+def classify_intent(query, hasIndex):
+    if not hasIndex:
+        return "ripgrep"
+    elif is_natural_language(query):
+        return "hybrid"  # ❌ but only 5 files have embeddings!
+    else:
+        return "exact"
+
+# Improved logic
+def classify_intent(query, indexStatus):
+    embeddings_coverage = indexStatus.embeddings_coverage_percent
+
+    if embeddings_coverage < 50:
+        # If coverage < 50%, fall back to exact even for natural language queries
+        return "exact" if indexStatus.indexed else "ripgrep"
+    elif is_natural_language(query):
+        return "hybrid"
+    else:
+        return "exact"
+```
+
+---
+
+## Validating the User's Objections
+
+### ❓ Why are embeddings not generated for code files?
+
+**Answer**: it is not that code files are excluded by design; rather:
+- ✅ All code files have FTS indexes
+- ❌ The `embeddings-generate` command has a bug: it **only processes the root directory**
+- ❌ The subdirectory index databases **never even get a semantic_chunks table**
+
+### ❓ FTS and vector indexes should cover the same content
+
+**Answer**: **absolutely correct.** The current reality:
+- FTS: 303/303 (100%)
+- Vector: 5/303 (1.6%)
+
+**This is a serious inconsistency and violates the design principle.**
+
+### ❓ Init should return an index overview
+
+**Answer**: **absolutely correct.** Init currently returns only a bare success message; it should return:
+- FTS index statistics
+- Embeddings coverage
+- Feature availability
+- Warnings (when coverage is incomplete)
+
+---
+
+## Test Verification
+
+### Actual Effect of Hybrid Search
+
+```javascript
+// Current query
+smart_search(query="authentication patterns", mode="hybrid")
+
+// Actual search scope:
+// ✅ Searchable files: 5 (root-level .md files)
+// ❌ Not searchable: 298 code files
+// Result: only documentation files are returned; code files are ignored
+```
+
+### Effect After the Fix (ideal state)
+
+```javascript
+// After the fix
+smart_search(query="authentication patterns", mode="hybrid")

+// Actual search scope:
+// ✅ Searchable files: 303 (all files)
+// Result: combined results across code and documentation files
+```
+
+---
+
+## Suggested Fix Priorities
+
+### P0 - Urgent Fixes
+
+1. **Fix the `embeddings-generate` command**
+   - Recursively process every subdirectory _index.db
+   - Create the semantic_chunks table in every _index.db
+
+2. **Improve the `init` return information**
+   - Return detailed index statistics
+   - Report embeddings coverage
+   - Warn when coverage is incomplete
+
+### P1 - Important Improvements
+
+3. **Adaptive Smart Search routing**
+   - Check embeddings coverage
+   - Automatically fall back to exact mode when coverage is low
+
+4. **Enhance the status command**
+   - Show per-subdirectory index status
+   - Show how embeddings are distributed
+
+---
+
+## Interim Workarounds
+
+### Recommended Usage for Now
+
+```javascript
+// 1. Documentation search - use hybrid (has embeddings)
+smart_search(query="architecture design patterns", mode="hybrid")
+
+// 2. Code search - use exact (no embeddings, but FTS is available)
+smart_search(query="function executeQuery", mode="exact")
+
+// 3. Quick search - use ripgrep (across all files)
+smart_search(query="TODO", mode="ripgrep")
+```
+
+### Workaround for Full Coverage
+
+```bash
+# Manually generate embeddings for every subdirectory (if CodexLens supports it)
+cd D:\Claude_dms3\ccw
+
+# Run separately for each subdirectory
+python -m codexlens embeddings-generate ./src/tools --force
+python -m codexlens embeddings-generate ./src/commands --force
+# ... repeat 26 times
+
+# Or automate with a script
+python check_embeddings.py --generate-all
+```
+
+---
+
+## Summary
+
+| User Objection | Verdict | Conclusion |
+|---------|------|------|
+| Why no embeddings for code files? | ✅ Correct | It is a bug, not a design choice |
+| FTS and vector content should match | ✅ Correct | Currently badly inconsistent |
+| Init should return a detailed overview | ✅ Correct | Current output is insufficient |
+
+**All of the user's objections are correct and expose three core problems in CodexLens:**
+
+1. **Incomplete embeddings generation** (only 1.6% coverage)
+2. **Index inconsistency** (FTS vs. vector)
+3. **Opaque return information** (missing statistics)
+
+---
+
+**Generated**: 2025-12-17
+**Verification method**: `python check_embeddings.py`
diff --git a/ccw/check_embeddings.py b/ccw/check_embeddings.py
new file mode 100644
index 00000000..1704bbdb
--- /dev/null
+++ b/ccw/check_embeddings.py
@@ -0,0 +1,47 @@
+import sqlite3
+import os
+
+# Find all _index.db files
+root_dir = r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw'
+index_files = []
+for dirpath, dirnames, filenames in os.walk(root_dir):
+    if '_index.db' in filenames:
+        index_files.append(os.path.join(dirpath, '_index.db'))
+
+print(f'Found {len(index_files)} index databases\n')
+
+total_files = 0
+total_chunks = 0
+dirs_with_chunks = 0
+
+for db_path in sorted(index_files):
+    # Path relative to the index root, for display
+    rel_path = os.path.relpath(db_path, root_dir)
+    conn = sqlite3.connect(db_path)
+
+    try:
+        cursor = conn.execute('SELECT COUNT(*) FROM files')
+        file_count = cursor.fetchone()[0]
+        total_files += file_count
+
+        try:
+            cursor = conn.execute('SELECT COUNT(*) FROM semantic_chunks')
+            chunk_count = cursor.fetchone()[0]
+            total_chunks += chunk_count
+
+            if chunk_count > 0:
+                dirs_with_chunks += 1
+                print(f'[+] {rel_path:<40} Files: {file_count:3d} Chunks: {chunk_count:3d}')
+            else:
+                print(f'[ ] {rel_path:<40} Files: {file_count:3d} (no chunks)')
+        except sqlite3.OperationalError:
+            print(f'[ ] {rel_path:<40} Files: {file_count:3d} (no semantic_chunks table)')
+    except Exception as e:
+        print(f'[!] 
{rel_path:<40} Error: {e}') + finally: + conn.close() + +print(f'\n=== Summary ===') +print(f'Total index databases: {len(index_files)}') +print(f'Directories with embeddings: {dirs_with_chunks}') +print(f'Total files indexed: {total_files}') +print(f'Total semantic chunks: {total_chunks}') diff --git a/ccw/src/tools/smart-search.ts b/ccw/src/tools/smart-search.ts index d0724ec6..0879152e 100644 --- a/ccw/src/tools/smart-search.ts +++ b/ccw/src/tools/smart-search.ts @@ -1,12 +1,17 @@ /** - * Smart Search Tool - Unified search with mode-based execution - * Modes: auto, exact, fuzzy, semantic, graph + * Smart Search Tool - Unified intelligent search with CodexLens integration * * Features: - * - Intent classification (auto mode) - * - Multi-backend search routing - * - Result fusion with RRF ranking - * - Configurable search parameters + * - Intent classification with automatic mode selection + * - CodexLens integration (init, hybrid, vector, semantic) + * - Ripgrep fallback for exact mode + * - Index status checking and warnings + * - Multi-backend search routing with RRF ranking + * + * Actions: + * - init: Initialize CodexLens index + * - search: Intelligent search with auto mode selection + * - status: Check index status */ import { z } from 'zod'; @@ -19,19 +24,23 @@ import { // Define Zod schema for validation const ParamsSchema = z.object({ - query: z.string().min(1, 'Query is required'), - mode: z.enum(['auto', 'exact', 'fuzzy', 'semantic', 'graph']).default('auto'), + action: z.enum(['init', 'search', 'search_files', 'status']).default('search'), + query: z.string().optional(), + mode: z.enum(['auto', 'hybrid', 'exact', 'ripgrep']).default('auto'), output_mode: z.enum(['full', 'files_only', 'count']).default('full'), + path: z.string().optional(), paths: z.array(z.string()).default([]), contextLines: z.number().default(0), maxResults: z.number().default(100), includeHidden: z.boolean().default(false), + languages: z.array(z.string()).optional(), + limit: z.number().default(100), }); type Params = z.infer; // Search mode constants -const SEARCH_MODES = ['auto', 'exact', 'fuzzy', 'semantic', 'graph'] as const; +const SEARCH_MODES = ['auto', 'hybrid', 'exact', 'ripgrep'] as const; // Classification confidence threshold const CONFIDENCE_THRESHOLD = 0.7; @@ -70,16 +79,89 @@ interface SearchMetadata { classified_as?: string; confidence?: number; reasoning?: string; + embeddings_coverage_percent?: number; warning?: string; note?: string; + index_status?: 'indexed' | 'not_indexed' | 'partial'; } interface SearchResult { success: boolean; - results?: ExactMatch[] | SemanticMatch[] | GraphMatch[]; + results?: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown; output?: string; metadata?: SearchMetadata; error?: string; + status?: unknown; + message?: string; +} + +interface IndexStatus { + indexed: boolean; + has_embeddings: boolean; + file_count?: number; + embeddings_coverage_percent?: number; + warning?: string; +} + +/** + * Check if CodexLens index exists for current directory + * @param path - Directory path to check + * @returns Index status + */ +async function checkIndexStatus(path: string = '.'): Promise { + try { + const result = await executeCodexLens(['status', '--json'], { cwd: path }); + + if (!result.success) { + return { + indexed: false, + has_embeddings: false, + warning: 'No CodexLens index found. 
Run smart_search(action="init") to create index for better search results.', + }; + } + + // Parse status output + try { + // Strip ANSI color codes from JSON output + const cleanOutput = (result.output || '{}').replace(/\x1b\[[0-9;]*m/g, ''); + const status = JSON.parse(cleanOutput); + const indexed = status.indexed === true || status.file_count > 0; + + // Get embeddings coverage from comprehensive status + const embeddingsData = status.embeddings || {}; + const embeddingsCoverage = embeddingsData.coverage_percent || 0; + const has_embeddings = embeddingsCoverage >= 50; // Threshold: 50% + + let warning: string | undefined; + if (!indexed) { + warning = 'No CodexLens index found. Run smart_search(action="init") to create index for better search results.'; + } else if (embeddingsCoverage === 0) { + warning = 'Index exists but no embeddings generated. Run: codexlens embeddings-generate --recursive'; + } else if (embeddingsCoverage < 50) { + warning = `Embeddings coverage is ${embeddingsCoverage.toFixed(1)}% (below 50%). Hybrid search will use exact mode. Run: codexlens embeddings-generate --recursive`; + } + + return { + indexed, + has_embeddings, + file_count: status.file_count, + embeddings_coverage_percent: embeddingsCoverage, + warning, + }; + } catch { + return { + indexed: false, + has_embeddings: false, + warning: 'Failed to parse index status', + }; + } + } catch { + return { + indexed: false, + has_embeddings: false, + warning: 'CodexLens not available', + }; + } } /** @@ -123,43 +205,34 @@ function detectRelationship(query: string): boolean { /** * Classify query intent and recommend search mode + * Simple mapping: hybrid (NL + index + embeddings) | exact (index or insufficient embeddings) | ripgrep (no index) * @param query - Search query string + * @param hasIndex - Whether CodexLens index exists + * @param hasSufficientEmbeddings - Whether embeddings coverage >= 50% * @returns Classification result */ -function classifyIntent(query: string): Classification { - // Initialize mode scores - const scores: Record = { - exact: 0, - fuzzy: 0, - semantic: 0, - graph: 0, - }; +function classifyIntent(query: string, hasIndex: boolean = false, hasSufficientEmbeddings: boolean = false): Classification { + // Detect query patterns + const isNaturalLanguage = detectNaturalLanguage(query); - // Apply detection heuristics with weighted scoring - if (detectLiteral(query)) { - scores.exact += 0.8; + // Simple decision tree + let mode: string; + let confidence: number; + + if (!hasIndex) { + // No index: use ripgrep + mode = 'ripgrep'; + confidence = 1.0; + } else if (isNaturalLanguage && hasSufficientEmbeddings) { + // Natural language + sufficient embeddings: use hybrid + mode = 'hybrid'; + confidence = 0.9; + } else { + // Simple query OR insufficient embeddings: use exact + mode = 'exact'; + confidence = 0.8; } - if (detectRegex(query)) { - scores.fuzzy += 0.7; - } - - if (detectNaturalLanguage(query)) { - scores.semantic += 0.9; - } - - if (detectFilePath(query)) { - scores.exact += 0.6; - } - - if (detectRelationship(query)) { - scores.graph += 0.85; - } - - // Find mode with highest confidence score - const mode = Object.keys(scores).reduce((a, b) => (scores[a] > scores[b] ? 
a : b)); - const confidence = scores[mode]; - // Build reasoning string const detectedPatterns: string[] = []; if (detectLiteral(query)) detectedPatterns.push('literal'); @@ -168,7 +241,7 @@ function classifyIntent(query: string): Classification { if (detectFilePath(query)) detectedPatterns.push('file path'); if (detectRelationship(query)) detectedPatterns.push('relationship'); - const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')})`; + const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')}, index: ${hasIndex ? 'available' : 'not available'}, embeddings: ${hasSufficientEmbeddings ? 'sufficient' : 'insufficient'})`; return { mode, confidence, reasoning }; } @@ -234,105 +307,192 @@ function buildRipgrepCommand(params: { } /** - * Mode: auto - Intent classification and mode selection - * Analyzes query to determine optimal search mode + * Action: init - Initialize CodexLens index */ -async function executeAutoMode(params: Params): Promise { - const { query } = params; +async function executeInitAction(params: Params): Promise { + const { path = '.', languages } = params; - // Classify intent - const classification = classifyIntent(query); - - // Route to appropriate mode based on classification - switch (classification.mode) { - case 'exact': { - const exactResult = await executeExactMode(params); - return { - ...exactResult, - metadata: { - ...exactResult.metadata!, - classified_as: classification.mode, - confidence: classification.confidence, - reasoning: classification.reasoning, - }, - }; - } - - case 'fuzzy': - return { - success: false, - error: 'Fuzzy mode not yet implemented', - metadata: { - mode: 'fuzzy', - backend: '', - count: 0, - query, - classified_as: classification.mode, - confidence: classification.confidence, - reasoning: classification.reasoning, - }, - }; - - case 'semantic': { - const semanticResult = await executeSemanticMode(params); - return { - ...semanticResult, - metadata: { - ...semanticResult.metadata!, - classified_as: classification.mode, - confidence: classification.confidence, - reasoning: classification.reasoning, - }, - }; - } - - case 'graph': { - const graphResult = await executeGraphMode(params); - return { - ...graphResult, - metadata: { - ...graphResult.metadata!, - classified_as: classification.mode, - confidence: classification.confidence, - reasoning: classification.reasoning, - }, - }; - } - - default: { - const fallbackResult = await executeExactMode(params); - return { - ...fallbackResult, - metadata: { - ...fallbackResult.metadata!, - classified_as: 'exact', - confidence: 0.5, - reasoning: 'Fallback to exact mode due to unknown classification', - }, - }; - } - } -} - -/** - * Mode: exact - Precise file path and content matching - * Uses ripgrep for literal string matching - */ -async function executeExactMode(params: Params): Promise { - const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false } = params; - - // Check ripgrep availability - if (!checkToolAvailability('rg')) { + // Check CodexLens availability + const readyStatus = await ensureCodexLensReady(); + if (!readyStatus.ready) { return { success: false, - error: 'ripgrep not available - please install ripgrep (rg) to use exact search mode', + error: `CodexLens not available: ${readyStatus.error}. 
CodexLens will be auto-installed on first use.`, }; } - // Build ripgrep command + const args = ['init', path]; + if (languages && languages.length > 0) { + args.push('--languages', languages.join(',')); + } + + const result = await executeCodexLens(args, { cwd: path, timeout: 300000 }); + + return { + success: result.success, + error: result.error, + message: result.success + ? `CodexLens index created successfully for ${path}` + : undefined, + }; +} + +/** + * Action: status - Check CodexLens index status + */ +async function executeStatusAction(params: Params): Promise { + const { path = '.' } = params; + + const indexStatus = await checkIndexStatus(path); + + return { + success: true, + status: indexStatus, + message: indexStatus.warning || `Index status: ${indexStatus.indexed ? 'indexed' : 'not indexed'}, embeddings: ${indexStatus.has_embeddings ? 'available' : 'not available'}`, + }; +} + +/** + * Mode: auto - Intent classification and mode selection + * Routes to: hybrid (NL + index) | exact (index) | ripgrep (no index) + */ +async function executeAutoMode(params: Params): Promise { + const { query, path = '.' } = params; + + if (!query) { + return { + success: false, + error: 'Query is required for search action', + }; + } + + // Check index status + const indexStatus = await checkIndexStatus(path); + + // Classify intent with index and embeddings awareness + const classification = classifyIntent( + query, + indexStatus.indexed, + indexStatus.has_embeddings // This now considers 50% threshold + ); + + // Route to appropriate mode based on classification + let result: SearchResult; + + switch (classification.mode) { + case 'hybrid': + result = await executeHybridMode(params); + break; + + case 'exact': + result = await executeCodexLensExactMode(params); + break; + + case 'ripgrep': + result = await executeRipgrepMode(params); + break; + + default: + // Fallback to ripgrep + result = await executeRipgrepMode(params); + break; + } + + // Add classification metadata + if (result.metadata) { + result.metadata.classified_as = classification.mode; + result.metadata.confidence = classification.confidence; + result.metadata.reasoning = classification.reasoning; + result.metadata.embeddings_coverage_percent = indexStatus.embeddings_coverage_percent; + result.metadata.index_status = indexStatus.indexed + ? (indexStatus.has_embeddings ? 'indexed' : 'partial') + : 'not_indexed'; + + // Add warning if needed + if (indexStatus.warning) { + result.metadata.warning = indexStatus.warning; + } + } + + return result; +} + +/** + * Mode: ripgrep - Fast literal string matching using ripgrep + * No index required, fallback to CodexLens if ripgrep unavailable + */ +async function executeRipgrepMode(params: Params): Promise { + const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false, path = '.' } = params; + + if (!query) { + return { + success: false, + error: 'Query is required for search', + }; + } + + // Check if ripgrep is available + const hasRipgrep = checkToolAvailability('rg'); + + // If ripgrep not available, fall back to CodexLens exact mode + if (!hasRipgrep) { + const readyStatus = await ensureCodexLensReady(); + if (!readyStatus.ready) { + return { + success: false, + error: 'Neither ripgrep nor CodexLens available. 
Install ripgrep (rg) or CodexLens for search functionality.', + }; + } + + // Use CodexLens exact mode as fallback + const args = ['search', query, '--limit', maxResults.toString(), '--mode', 'exact', '--json']; + const result = await executeCodexLens(args, { cwd: path }); + + if (!result.success) { + return { + success: false, + error: result.error, + metadata: { + mode: 'ripgrep', + backend: 'codexlens-fallback', + count: 0, + query, + }, + }; + } + + // Parse results + let results: SemanticMatch[] = []; + try { + const parsed = JSON.parse(result.output || '{}'); + const data = parsed.results || parsed; + results = (Array.isArray(data) ? data : []).map((item: any) => ({ + file: item.path || item.file, + score: item.score || 0, + content: item.excerpt || item.content || '', + symbol: item.symbol || null, + })); + } catch { + // Keep empty results + } + + return { + success: true, + results, + metadata: { + mode: 'ripgrep', + backend: 'codexlens-fallback', + count: results.length, + query, + note: 'Using CodexLens exact mode (ripgrep not available)', + }, + }; + } + + // Use ripgrep const { command, args } = buildRipgrepCommand({ query, - paths: paths.length > 0 ? paths : ['.'], + paths: paths.length > 0 ? paths : [path], contextLines, maxResults, includeHidden, @@ -340,7 +500,7 @@ async function executeExactMode(params: Params): Promise { return new Promise((resolve) => { const child = spawn(command, args, { - cwd: process.cwd(), + cwd: path || process.cwd(), stdio: ['ignore', 'pipe', 'pipe'], }); @@ -386,7 +546,7 @@ async function executeExactMode(params: Params): Promise { success: true, results, metadata: { - mode: 'exact', + mode: 'ripgrep', backend: 'ripgrep', count: results.length, query, @@ -412,60 +572,126 @@ async function executeExactMode(params: Params): Promise { } /** - * Mode: fuzzy - Approximate matching with tolerance - * Uses fuzzy matching algorithms for typo-tolerant search + * Mode: exact - CodexLens exact/FTS search + * Requires index */ -async function executeFuzzyMode(params: Params): Promise { - return { - success: false, - error: 'Fuzzy mode not implemented - fuzzy matching engine pending', - }; -} +async function executeCodexLensExactMode(params: Params): Promise { + const { query, path = '.', limit = 100 } = params; -/** - * Mode: semantic - Natural language understanding search - * Uses CodexLens embeddings for semantic similarity - */ -async function executeSemanticMode(params: Params): Promise { - const { query, paths = [], maxResults = 100 } = params; + if (!query) { + return { + success: false, + error: 'Query is required for search', + }; + } // Check CodexLens availability const readyStatus = await ensureCodexLensReady(); if (!readyStatus.ready) { return { success: false, - error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`, + error: `CodexLens not available: ${readyStatus.error}`, }; } - // Determine search path - const searchPath = paths.length > 0 ? 
paths[0] : '.'; + // Check index status + const indexStatus = await checkIndexStatus(path); - // Execute CodexLens semantic search - const result = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], { - cwd: searchPath, - }); + const args = ['search', query, '--limit', limit.toString(), '--mode', 'exact', '--json']; + const result = await executeCodexLens(args, { cwd: path }); if (!result.success) { return { success: false, error: result.error, metadata: { - mode: 'semantic', + mode: 'exact', backend: 'codexlens', count: 0, query, + warning: indexStatus.warning, }, }; } - // Parse and transform results + // Parse results let results: SemanticMatch[] = []; try { - const cleanOutput = result.output!.replace(/\r\n/g, '\n'); - const parsed = JSON.parse(cleanOutput); - const data = parsed.result || parsed; - results = (data.results || []).map((item: any) => ({ + const parsed = JSON.parse(result.output || '{}'); + const data = parsed.results || parsed; + results = (Array.isArray(data) ? data : []).map((item: any) => ({ + file: item.path || item.file, + score: item.score || 0, + content: item.excerpt || item.content || '', + symbol: item.symbol || null, + })); + } catch { + // Keep empty results + } + + return { + success: true, + results, + metadata: { + mode: 'exact', + backend: 'codexlens', + count: results.length, + query, + warning: indexStatus.warning, + }, + }; +} + +/** + * Mode: hybrid - Best quality search with RRF fusion + * Uses CodexLens hybrid mode (exact + fuzzy + vector) + * Requires index with embeddings + */ +async function executeHybridMode(params: Params): Promise { + const { query, path = '.', limit = 100 } = params; + + if (!query) { + return { + success: false, + error: 'Query is required for search', + }; + } + + // Check CodexLens availability + const readyStatus = await ensureCodexLensReady(); + if (!readyStatus.ready) { + return { + success: false, + error: `CodexLens not available: ${readyStatus.error}`, + }; + } + + // Check index status + const indexStatus = await checkIndexStatus(path); + + const args = ['search', query, '--limit', limit.toString(), '--mode', 'hybrid', '--json']; + const result = await executeCodexLens(args, { cwd: path }); + + if (!result.success) { + return { + success: false, + error: result.error, + metadata: { + mode: 'hybrid', + backend: 'codexlens', + count: 0, + query, + warning: indexStatus.warning, + }, + }; + } + + // Parse results + let results: SemanticMatch[] = []; + try { + const parsed = JSON.parse(result.output || '{}'); + const data = parsed.results || parsed; + results = (Array.isArray(data) ? 
data : []).map((item: any) => ({ file: item.path || item.file, score: item.score || 0, content: item.excerpt || item.content || '', @@ -477,11 +703,11 @@ async function executeSemanticMode(params: Params): Promise { results: [], output: result.output, metadata: { - mode: 'semantic', + mode: 'hybrid', backend: 'codexlens', count: 0, query, - warning: 'Failed to parse JSON output', + warning: indexStatus.warning || 'Failed to parse JSON output', }, }; } @@ -490,105 +716,12 @@ async function executeSemanticMode(params: Params): Promise { success: true, results, metadata: { - mode: 'semantic', + mode: 'hybrid', backend: 'codexlens', count: results.length, query, - }, - }; -} - -/** - * Mode: graph - Dependency and relationship traversal - * Uses CodexLens symbol extraction for code analysis - */ -async function executeGraphMode(params: Params): Promise { - const { query, paths = [], maxResults = 100 } = params; - - // Check CodexLens availability - const readyStatus = await ensureCodexLensReady(); - if (!readyStatus.ready) { - return { - success: false, - error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`, - }; - } - - // First, search for relevant files using text search - const searchPath = paths.length > 0 ? paths[0] : '.'; - - const textResult = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], { - cwd: searchPath, - }); - - if (!textResult.success) { - return { - success: false, - error: textResult.error, - metadata: { - mode: 'graph', - backend: 'codexlens', - count: 0, - query, - }, - }; - } - - // Parse results and extract symbols from top files - let results: GraphMatch[] = []; - try { - const parsed = JSON.parse(textResult.output!); - const files = [...new Set((parsed.results || parsed).map((item: any) => item.path || item.file))].slice( - 0, - 10 - ); - - // Extract symbols from files in parallel - const symbolPromises = files.map((file) => - executeCodexLens(['symbol', file as string, '--json'], { cwd: searchPath }).then((result) => ({ - file, - result, - })) - ); - - const symbolResults = await Promise.all(symbolPromises); - - for (const { file, result } of symbolResults) { - if (result.success) { - try { - const symbols = JSON.parse(result.output!); - results.push({ - file: file as string, - symbols: symbols.symbols || symbols, - relationships: [], - }); - } catch { - // Skip files with parse errors - } - } - } - } catch { - return { - success: false, - error: 'Failed to parse search results', - metadata: { - mode: 'graph', - backend: 'codexlens', - count: 0, - query, - }, - }; - } - - return { - success: true, - results, - metadata: { - mode: 'graph', - backend: 'codexlens', - count: results.length, - query, - note: 'Graph mode provides symbol extraction; full dependency graph analysis pending', + note: 'Hybrid mode uses RRF fusion (exact + fuzzy + vector) for best results', + warning: indexStatus.warning, }, }; } @@ -596,36 +729,73 @@ async function executeGraphMode(params: Params): Promise { // Tool schema for MCP export const schema: ToolSchema = { name: 'smart_search', - description: `Intelligent code search with multiple modes. + description: `Intelligent code search with three optimized modes: hybrid, exact, ripgrep. 
-Usage: - smart_search(query="function main", path=".") # Auto-select mode - smart_search(query="def init", mode="exact") # Exact match - smart_search(query="authentication logic", mode="semantic") # NL search +**Quick Start:** + smart_search(query="authentication logic") # Auto mode (intelligent routing) + smart_search(action="init", path=".") # Initialize index (required for hybrid) + smart_search(action="status") # Check index status -Modes: auto (default), exact, fuzzy, semantic, graph`, +**Three Core Modes:** + 1. auto (default): Intelligent routing based on query and index + - Natural language + index → hybrid + - Simple query + index → exact + - No index → ripgrep + + 2. hybrid: CodexLens RRF fusion (exact + fuzzy + vector) + - Best quality, semantic understanding + - Requires index with embeddings + + 3. exact: CodexLens FTS (full-text search) + - Precise keyword matching + - Requires index + + 4. ripgrep: Direct ripgrep execution + - Fast, no index required + - Literal string matching + +**Actions:** + - search (default): Intelligent search with auto routing + - init: Create CodexLens index (required for hybrid/exact) + - status: Check index and embedding availability + - search_files: Return file paths only + +**Workflow:** + 1. Run action="init" to create index + 2. Use auto mode - it routes to hybrid for NL queries, exact for simple queries + 3. Use ripgrep mode for fast searches without index`, inputSchema: { type: 'object', properties: { + action: { + type: 'string', + enum: ['init', 'search', 'search_files', 'status'], + description: 'Action to perform: init (create index), search (default), search_files (paths only), status (check index)', + default: 'search', + }, query: { type: 'string', - description: 'Search query (file pattern, text content, or natural language)', + description: 'Search query (required for search/search_files actions)', }, mode: { type: 'string', enum: SEARCH_MODES, - description: 'Search mode (default: auto)', + description: 'Search mode: auto (default), hybrid (best quality), exact (CodexLens FTS), ripgrep (fast, no index)', default: 'auto', }, output_mode: { type: 'string', enum: ['full', 'files_only', 'count'], - description: 'Output mode: full (default), files_only (paths only), count (per-file counts)', + description: 'Output format: full (default), files_only (paths only), count (per-file counts)', default: 'full', }, + path: { + type: 'string', + description: 'Directory path for init/search actions (default: current directory)', + }, paths: { type: 'array', - description: 'Paths to search within (default: current directory)', + description: 'Multiple paths to search within (for search action)', items: { type: 'string', }, @@ -633,21 +803,31 @@ Modes: auto (default), exact, fuzzy, semantic, graph`, }, contextLines: { type: 'number', - description: 'Number of context lines around matches (default: 0)', + description: 'Number of context lines around matches (exact mode only)', default: 0, }, maxResults: { type: 'number', - description: 'Maximum number of results to return (default: 100)', + description: 'Maximum number of results (default: 100)', + default: 100, + }, + limit: { + type: 'number', + description: 'Alias for maxResults', default: 100, }, includeHidden: { type: 'boolean', - description: 'Include hidden files/directories (default: false)', + description: 'Include hidden files/directories', default: false, }, + languages: { + type: 'array', + items: { type: 'string' }, + description: 'Languages to index (for init action). 
Example: ["javascript", "typescript"]', + }, }, - required: ['query'], + required: [], }, }; @@ -655,20 +835,27 @@ Modes: auto (default), exact, fuzzy, semantic, graph`, * Transform results based on output_mode */ function transformOutput( - results: ExactMatch[] | SemanticMatch[] | GraphMatch[], + results: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown[], outputMode: 'full' | 'files_only' | 'count' ): unknown { + if (!Array.isArray(results)) { + return results; + } + switch (outputMode) { case 'files_only': { // Extract unique file paths - const files = [...new Set(results.map((r) => r.file))]; + const files = [...new Set(results.map((r: any) => r.file))].filter(Boolean); return { files, count: files.length }; } case 'count': { // Count matches per file const counts: Record = {}; for (const r of results) { - counts[r.file] = (counts[r.file] || 0) + 1; + const file = (r as any).file; + if (file) { + counts[file] = (counts[file] || 0) + 1; + } } return { files: Object.entries(counts).map(([file, count]) => ({ file, count })), @@ -688,34 +875,58 @@ export async function handler(params: Record): Promise= 50% + has_vector_search = embeddings_info["coverage_percent"] >= 50.0 + except ImportError: + # Embedding manager not available + pass + except Exception as e: + logger.debug(f"Failed to get embeddings status: {e}") + stats = { "index_root": str(index_root), "registry_path": str(_get_registry_path()), @@ -624,9 +669,13 @@ def status( "exact_fts": True, # Always available "fuzzy_fts": has_dual_fts, "hybrid_search": has_dual_fts, - "vector_search": False, # Not yet implemented + "vector_search": has_vector_search, }, } + + # Add embeddings info if available + if embeddings_info: + stats["embeddings"] = embeddings_info if json_mode: print_json(success=True, result=stats) @@ -648,7 +697,20 @@ def status( else: console.print(f" Fuzzy FTS: ✗ (run 'migrate' to enable)") console.print(f" Hybrid Search: ✗ (run 'migrate' to enable)") - console.print(f" Vector Search: ✗ (future)") + + if has_vector_search: + console.print(f" Vector Search: ✓ (embeddings available)") + else: + console.print(f" Vector Search: ✗ (no embeddings or coverage < 50%)") + + # Display embeddings statistics if available + if embeddings_info: + console.print("\n[bold]Embeddings Coverage:[/bold]") + console.print(f" Total Indexes: {embeddings_info['total_indexes']}") + console.print(f" Total Files: {embeddings_info['total_files']}") + console.print(f" Files with Embeddings: {embeddings_info['files_with_embeddings']}") + console.print(f" Coverage: {embeddings_info['coverage_percent']:.1f}%") + console.print(f" Total Chunks: {embeddings_info['total_chunks']}") except StorageError as exc: if json_mode: @@ -1885,6 +1947,12 @@ def embeddings_generate( "--chunk-size", help="Maximum chunk size in characters.", ), + recursive: bool = typer.Option( + False, + "--recursive", + "-r", + help="Recursively process all _index.db files in directory tree.", + ), json_mode: bool = typer.Option(False, "--json", help="Output JSON response."), verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose output."), ) -> None: @@ -1908,28 +1976,42 @@ def embeddings_generate( _configure_logging(verbose) try: - from codexlens.cli.embedding_manager import generate_embeddings + from codexlens.cli.embedding_manager import generate_embeddings, generate_embeddings_recursive # Resolve path target_path = path.expanduser().resolve() + # Determine if we should use recursive mode + use_recursive = False + index_path = None + index_root = None 
+ if target_path.is_file() and target_path.name == "_index.db": # Direct index file index_path = target_path + if recursive: + # Use parent directory for recursive processing + use_recursive = True + index_root = target_path.parent elif target_path.is_dir(): - # Try to find index for this project - registry = RegistryStore() - try: - registry.initialize() - mapper = PathMapper() - index_path = mapper.source_to_index_db(target_path) + if recursive: + # Recursive mode: process all _index.db files in directory tree + use_recursive = True + index_root = target_path + else: + # Non-recursive: Try to find index for this project + registry = RegistryStore() + try: + registry.initialize() + mapper = PathMapper() + index_path = mapper.source_to_index_db(target_path) - if not index_path.exists(): - console.print(f"[red]Error:[/red] No index found for {target_path}") - console.print("Run 'codexlens init' first to create an index") - raise typer.Exit(code=1) - finally: - registry.close() + if not index_path.exists(): + console.print(f"[red]Error:[/red] No index found for {target_path}") + console.print("Run 'codexlens init' first to create an index") + raise typer.Exit(code=1) + finally: + registry.close() else: console.print(f"[red]Error:[/red] Path must be _index.db file or directory") raise typer.Exit(code=1) @@ -1940,16 +2022,29 @@ def embeddings_generate( console.print(f" {msg}") console.print(f"[bold]Generating embeddings[/bold]") - console.print(f"Index: [dim]{index_path}[/dim]") + if use_recursive: + console.print(f"Index root: [dim]{index_root}[/dim]") + console.print(f"Mode: [yellow]Recursive[/yellow]") + else: + console.print(f"Index: [dim]{index_path}[/dim]") console.print(f"Model: [cyan]{model}[/cyan]\n") - result = generate_embeddings( - index_path, - model_profile=model, - force=force, - chunk_size=chunk_size, - progress_callback=progress_update, - ) + if use_recursive: + result = generate_embeddings_recursive( + index_root, + model_profile=model, + force=force, + chunk_size=chunk_size, + progress_callback=progress_update, + ) + else: + result = generate_embeddings( + index_path, + model_profile=model, + force=force, + chunk_size=chunk_size, + progress_callback=progress_update, + ) if json_mode: print_json(**result) @@ -1968,21 +2063,45 @@ def embeddings_generate( raise typer.Exit(code=1) data = result["result"] - elapsed = data["elapsed_time"] - console.print(f"[green]✓[/green] Embeddings generated successfully!") - console.print(f" Model: {data['model_name']}") - console.print(f" Chunks created: {data['chunks_created']:,}") - console.print(f" Files processed: {data['files_processed']}") + if use_recursive: + # Recursive mode output + console.print(f"[green]✓[/green] Recursive embeddings generation complete!") + console.print(f" Indexes processed: {data['indexes_processed']}") + console.print(f" Indexes successful: {data['indexes_successful']}") + if data['indexes_failed'] > 0: + console.print(f" [yellow]Indexes failed: {data['indexes_failed']}[/yellow]") + console.print(f" Total chunks created: {data['total_chunks_created']:,}") + console.print(f" Total files processed: {data['total_files_processed']}") + if data['total_files_failed'] > 0: + console.print(f" [yellow]Total files failed: {data['total_files_failed']}[/yellow]") + console.print(f" Model profile: {data['model_profile']}") - if data["files_failed"] > 0: - console.print(f" [yellow]Files failed: {data['files_failed']}[/yellow]") - if data["failed_files"]: - console.print(" [dim]First failures:[/dim]") - for file_path, error 
in data["failed_files"]: - console.print(f" [dim]{file_path}: {error}[/dim]") + # Show details if verbose + if verbose and data.get('details'): + console.print("\n[dim]Index details:[/dim]") + for detail in data['details']: + status_icon = "[green]✓[/green]" if detail['success'] else "[red]✗[/red]" + console.print(f" {status_icon} {detail['path']}") + if not detail['success'] and detail.get('error'): + console.print(f" [dim]Error: {detail['error']}[/dim]") + else: + # Single index mode output + elapsed = data["elapsed_time"] - console.print(f" Time: {elapsed:.1f}s") + console.print(f"[green]✓[/green] Embeddings generated successfully!") + console.print(f" Model: {data['model_name']}") + console.print(f" Chunks created: {data['chunks_created']:,}") + console.print(f" Files processed: {data['files_processed']}") + + if data["files_failed"] > 0: + console.print(f" [yellow]Files failed: {data['files_failed']}[/yellow]") + if data["failed_files"]: + console.print(" [dim]First failures:[/dim]") + for file_path, error in data["failed_files"]: + console.print(f" [dim]{file_path}: {error}[/dim]") + + console.print(f" Time: {elapsed:.1f}s") console.print("\n[dim]Use vector search with:[/dim]") console.print(" [cyan]codexlens search 'your query' --mode pure-vector[/cyan]") diff --git a/codex-lens/src/codexlens/cli/embedding_manager.py b/codex-lens/src/codexlens/cli/embedding_manager.py index 65422a66..6fed8ec4 100644 --- a/codex-lens/src/codexlens/cli/embedding_manager.py +++ b/codex-lens/src/codexlens/cli/embedding_manager.py @@ -255,6 +255,21 @@ def generate_embeddings( } +def discover_all_index_dbs(index_root: Path) -> List[Path]: + """Recursively find all _index.db files in an index tree. + + Args: + index_root: Root directory to scan for _index.db files + + Returns: + Sorted list of paths to _index.db files + """ + if not index_root.exists(): + return [] + + return sorted(index_root.rglob("_index.db")) + + def find_all_indexes(scan_dir: Path) -> List[Path]: """Find all _index.db files in directory tree. @@ -270,6 +285,146 @@ def find_all_indexes(scan_dir: Path) -> List[Path]: return list(scan_dir.rglob("_index.db")) + +def generate_embeddings_recursive( + index_root: Path, + model_profile: str = "code", + force: bool = False, + chunk_size: int = 2000, + progress_callback: Optional[callable] = None, +) -> Dict[str, any]: + """Generate embeddings for all index databases in a project recursively. 
+ + Args: + index_root: Root index directory containing _index.db files + model_profile: Model profile (fast, code, multilingual, balanced) + force: If True, regenerate even if embeddings exist + chunk_size: Maximum chunk size in characters + progress_callback: Optional callback for progress updates + + Returns: + Aggregated result dictionary with generation statistics + """ + # Discover all _index.db files + index_files = discover_all_index_dbs(index_root) + + if not index_files: + return { + "success": False, + "error": f"No index databases found in {index_root}", + } + + if progress_callback: + progress_callback(f"Found {len(index_files)} index databases to process") + + # Process each index database + all_results = [] + total_chunks = 0 + total_files_processed = 0 + total_files_failed = 0 + + for idx, index_path in enumerate(index_files, 1): + if progress_callback: + try: + rel_path = index_path.relative_to(index_root) + except ValueError: + rel_path = index_path + progress_callback(f"[{idx}/{len(index_files)}] Processing {rel_path}") + + result = generate_embeddings( + index_path, + model_profile=model_profile, + force=force, + chunk_size=chunk_size, + progress_callback=None, # Don't cascade callbacks + ) + + all_results.append({ + "path": str(index_path), + "success": result["success"], + "result": result.get("result"), + "error": result.get("error"), + }) + + if result["success"]: + data = result["result"] + total_chunks += data["chunks_created"] + total_files_processed += data["files_processed"] + total_files_failed += data["files_failed"] + + successful = sum(1 for r in all_results if r["success"]) + + return { + "success": successful > 0, + "result": { + "indexes_processed": len(index_files), + "indexes_successful": successful, + "indexes_failed": len(index_files) - successful, + "total_chunks_created": total_chunks, + "total_files_processed": total_files_processed, + "total_files_failed": total_files_failed, + "model_profile": model_profile, + "details": all_results, + }, + } + + +def get_embeddings_status(index_root: Path) -> Dict[str, any]: + """Get comprehensive embeddings coverage status for all indexes. 
+ + Args: + index_root: Root index directory + + Returns: + Aggregated status with coverage statistics + """ + index_files = discover_all_index_dbs(index_root) + + if not index_files: + return { + "success": True, + "result": { + "total_indexes": 0, + "total_files": 0, + "files_with_embeddings": 0, + "files_without_embeddings": 0, + "total_chunks": 0, + "coverage_percent": 0.0, + "indexes_with_embeddings": 0, + "indexes_without_embeddings": 0, + }, + } + + total_files = 0 + files_with_embeddings = 0 + total_chunks = 0 + indexes_with_embeddings = 0 + + for index_path in index_files: + status = check_index_embeddings(index_path) + if status["success"]: + result = status["result"] + total_files += result["total_files"] + files_with_embeddings += result["files_with_chunks"] + total_chunks += result["total_chunks"] + if result["has_embeddings"]: + indexes_with_embeddings += 1 + + return { + "success": True, + "result": { + "total_indexes": len(index_files), + "total_files": total_files, + "files_with_embeddings": files_with_embeddings, + "files_without_embeddings": total_files - files_with_embeddings, + "total_chunks": total_chunks, + "coverage_percent": round((files_with_embeddings / total_files * 100) if total_files > 0 else 0, 1), + "indexes_with_embeddings": indexes_with_embeddings, + "indexes_without_embeddings": len(index_files) - indexes_with_embeddings, + }, + } + + def get_embedding_stats_summary(index_root: Path) -> Dict[str, any]: """Get summary statistics for all indexes in root directory.