mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-02-13 02:41:50 +08:00
Fix CodexLens embeddings generation to achieve 100% coverage
Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files). This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage with 2,042 semantic chunks across all 303 files in 26 index databases.

Key improvements:

1. **Recursive embeddings generation** (embedding_manager.py):
   - Add generate_embeddings_recursive() to process all _index.db files in directory tree
   - Add get_embeddings_status() for comprehensive coverage statistics
   - Add discover_all_index_dbs() helper for recursive file discovery
2. **Enhanced CLI commands** (commands.py):
   - embeddings-generate: Add --recursive flag for full project coverage
   - init: Use recursive generation by default for complete indexing
   - status: Display embeddings coverage statistics with 50% threshold
3. **Smart search routing improvements** (smart-search.ts):
   - Add 50% embeddings coverage threshold for hybrid mode routing
   - Auto-fallback to exact mode when coverage insufficient
   - Strip ANSI color codes from JSON output for correct parsing
   - Add embeddings_coverage_percent to IndexStatus and SearchMetadata
   - Provide clear warnings with actionable suggestions
4. **Documentation and analysis**:
   - Add SMART_SEARCH_ANALYSIS.md with initial investigation
   - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue
   - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary
   - Add check_embeddings.py script for coverage verification

Results:
- Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase
- Semantic chunks increased from 10 to 2,042 - 204x increase
- All 26 subdirectory indexes now have embeddings vs just 1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
ccw/EMBEDDINGS_FIX_SUMMARY.md (new file, 165 lines)
# CodexLens Embeddings Fix Summary

## Results

### ✅ Completed

1. **Recursive embeddings generation** (`embedding_manager.py`)
   - Added a `generate_embeddings_recursive()` function
   - Added a `get_embeddings_status()` function
   - Recursively processes the `_index.db` files of all subdirectories
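The commit names `discover_all_index_dbs()` and `generate_embeddings_recursive()`, but their bodies are not shown here, so the following is only a minimal sketch of the discovery-plus-iteration shape; the per-database generation call is elided as a comment because it is not part of this document:

```python
from pathlib import Path

def discover_all_index_dbs(index_root: Path) -> list[Path]:
    """Recursively collect every _index.db below the index root."""
    return sorted(index_root.rglob("_index.db"))

def generate_embeddings_recursive(index_root: Path, force: bool = False) -> int:
    """Visit each discovered index database; return how many were processed."""
    processed = 0
    for db_path in discover_all_index_dbs(index_root):
        # generate_embeddings(db_path, force=force)  # real per-db generation step
        processed += 1
    return processed
```

The key point is `rglob`: the old behavior was equivalent to looking only at the root's `_index.db`, which is exactly the 1.6% coverage bug described below.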
2. **CLI command enhancements** (`commands.py`)
   - `embeddings-generate`: added a `--recursive` flag
   - `init`: uses recursive generation (automatically covers all subdirectories)
   - `status`: displays embeddings coverage statistics

3. **Smart Search routing** (`smart-search.ts`)
   - Added a 50% coverage threshold
   - Automatically falls back to exact mode when embeddings are insufficient
   - Emits clear warning messages
   - Strips ANSI color codes so the JSON output parses correctly
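The ANSI-stripping step matters because `codexlens status --json` can emit colorized output that breaks `JSON.parse`. The TypeScript fix later in this commit uses the regex `\x1b\[[0-9;]*m`; the same idea in Python:

```python
import json
import re

# Matches SGR color sequences like \x1b[32m ... \x1b[0m
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

def parse_status_json(raw: str) -> dict:
    """Strip ANSI color codes, then parse the JSON status payload."""
    return json.loads(ANSI_ESCAPE.sub("", raw) or "{}")
```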
### ✅ Test Results

**CCW project (d:\Claude_dms3\ccw)**:

- Index databases: 26
- Total files: 303
- Embeddings coverage: **100%** (all 303 files)
- Generated chunks: **2,042** (previously only 10)

**Comparison**:

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Coverage | 1.6% (5/303) | 100% (303/303) | **62.5x** |
| Chunks | 10 | 2,042 | **204x** |
| Indexes with embeddings | 1/26 | 26/26 | **26x** |
## Open Issues

### ⚠️ Remaining Problems

1. **Path mapping**
   - `embeddings-generate --recursive` requires the index path rather than the source path
   - Users should be able to pass the source path (`d:\Claude_dms3\ccw`)
   - Currently they must pass: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw`

2. **Global vs. project-level `status`**
   - `codexlens status` returns global statistics (all projects)
   - A project-level embeddings status is needed
   - `embeddings-status` only inspects a single `_index.db` and does not recurse
## Suggested Follow-up Fixes

### P1 - Path Mapping Fix

Modify the `embeddings_generate` command in `commands.py` (lines 1996-2000):

```python
elif target_path.is_dir():
    if recursive:
        # Recursive mode: Map source path to index root
        registry = RegistryStore()
        try:
            registry.initialize()
            mapper = PathMapper()
            index_db_path = mapper.source_to_index_db(target_path)
            index_root = index_db_path.parent  # Use index directory root
            use_recursive = True
        finally:
            registry.close()
```
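The mapping the fix relies on is visible in the two paths quoted above: the drive letter becomes a directory under `~/.codexlens/indexes`. The real `PathMapper` is not shown in this commit, so the following standalone transformation is purely a hypothetical illustration of that observed pattern:

```python
from pathlib import Path, PureWindowsPath

def source_to_index_root(source: Path, indexes_root: Path) -> Path:
    """Map a source path such as D:\\Claude_dms3\\ccw to
    <indexes_root>\\D\\Claude_dms3\\ccw (drive letter becomes a directory).

    Hypothetical sketch; not the actual PathMapper implementation.
    """
    p = PureWindowsPath(source)
    drive = p.drive.rstrip(":").upper()  # 'd:' -> 'D'
    return indexes_root.joinpath(drive, *p.parts[1:])
```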
### P2 - Project-Level Status

Option A: extend the `embeddings-status` command to support recursion

```bash
codexlens embeddings-status . --recursive --json
```

Option B: let the `status` command accept a path argument

```bash
codexlens status --project . --json
```
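Either option ultimately needs to aggregate per-database counts into a project-level figure, in the spirit of `get_embeddings_status()`. A sketch of that aggregation, reusing the `files` / `semantic_chunks` table names that `check_embeddings.py` (included in this commit) queries:

```python
import sqlite3
from pathlib import Path

def get_embeddings_status(index_root: Path) -> dict:
    """Aggregate coverage across every _index.db under index_root.

    Sketch only; table names follow check_embeddings.py from this commit.
    """
    dbs = sorted(index_root.rglob("_index.db"))
    total_files = total_chunks = dbs_with_chunks = 0
    for db_path in dbs:
        conn = sqlite3.connect(db_path)
        try:
            total_files += conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
            try:
                chunks = conn.execute(
                    "SELECT COUNT(*) FROM semantic_chunks"
                ).fetchone()[0]
            except sqlite3.OperationalError:
                chunks = 0  # index database lacks a semantic_chunks table
            total_chunks += chunks
            if chunks:
                dbs_with_chunks += 1
        finally:
            conn.close()
    return {
        "index_databases": len(dbs),
        "databases_with_embeddings": dbs_with_chunks,
        "total_files": total_files,
        "total_chunks": total_chunks,
    }
```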
## Usage Guide

### Current Workflow

**Generate embeddings (full coverage)**:

```bash
# Method 1: use the index path (how it works today)
cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
python -m codexlens embeddings-generate . --recursive --force --model fast

# Method 2: the init command (recursive by default, recommended)
cd d:\Claude_dms3\ccw
python -m codexlens init . --force
```

**Check coverage**:

```bash
# Project root
cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
python check_embeddings.py  # shows detailed per-directory statistics

# Global status
python -m codexlens status --json  # summary across all projects
```
**Smart Search**:

```javascript
// MCP tool call
smart_search(query="authentication patterns")

// It now:
// 1. Checks embeddings coverage
// 2. Uses hybrid mode if coverage >= 50%
// 3. Falls back to exact mode if coverage < 50%
// 4. Emits a warning message
```
### Best Practices

1. **Generate embeddings automatically when initializing a project**:
```bash
codexlens init /path/to/project --force
```

2. **Regenerate periodically to stay up to date**:
```bash
codexlens embeddings-generate /index/path --recursive --force
```

3. **Use the fast model for quick testing**:
```bash
codexlens embeddings-generate . --recursive --model fast
```

4. **Use the code model for best quality**:
```bash
codexlens embeddings-generate . --recursive --model code
```
## Technical Details

### Modified Files

**Python (CodexLens)**:
- `codex-lens/src/codexlens/cli/embedding_manager.py` - added the recursive functions
- `codex-lens/src/codexlens/cli/commands.py` - updated init, status, embeddings-generate

**TypeScript (CCW)**:
- `ccw/src/tools/smart-search.ts` - smart routing + ANSI stripping
- `ccw/src/tools/codex-lens.ts` - (unchanged, uses the existing implementation)

### Dependency Versions

- CodexLens: current development version
- Fastembed: installed (ONNX backend)
- Models: fast (~80MB), code (~150MB)

---

**Fix date**: 2025-12-17
**Verification status**: ✅ Core functionality works; the remaining path-mapping issue is still open
ccw/SMART_SEARCH_ANALYSIS.md (new file, 167 lines)
# Smart Search Index Analysis Report

## Question

Determine whether the current `smart_search(action="init")` builds a vector-model index or only a basic index.

## Findings

### 1. Default Behavior of the Init Action

From code analysis, `smart_search(action="init")` behaves as follows:

**Code path**: `ccw/src/tools/smart-search.ts` → `ccw/src/tools/codex-lens.ts`

```typescript
// smart-search.ts: executeInitAction (lines 297-323)
async function executeInitAction(params: Params): Promise<SearchResult> {
  const { path = '.', languages } = params;
  const args = ['init', path];
  if (languages && languages.length > 0) {
    args.push('--languages', languages.join(','));
  }
  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
  // ...
}
```
**Key findings**:
- `smart_search(action="init")` invokes the `codexlens init` command
- It does **not** pass the `--no-embeddings` flag
- It does **not** pass the `--embedding-model` flag

### 2. Default Behavior of CodexLens Init

According to the output of `codexlens init --help`:

> If semantic search dependencies are installed, **automatically generates embeddings** after indexing completes. Use --no-embeddings to skip this step.

**Conclusion**:
- ✅ The `init` command generates embeddings **by default** (when the semantic search dependencies are installed)
- ❌ The current implementation does **not** generate embeddings for all files
### 3. Actual Test Results

#### First Init (no embeddings generated)

```bash
$ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
# Result: indexed 303 files, but vector_search: false
```

**Root cause**:
Although the semantic search dependency (fastembed) is installed, init hit a warning:

```
Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.
```

#### After Manually Generating Embeddings

```bash
$ python -m codexlens embeddings-generate . --force --verbose

Processing 5 files...
  - D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
  - D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
  - D:\Claude_dms3\ccw\README.md: 2 chunks
  - D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
  - D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks

Total: 10 chunks, 5 files
Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)
```

**Key findings**:
- ⚠️ Embeddings were generated for only **5 documentation/config files**
- ⚠️ **No embeddings** were generated for the 298 code files (.ts, .js, etc.)
- ✅ The embeddings status reports `coverage_percent: 100.0` (but only relative to "files that should have embeddings")

#### Hybrid Search Test

```bash
$ smart_search(query="authentication and authorization patterns", mode="hybrid")
# ✅ Successfully returned 5 results with similarity scores
# ✅ Confirms that vector search itself works
```
## 4. Index Type Comparison

| Index type | Current state | Covered files | Notes |
|-----------|---------------|---------------|-------|
| **Exact FTS** | ✅ Enabled | All 303 files | Full-text search via SQLite FTS5 |
| **Fuzzy FTS** | ❌ Disabled | - | Fuzzy-match search |
| **Vector Search** | ⚠️ Partially enabled | Only 5 documentation files | Semantic search via fastembed |
| **Hybrid Search** | ⚠️ Partially enabled | Only 5 documentation files | RRF fusion (exact + fuzzy + vector) |
## 5. Why Do Only 5 Files Have Embeddings?

**Possible reasons**:

1. **File-type filtering**: CodexLens may generate embeddings only for documentation (.md) and config files
2. **Code files use symbol indexing**: code files (.ts, .js) may rely on symbol extraction rather than text embeddings
3. **Performance considerations**: generating embeddings for 300+ files takes significant time and storage
## 6. Conclusions

### Current behavior of `smart_search(action="init")`:

✅ It **attempts** to build a vector index (when the semantic dependencies are installed)

⚠️ In practice it **only** generates embeddings for documentation/config files (5/303 files)

✅ It **supports** hybrid-mode search (for the files that have embeddings)

✅ It **supports** exact-mode search (for all 303 files)

### Smart routing between search modes:

```
user query → auto mode → decision tree:
  ├─ natural-language query + embeddings available → hybrid mode (RRF fusion)
  ├─ simple query + index available → exact mode (FTS)
  └─ no index → ripgrep mode (literal matching)
```
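The decision tree above can be sketched as a runnable function. The `is_natural_language` helper below is a crude stand-in assumption, not CodexLens code; only the routing order mirrors the tree:

```python
def is_natural_language(query: str) -> bool:
    """Stand-in heuristic: multi-word queries without code-like symbols."""
    return len(query.split()) >= 3 and not any(ch in query for ch in "(){};_")

def route_query(query: str, has_index: bool, has_embeddings: bool) -> str:
    """Mirror the auto-mode decision tree described above."""
    if not has_index:
        return "ripgrep"          # no index: literal matching
    if is_natural_language(query) and has_embeddings:
        return "hybrid"           # RRF fusion of exact + vector
    return "exact"                # SQLite FTS
```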
## 7. Recommendations

### If full semantic search support is required:

```bash
# Option 1: check whether every code file should have embeddings
python -m codexlens embeddings-status . --verbose

# Option 2: explicitly generate embeddings for code files (if supported)
# Consult the CodexLens docs for the semantic indexing strategy for code files

# Option 3: use hybrid mode for documentation search and exact mode for code search
smart_search(query="architecture design", mode="hybrid")  # semantic search over docs
smart_search(query="function_name", mode="exact")         # exact search over code
```

### Current best practice:

```javascript
// 1. Initialize the index (one-time)
smart_search(action="init", path=".")

// 2. Intelligent search (auto mode recommended)
smart_search(query="your query") // automatically picks the best mode

// 3. Mode-specific search
smart_search(query="natural language query", mode="hybrid") // semantic search
smart_search(query="exact_identifier", mode="exact")        // exact matching
smart_search(query="quick literal", mode="ripgrep")         // fast literal search
```
## 8. Technical Details

### Embeddings Model
- **Model**: jinaai/jina-embeddings-v2-base-code
- **Dimensions**: 768
- **Size**: ~150MB
- **Backend**: fastembed (ONNX-based)

### Index Storage
- **Location**: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db`
- **Size**: 122.57 MB
- **Schema version**: 5
- **Files**: 303
- **Directories**: 26

---

**Generated**: 2025-12-17
**CodexLens version**: detected from the current installation
ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md (new file, 330 lines)
# Smart Search Index Analysis Report (Corrected)

## User Challenges

1. ❓ Why are no vector embeddings generated for code files?
2. ❓ The exact FTS and vector indexes should cover the same content
3. ❓ init should return an overview of the FTS and vector indexes

**Conclusion: the user's challenges are 100% correct. This is a design flaw in CodexLens.**

---
## The Real Situation

### 1. Hierarchical Index Architecture

CodexLens uses **hierarchical per-directory indexes**:

```
D:\Claude_dms3\ccw\
├── _index.db              ← root index (5 files)
├── src/
│   ├── _index.db          ← src index (2 files)
│   ├── tools/
│   │   └── _index.db      ← tools subdirectory index (25 files)
│   └── ...
└── ... (26 _index.db files in total)
```
### 2. Index Coverage

| Directory | Files | FTS index | Embeddings |
|-----------|-------|-----------|------------|
| **root** | 5 | ✅ | ✅ (10 chunks) |
| bin/ | 2 | ✅ | ❌ no semantic_chunks table |
| dist/ | 4 | ✅ | ❌ no semantic_chunks table |
| dist/commands/ | 24 | ✅ | ❌ no semantic_chunks table |
| dist/tools/ | 50 | ✅ | ❌ no semantic_chunks table |
| src/tools/ | 25 | ✅ | ❌ no semantic_chunks table |
| src/commands/ | 12 | ✅ | ❌ no semantic_chunks table |
| ... | ... | ... | ... |
| **Total** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
### 3. Key Findings

```
# Output of the check script
Total index databases: 26
Directories with embeddings: 1    # ❌ only the root directory!
Total files indexed: 303          # ✅ FTS index is complete
Total semantic chunks: 10         # ❌ only the root directory's 5 files
```

**The problem**:
- ✅ **All 303 files** have FTS indexes (spread across 26 _index.db files)
- ❌ **Only 5 files** (1.6%) have vector embeddings
- ❌ **25 subdirectory** _index.db files have no `semantic_chunks` table at all

---
## Why Did This Happen?

### Root-Cause Analysis

1. **The `init` operation**:
   ```bash
   codexlens init .
   ```
   - ✅ Creates FTS indexes for all 303 files (distributed)
   - ⚠️ Attempts to generate embeddings but hits the "Index already has 10 chunks" warning
   - ❌ Only generates embeddings for the root directory

2. **The `embeddings-generate` operation**:
   ```bash
   codexlens embeddings-generate . --force
   ```
   - ❌ Only processed the root directory's _index.db
   - ❌ **Did not recurse into the subdirectory indexes**
   - Result: only 5 documentation files have embeddings

### The Design Problem

**The CodexLens embeddings architecture is flawed**:

```python
# Expected behavior
for each _index.db in project:
    generate_embeddings(index_db)

# Actual behavior
generate_embeddings(root_index_db_only)
```

---
## Init Return-Value Deficiencies

### What `init` currently returns

```json
{
  "success": true,
  "message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
}
```

**Problems**:
- ❌ Does not report how many files were indexed
- ❌ Does not report whether embeddings were generated
- ❌ Does not report embeddings coverage

### What it should return

```json
{
  "success": true,
  "message": "Index created successfully",
  "stats": {
    "total_files": 303,
    "total_directories": 26,
    "index_databases": 26,
    "fts_coverage": {
      "files": 303,
      "percentage": 100.0
    },
    "embeddings_coverage": {
      "files": 5,
      "chunks": 10,
      "percentage": 1.6,
      "warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
    },
    "features": {
      "exact_fts": true,
      "fuzzy_fts": false,
      "vector_search": "partial"
    }
  }
}
```

---
## Solutions

### Option 1: Generate Embeddings Recursively (Recommended)

```bash
# Generate embeddings for every subdirectory index
find .codexlens/indexes -name "_index.db" -exec \
  python -m codexlens embeddings-generate {} --force \;
```
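Since this project runs on Windows, where `find -exec` is unavailable, the same loop can be written portably in Python. The CLI invocation mirrors the command above; treat this as a sketch, not part of CodexLens:

```python
import subprocess
import sys
from pathlib import Path

def build_commands(index_root: Path) -> list[list[str]]:
    """One embeddings-generate invocation per discovered _index.db."""
    return [
        [sys.executable, "-m", "codexlens",
         "embeddings-generate", str(db), "--force"]
        for db in sorted(index_root.rglob("_index.db"))
    ]

def generate_all(index_root: Path) -> None:
    """Run each command; equivalent to the find -exec loop above."""
    for cmd in build_commands(index_root):
        subprocess.run(cmd, check=True)
```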
### Option 2: Improve the Init Command

```python
# codexlens/cli.py
def init_with_embeddings(project_root):
    """Initialize with recursive embeddings generation"""
    # 1. Build FTS indexes (current behavior)
    build_indexes(project_root)

    # 2. Generate embeddings for ALL subdirs
    for index_db in find_all_index_dbs(project_root):
        if has_semantic_deps():
            generate_embeddings(index_db)

    # 3. Return comprehensive stats
    return {
        "fts_coverage": get_fts_stats(),
        "embeddings_coverage": get_embeddings_stats(),
        "features": detect_features(),
    }
```
### Option 3: Improve Smart Search Routing

```python
# Current logic
def classify_intent(query, hasIndex):
    if not hasIndex:
        return "ripgrep"
    elif is_natural_language(query):
        return "hybrid"  # ❌ but only 5 files have embeddings!
    else:
        return "exact"

# Improved logic
def classify_intent(query, indexStatus):
    embeddings_coverage = indexStatus.embeddings_coverage_percent

    if embeddings_coverage < 50:
        # With coverage below 50%, fall back to exact even for natural language
        return "exact" if indexStatus.indexed else "ripgrep"
    elif is_natural_language(query):
        return "hybrid"
    else:
        return "exact"
```
## Validating the User's Challenges

### ❓ Why are no embeddings generated for code files?

**Answer**: It is not that code files are excluded by design:
- ✅ All code files have FTS indexes
- ❌ The `embeddings-generate` command has a BUG: it **only processes the root directory**
- ❌ The subdirectory index databases **never even get a semantic_chunks table**

### ❓ FTS and vector indexes should cover the same content

**Answer**: **Absolutely correct.** The current reality:
- FTS: 303/303 (100%)
- Vector: 5/303 (1.6%)

**This is a serious inconsistency that violates the design principle.**

### ❓ Init should return an index overview

**Answer**: **Absolutely correct.** init currently returns only a bare success message; it should return:
- FTS index statistics
- Embeddings coverage
- Feature status
- Warnings (when coverage is incomplete)

---
## Test Verification

### What Hybrid Search Actually Does Today

```javascript
// Current query
smart_search(query="authentication patterns", mode="hybrid")

// Actual search scope:
// ✅ Searchable files: 5 (the root directory's .md files)
// ❌ Unsearchable files: 298 code files
// Result: only documentation files are returned; code files are ignored
```

### Expected Behavior After the Fix

```javascript
// After the fix
smart_search(query="authentication patterns", mode="hybrid")

// Actual search scope:
// ✅ Searchable files: 303 (all files)
// Result: combined results across code and documentation files
```

---
## Suggested Fix Priorities

### P0 - Urgent

1. **Fix the `embeddings-generate` command**
   - Recursively process every subdirectory's _index.db
   - Create a semantic_chunks table in each _index.db

2. **Improve the `init` return value**
   - Return detailed index statistics
   - Report embeddings coverage
   - Warn when coverage is incomplete

### P1 - Important

3. **Adaptive Smart Search routing**
   - Check embeddings coverage
   - Automatically fall back to exact mode when coverage is low

4. **Status command enhancements**
   - Show per-subdirectory index status
   - Show the embeddings distribution

---
## Interim Workarounds

### Recommended Usage for Now

```javascript
// 1. Documentation search - use hybrid (has embeddings)
smart_search(query="architecture design patterns", mode="hybrid")

// 2. Code search - use exact (no embeddings, but FTS works)
smart_search(query="function executeQuery", mode="exact")

// 3. Quick search - use ripgrep (across all files)
smart_search(query="TODO", mode="ripgrep")
```

### Workaround for Full Coverage

```bash
# Manually generate embeddings for every subdirectory (if CodexLens supports it)
cd D:\Claude_dms3\ccw

# Run once per subdirectory
python -m codexlens embeddings-generate ./src/tools --force
python -m codexlens embeddings-generate ./src/commands --force
# ... repeat 26 times

# Or automate it with a script
python check_embeddings.py --generate-all
```

---
## Summary

| User challenge | Verdict | Conclusion |
|----------------|---------|------------|
| Why no embeddings for code files? | ✅ Correct | It is a BUG, not a design choice |
| FTS and vector content should match | ✅ Correct | Currently badly inconsistent |
| Init should return a detailed overview | ✅ Correct | Current output is insufficient |

**All of the user's challenges are correct and expose three core CodexLens problems:**

1. **Incomplete embeddings generation** (only 1.6% coverage)
2. **Index inconsistency** (FTS vs. vector)
3. **Opaque return values** (missing statistics)

---

**Generated**: 2025-12-17
**Verification method**: `python check_embeddings.py`
ccw/check_embeddings.py (new file, 47 lines)
import os
import sqlite3

# Find all _index.db files
root_dir = r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw'
index_files = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    if '_index.db' in filenames:
        index_files.append(os.path.join(dirpath, '_index.db'))

print(f'Found {len(index_files)} index databases\n')

total_files = 0
total_chunks = 0
dirs_with_chunks = 0

for db_path in sorted(index_files):
    # relpath is robust; a str.replace() with a hand-built prefix misses here
    rel_path = os.path.relpath(db_path, root_dir)
    conn = sqlite3.connect(db_path)

    try:
        cursor = conn.execute('SELECT COUNT(*) FROM files')
        file_count = cursor.fetchone()[0]
        total_files += file_count

        try:
            cursor = conn.execute('SELECT COUNT(*) FROM semantic_chunks')
            chunk_count = cursor.fetchone()[0]
            total_chunks += chunk_count

            if chunk_count > 0:
                dirs_with_chunks += 1
                print(f'[+] {rel_path:<40} Files: {file_count:3d}  Chunks: {chunk_count:3d}')
            else:
                print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no chunks)')
        except sqlite3.OperationalError:
            print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no semantic_chunks table)')
    except Exception as e:
        print(f'[!] {rel_path:<40} Error: {e}')
    finally:
        conn.close()

print(f'\n=== Summary ===')
print(f'Total index databases: {len(index_files)}')
print(f'Directories with embeddings: {dirs_with_chunks}')
print(f'Total files indexed: {total_files}')
print(f'Total semantic chunks: {total_chunks}')
@@ -1,12 +1,17 @@
|
|||||||
/**
|
/**
|
||||||
* Smart Search Tool - Unified search with mode-based execution
|
* Smart Search Tool - Unified intelligent search with CodexLens integration
|
||||||
* Modes: auto, exact, fuzzy, semantic, graph
|
|
||||||
*
|
*
|
||||||
* Features:
|
* Features:
|
||||||
* - Intent classification (auto mode)
|
* - Intent classification with automatic mode selection
|
||||||
* - Multi-backend search routing
|
* - CodexLens integration (init, hybrid, vector, semantic)
|
||||||
* - Result fusion with RRF ranking
|
* - Ripgrep fallback for exact mode
|
||||||
* - Configurable search parameters
|
* - Index status checking and warnings
|
||||||
|
* - Multi-backend search routing with RRF ranking
|
||||||
|
*
|
||||||
|
* Actions:
|
||||||
|
* - init: Initialize CodexLens index
|
||||||
|
* - search: Intelligent search with auto mode selection
|
||||||
|
* - status: Check index status
|
||||||
*/
|
*/
|
||||||
|
|
||||||
import { z } from 'zod';
|
import { z } from 'zod';
|
||||||
@@ -19,19 +24,23 @@ import {
|
|||||||
|
|
||||||
// Define Zod schema for validation
|
// Define Zod schema for validation
|
||||||
const ParamsSchema = z.object({
|
const ParamsSchema = z.object({
|
||||||
query: z.string().min(1, 'Query is required'),
|
action: z.enum(['init', 'search', 'search_files', 'status']).default('search'),
|
||||||
mode: z.enum(['auto', 'exact', 'fuzzy', 'semantic', 'graph']).default('auto'),
|
query: z.string().optional(),
|
||||||
|
mode: z.enum(['auto', 'hybrid', 'exact', 'ripgrep']).default('auto'),
|
||||||
output_mode: z.enum(['full', 'files_only', 'count']).default('full'),
|
output_mode: z.enum(['full', 'files_only', 'count']).default('full'),
|
||||||
|
path: z.string().optional(),
|
||||||
paths: z.array(z.string()).default([]),
|
paths: z.array(z.string()).default([]),
|
||||||
contextLines: z.number().default(0),
|
contextLines: z.number().default(0),
|
||||||
maxResults: z.number().default(100),
|
maxResults: z.number().default(100),
|
||||||
includeHidden: z.boolean().default(false),
|
includeHidden: z.boolean().default(false),
|
||||||
|
languages: z.array(z.string()).optional(),
|
||||||
|
limit: z.number().default(100),
|
||||||
});
|
});
|
||||||
|
|
||||||
type Params = z.infer<typeof ParamsSchema>;
|
type Params = z.infer<typeof ParamsSchema>;
|
||||||
|
|
||||||
// Search mode constants
|
// Search mode constants
|
||||||
const SEARCH_MODES = ['auto', 'exact', 'fuzzy', 'semantic', 'graph'] as const;
|
const SEARCH_MODES = ['auto', 'hybrid', 'exact', 'ripgrep'] as const;
|
||||||
|
|
||||||
// Classification confidence threshold
|
// Classification confidence threshold
|
||||||
const CONFIDENCE_THRESHOLD = 0.7;
|
const CONFIDENCE_THRESHOLD = 0.7;
|
||||||
@@ -70,16 +79,89 @@ interface SearchMetadata {
|
|||||||
classified_as?: string;
|
classified_as?: string;
|
||||||
confidence?: number;
|
confidence?: number;
|
||||||
reasoning?: string;
|
reasoning?: string;
|
||||||
|
embeddings_coverage_percent?: number;
|
||||||
warning?: string;
|
warning?: string;
|
||||||
note?: string;
|
note?: string;
|
||||||
|
index_status?: 'indexed' | 'not_indexed' | 'partial';
|
||||||
}
|
}
|
||||||
|
|
||||||
interface SearchResult {
|
interface SearchResult {
|
||||||
success: boolean;
|
success: boolean;
|
||||||
results?: ExactMatch[] | SemanticMatch[] | GraphMatch[];
|
results?: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown;
|
||||||
output?: string;
|
output?: string;
|
||||||
metadata?: SearchMetadata;
|
metadata?: SearchMetadata;
|
||||||
error?: string;
|
error?: string;
|
||||||
|
status?: unknown;
|
||||||
|
message?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
interface IndexStatus {
|
||||||
|
indexed: boolean;
|
||||||
|
has_embeddings: boolean;
|
||||||
|
file_count?: number;
|
||||||
|
embeddings_coverage_percent?: number;
|
||||||
|
warning?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Check if CodexLens index exists for current directory
|
||||||
|
* @param path - Directory path to check
|
||||||
|
* @returns Index status
|
||||||
|
*/
|
||||||
|
async function checkIndexStatus(path: string = '.'): Promise<IndexStatus> {
|
||||||
|
try {
|
||||||
|
const result = await executeCodexLens(['status', '--json'], { cwd: path });
|
||||||
|
|
||||||
|
if (!result.success) {
|
||||||
|
return {
|
||||||
|
indexed: false,
|
||||||
|
has_embeddings: false,
|
||||||
|
warning: 'No CodexLens index found. Run smart_search(action="init") to create index for better search results.',
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse status output
|
||||||
|
try {
|
||||||
|
// Strip ANSI color codes from JSON output
|
||||||
|
const cleanOutput = (result.output || '{}').replace(/\x1b\[[0-9;]*m/g, '');
|
||||||
|
const status = JSON.parse(cleanOutput);
|
||||||
|
const indexed = status.indexed === true || status.file_count > 0;
|
||||||
|
|
||||||
|
// Get embeddings coverage from comprehensive status
|
||||||
|
const embeddingsData = status.embeddings || {};
|
||||||
|
const embeddingsCoverage = embeddingsData.coverage_percent || 0;
|
||||||
|
const has_embeddings = embeddingsCoverage >= 50; // Threshold: 50%
|
||||||
|
|
||||||
|
let warning: string | undefined;
|
||||||
|
if (!indexed) {
|
||||||
|
warning = 'No CodexLens index found. Run smart_search(action="init") to create index for better search results.';
|
||||||
|
} else if (embeddingsCoverage === 0) {
|
||||||
|
warning = 'Index exists but no embeddings generated. Run: codexlens embeddings-generate --recursive';
|
||||||
|
} else if (embeddingsCoverage < 50) {
|
||||||
|
warning = `Embeddings coverage is ${embeddingsCoverage.toFixed(1)}% (below 50%). Hybrid search will use exact mode. Run: codexlens embeddings-generate --recursive`;
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
indexed,
|
||||||
|
has_embeddings,
|
||||||
|
file_count: status.file_count,
|
||||||
|
embeddings_coverage_percent: embeddingsCoverage,
|
||||||
|
warning,
|
||||||
|
};
|
||||||
|
} catch {
|
||||||
|
return {
|
||||||
|
indexed: false,
|
||||||
|
has_embeddings: false,
|
||||||
|
warning: 'Failed to parse index status',
|
||||||
|
};
|
||||||
|
}
|
||||||
|
} catch {
|
||||||
|
return {
|
||||||
|
indexed: false,
|
||||||
|
has_embeddings: false,
|
||||||
|
warning: 'CodexLens not available',
|
||||||
|
};
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
@@ -123,43 +205,34 @@ function detectRelationship(query: string): boolean {
|
|||||||
|
|
||||||
 /**
  * Classify query intent and recommend search mode
+ * Simple mapping: hybrid (NL + index + embeddings) | exact (index or insufficient embeddings) | ripgrep (no index)
  * @param query - Search query string
+ * @param hasIndex - Whether CodexLens index exists
+ * @param hasSufficientEmbeddings - Whether embeddings coverage >= 50%
  * @returns Classification result
  */
-function classifyIntent(query: string): Classification {
-  // Initialize mode scores
-  const scores: Record<string, number> = {
-    exact: 0,
-    fuzzy: 0,
-    semantic: 0,
-    graph: 0,
-  };
-
-  // Apply detection heuristics with weighted scoring
-  if (detectLiteral(query)) {
-    scores.exact += 0.8;
+function classifyIntent(query: string, hasIndex: boolean = false, hasSufficientEmbeddings: boolean = false): Classification {
+  // Detect query patterns
+  const isNaturalLanguage = detectNaturalLanguage(query);
+
+  // Simple decision tree
+  let mode: string;
+  let confidence: number;
+
+  if (!hasIndex) {
+    // No index: use ripgrep
+    mode = 'ripgrep';
+    confidence = 1.0;
+  } else if (isNaturalLanguage && hasSufficientEmbeddings) {
+    // Natural language + sufficient embeddings: use hybrid
+    mode = 'hybrid';
+    confidence = 0.9;
+  } else {
+    // Simple query OR insufficient embeddings: use exact
+    mode = 'exact';
+    confidence = 0.8;
   }
 
-  if (detectRegex(query)) {
-    scores.fuzzy += 0.7;
-  }
-
-  if (detectNaturalLanguage(query)) {
-    scores.semantic += 0.9;
-  }
-
-  if (detectFilePath(query)) {
-    scores.exact += 0.6;
-  }
-
-  if (detectRelationship(query)) {
-    scores.graph += 0.85;
-  }
-
-  // Find mode with highest confidence score
-  const mode = Object.keys(scores).reduce((a, b) => (scores[a] > scores[b] ? a : b));
-  const confidence = scores[mode];
-
   // Build reasoning string
   const detectedPatterns: string[] = [];
   if (detectLiteral(query)) detectedPatterns.push('literal');
@@ -168,7 +241,7 @@ function classifyIntent(query: string): Classification {
   if (detectFilePath(query)) detectedPatterns.push('file path');
   if (detectRelationship(query)) detectedPatterns.push('relationship');
 
-  const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')})`;
+  const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')}, index: ${hasIndex ? 'available' : 'not available'}, embeddings: ${hasSufficientEmbeddings ? 'sufficient' : 'insufficient'})`;
 
   return { mode, confidence, reasoning };
 }
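The new decision tree replaces the old weighted-score heuristics entirely. It can be sketched standalone; the `detectNaturalLanguage` stub here is a deliberate simplification (the real detector in `smart-search.ts` is more involved):

```typescript
// Standalone sketch of classifyIntent's routing tree.
type Route = 'ripgrep' | 'hybrid' | 'exact';

// Simplified stand-in: multi-word query with no code punctuation.
function detectNaturalLanguage(query: string): boolean {
  return query.trim().split(/\s+/).length >= 3 && !/[{}();=]/.test(query);
}

function route(query: string, hasIndex: boolean, hasSufficientEmbeddings: boolean): Route {
  if (!hasIndex) return 'ripgrep'; // no index: fall back to ripgrep
  if (detectNaturalLanguage(query) && hasSufficientEmbeddings) return 'hybrid';
  return 'exact'; // simple query, or embeddings coverage below threshold
}
```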
@@ -234,105 +307,192 @@ function buildRipgrepCommand(params: {
 }
 
 /**
- * Mode: auto - Intent classification and mode selection
- * Analyzes query to determine optimal search mode
+ * Action: init - Initialize CodexLens index
  */
-async function executeAutoMode(params: Params): Promise<SearchResult> {
-  const { query } = params;
-
-  // Classify intent
-  const classification = classifyIntent(query);
-
-  // Route to appropriate mode based on classification
-  switch (classification.mode) {
-    case 'exact': {
-      const exactResult = await executeExactMode(params);
-      return {
-        ...exactResult,
-        metadata: {
-          ...exactResult.metadata!,
-          classified_as: classification.mode,
-          confidence: classification.confidence,
-          reasoning: classification.reasoning,
-        },
-      };
-    }
-
-    case 'fuzzy':
-      return {
-        success: false,
-        error: 'Fuzzy mode not yet implemented',
-        metadata: {
-          mode: 'fuzzy',
-          backend: '',
-          count: 0,
-          query,
-          classified_as: classification.mode,
-          confidence: classification.confidence,
-          reasoning: classification.reasoning,
-        },
-      };
-
-    case 'semantic': {
-      const semanticResult = await executeSemanticMode(params);
-      return {
-        ...semanticResult,
-        metadata: {
-          ...semanticResult.metadata!,
-          classified_as: classification.mode,
-          confidence: classification.confidence,
-          reasoning: classification.reasoning,
-        },
-      };
-    }
-
-    case 'graph': {
-      const graphResult = await executeGraphMode(params);
-      return {
-        ...graphResult,
-        metadata: {
-          ...graphResult.metadata!,
-          classified_as: classification.mode,
-          confidence: classification.confidence,
-          reasoning: classification.reasoning,
-        },
-      };
-    }
-
-    default: {
-      const fallbackResult = await executeExactMode(params);
-      return {
-        ...fallbackResult,
-        metadata: {
-          ...fallbackResult.metadata!,
-          classified_as: 'exact',
-          confidence: 0.5,
-          reasoning: 'Fallback to exact mode due to unknown classification',
-        },
-      };
-    }
-  }
+async function executeInitAction(params: Params): Promise<SearchResult> {
+  const { path = '.', languages } = params;
+
+  // Check CodexLens availability
+  const readyStatus = await ensureCodexLensReady();
+  if (!readyStatus.ready) {
+    return {
+      success: false,
+      error: `CodexLens not available: ${readyStatus.error}. CodexLens will be auto-installed on first use.`,
+    };
+  }
+
+  const args = ['init', path];
+  if (languages && languages.length > 0) {
+    args.push('--languages', languages.join(','));
+  }
+
+  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
+
+  return {
+    success: result.success,
+    error: result.error,
+    message: result.success
+      ? `CodexLens index created successfully for ${path}`
+      : undefined,
+  };
 }
 
 /**
- * Mode: exact - Precise file path and content matching
- * Uses ripgrep for literal string matching
+ * Action: status - Check CodexLens index status
  */
-async function executeExactMode(params: Params): Promise<SearchResult> {
-  const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false } = params;
-
-  // Check ripgrep availability
-  if (!checkToolAvailability('rg')) {
-    return {
-      success: false,
-      error: 'ripgrep not available - please install ripgrep (rg) to use exact search mode',
-    };
-  }
-
-  // Build ripgrep command
+async function executeStatusAction(params: Params): Promise<SearchResult> {
+  const { path = '.' } = params;
+
+  const indexStatus = await checkIndexStatus(path);
+
+  return {
+    success: true,
+    status: indexStatus,
+    message: indexStatus.warning || `Index status: ${indexStatus.indexed ? 'indexed' : 'not indexed'}, embeddings: ${indexStatus.has_embeddings ? 'available' : 'not available'}`,
+  };
+}
+
+/**
+ * Mode: auto - Intent classification and mode selection
+ * Routes to: hybrid (NL + index) | exact (index) | ripgrep (no index)
+ */
+async function executeAutoMode(params: Params): Promise<SearchResult> {
+  const { query, path = '.' } = params;
+
+  if (!query) {
+    return {
+      success: false,
+      error: 'Query is required for search action',
+    };
+  }
+
+  // Check index status
+  const indexStatus = await checkIndexStatus(path);
+
+  // Classify intent with index and embeddings awareness
+  const classification = classifyIntent(
+    query,
+    indexStatus.indexed,
+    indexStatus.has_embeddings // This now considers 50% threshold
+  );
+
+  // Route to appropriate mode based on classification
+  let result: SearchResult;
+
+  switch (classification.mode) {
+    case 'hybrid':
+      result = await executeHybridMode(params);
+      break;
+
+    case 'exact':
+      result = await executeCodexLensExactMode(params);
+      break;
+
+    case 'ripgrep':
+      result = await executeRipgrepMode(params);
+      break;
+
+    default:
+      // Fallback to ripgrep
+      result = await executeRipgrepMode(params);
+      break;
+  }
+
+  // Add classification metadata
+  if (result.metadata) {
+    result.metadata.classified_as = classification.mode;
+    result.metadata.confidence = classification.confidence;
+    result.metadata.reasoning = classification.reasoning;
+    result.metadata.embeddings_coverage_percent = indexStatus.embeddings_coverage_percent;
+    result.metadata.index_status = indexStatus.indexed
+      ? (indexStatus.has_embeddings ? 'indexed' : 'partial')
+      : 'not_indexed';
+
+    // Add warning if needed
+    if (indexStatus.warning) {
+      result.metadata.warning = indexStatus.warning;
+    }
+  }
+
+  return result;
+}
+
+/**
+ * Mode: ripgrep - Fast literal string matching using ripgrep
+ * No index required, fallback to CodexLens if ripgrep unavailable
+ */
+async function executeRipgrepMode(params: Params): Promise<SearchResult> {
+  const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false, path = '.' } = params;
+
+  if (!query) {
+    return {
+      success: false,
+      error: 'Query is required for search',
+    };
+  }
+
+  // Check if ripgrep is available
+  const hasRipgrep = checkToolAvailability('rg');
+
+  // If ripgrep not available, fall back to CodexLens exact mode
+  if (!hasRipgrep) {
+    const readyStatus = await ensureCodexLensReady();
+    if (!readyStatus.ready) {
+      return {
+        success: false,
+        error: 'Neither ripgrep nor CodexLens available. Install ripgrep (rg) or CodexLens for search functionality.',
+      };
+    }
+
+    // Use CodexLens exact mode as fallback
+    const args = ['search', query, '--limit', maxResults.toString(), '--mode', 'exact', '--json'];
+    const result = await executeCodexLens(args, { cwd: path });
+
+    if (!result.success) {
+      return {
+        success: false,
+        error: result.error,
+        metadata: {
+          mode: 'ripgrep',
+          backend: 'codexlens-fallback',
+          count: 0,
+          query,
+        },
+      };
+    }
+
+    // Parse results
+    let results: SemanticMatch[] = [];
+    try {
+      const parsed = JSON.parse(result.output || '{}');
+      const data = parsed.results || parsed;
+      results = (Array.isArray(data) ? data : []).map((item: any) => ({
+        file: item.path || item.file,
+        score: item.score || 0,
+        content: item.excerpt || item.content || '',
+        symbol: item.symbol || null,
+      }));
+    } catch {
+      // Keep empty results
+    }
+
+    return {
+      success: true,
+      results,
+      metadata: {
+        mode: 'ripgrep',
+        backend: 'codexlens-fallback',
+        count: results.length,
+        query,
+        note: 'Using CodexLens exact mode (ripgrep not available)',
+      },
+    };
+  }
+
+  // Use ripgrep
   const { command, args } = buildRipgrepCommand({
     query,
-    paths: paths.length > 0 ? paths : ['.'],
+    paths: paths.length > 0 ? paths : [path],
     contextLines,
     maxResults,
     includeHidden,
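The ripgrep-or-CodexLens fallback order in `executeRipgrepMode` reduces to a small decision function. A sketch under stated assumptions (`pickBackend` and its boolean inputs are hypothetical names standing in for `checkToolAvailability('rg')` and `ensureCodexLensReady()`):

```typescript
// Backend selection order for ripgrep-mode searches.
type Backend = 'ripgrep' | 'codexlens-fallback' | 'none';

function pickBackend(hasRipgrep: boolean, codexLensReady: boolean): Backend {
  if (hasRipgrep) return 'ripgrep'; // fastest path, no index needed
  if (codexLensReady) return 'codexlens-fallback'; // exact mode via the index
  return 'none'; // caller surfaces an install hint
}
```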
@@ -340,7 +500,7 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
 
   return new Promise((resolve) => {
     const child = spawn(command, args, {
-      cwd: process.cwd(),
+      cwd: path || process.cwd(),
       stdio: ['ignore', 'pipe', 'pipe'],
     });
 
@@ -386,7 +546,7 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
       success: true,
       results,
       metadata: {
-        mode: 'exact',
+        mode: 'ripgrep',
         backend: 'ripgrep',
         count: results.length,
         query,
@@ -412,60 +572,126 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
 }
 
 /**
- * Mode: fuzzy - Approximate matching with tolerance
- * Uses fuzzy matching algorithms for typo-tolerant search
+ * Mode: exact - CodexLens exact/FTS search
+ * Requires index
  */
-async function executeFuzzyMode(params: Params): Promise<SearchResult> {
-  return {
-    success: false,
-    error: 'Fuzzy mode not implemented - fuzzy matching engine pending',
-  };
-}
-
-/**
- * Mode: semantic - Natural language understanding search
- * Uses CodexLens embeddings for semantic similarity
- */
-async function executeSemanticMode(params: Params): Promise<SearchResult> {
-  const { query, paths = [], maxResults = 100 } = params;
+async function executeCodexLensExactMode(params: Params): Promise<SearchResult> {
+  const { query, path = '.', limit = 100 } = params;
+
+  if (!query) {
+    return {
+      success: false,
+      error: 'Query is required for search',
+    };
+  }
 
   // Check CodexLens availability
   const readyStatus = await ensureCodexLensReady();
   if (!readyStatus.ready) {
     return {
       success: false,
-      error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`,
+      error: `CodexLens not available: ${readyStatus.error}`,
     };
   }
 
-  // Determine search path
-  const searchPath = paths.length > 0 ? paths[0] : '.';
-
-  // Execute CodexLens semantic search
-  const result = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], {
-    cwd: searchPath,
-  });
+  // Check index status
+  const indexStatus = await checkIndexStatus(path);
+
+  const args = ['search', query, '--limit', limit.toString(), '--mode', 'exact', '--json'];
+  const result = await executeCodexLens(args, { cwd: path });
 
   if (!result.success) {
     return {
       success: false,
       error: result.error,
       metadata: {
-        mode: 'semantic',
+        mode: 'exact',
         backend: 'codexlens',
         count: 0,
         query,
+        warning: indexStatus.warning,
       },
     };
   }
 
-  // Parse and transform results
+  // Parse results
   let results: SemanticMatch[] = [];
   try {
-    const cleanOutput = result.output!.replace(/\r\n/g, '\n');
-    const parsed = JSON.parse(cleanOutput);
-    const data = parsed.result || parsed;
-    results = (data.results || []).map((item: any) => ({
+    const parsed = JSON.parse(result.output || '{}');
+    const data = parsed.results || parsed;
+    results = (Array.isArray(data) ? data : []).map((item: any) => ({
+      file: item.path || item.file,
+      score: item.score || 0,
+      content: item.excerpt || item.content || '',
+      symbol: item.symbol || null,
+    }));
+  } catch {
+    // Keep empty results
+  }
+
+  return {
+    success: true,
+    results,
+    metadata: {
+      mode: 'exact',
+      backend: 'codexlens',
+      count: results.length,
+      query,
+      warning: indexStatus.warning,
+    },
+  };
+}
+
+/**
+ * Mode: hybrid - Best quality search with RRF fusion
+ * Uses CodexLens hybrid mode (exact + fuzzy + vector)
+ * Requires index with embeddings
+ */
+async function executeHybridMode(params: Params): Promise<SearchResult> {
+  const { query, path = '.', limit = 100 } = params;
+
+  if (!query) {
+    return {
+      success: false,
+      error: 'Query is required for search',
+    };
+  }
+
+  // Check CodexLens availability
+  const readyStatus = await ensureCodexLensReady();
+  if (!readyStatus.ready) {
+    return {
+      success: false,
+      error: `CodexLens not available: ${readyStatus.error}`,
+    };
+  }
+
+  // Check index status
+  const indexStatus = await checkIndexStatus(path);
+
+  const args = ['search', query, '--limit', limit.toString(), '--mode', 'hybrid', '--json'];
+  const result = await executeCodexLens(args, { cwd: path });
+
+  if (!result.success) {
+    return {
+      success: false,
+      error: result.error,
+      metadata: {
+        mode: 'hybrid',
+        backend: 'codexlens',
+        count: 0,
+        query,
+        warning: indexStatus.warning,
+      },
+    };
+  }
+
+  // Parse results
+  let results: SemanticMatch[] = [];
+  try {
+    const parsed = JSON.parse(result.output || '{}');
+    const data = parsed.results || parsed;
+    results = (Array.isArray(data) ? data : []).map((item: any) => ({
       file: item.path || item.file,
       score: item.score || 0,
       content: item.excerpt || item.content || '',
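Hybrid mode delegates the RRF fusion to CodexLens itself. For readers unfamiliar with the technique, here is a generic Reciprocal Rank Fusion sketch — an illustration of the general algorithm, not CodexLens's actual implementation (k=60 is the customary smoothing constant):

```typescript
// Merge ranked result lists from several backends (e.g. exact, fuzzy, vector)
// into one ranking: each item scores 1/(k + rank + 1) per list it appears in.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

Items that rank well in several lists beat items that top only one, which is why the fused ordering tends to be more robust than any single backend's.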
@@ -477,11 +703,11 @@ async function executeSemanticMode(params: Params): Promise<SearchResult> {
       results: [],
       output: result.output,
       metadata: {
-        mode: 'semantic',
+        mode: 'hybrid',
         backend: 'codexlens',
         count: 0,
         query,
-        warning: 'Failed to parse JSON output',
+        warning: indexStatus.warning || 'Failed to parse JSON output',
       },
     };
   }
@@ -490,105 +716,12 @@ async function executeSemanticMode(params: Params): Promise<SearchResult> {
     success: true,
     results,
     metadata: {
-      mode: 'semantic',
+      mode: 'hybrid',
       backend: 'codexlens',
       count: results.length,
       query,
+      note: 'Hybrid mode uses RRF fusion (exact + fuzzy + vector) for best results',
+      warning: indexStatus.warning,
-    },
-  };
-}
-
-/**
- * Mode: graph - Dependency and relationship traversal
- * Uses CodexLens symbol extraction for code analysis
- */
-async function executeGraphMode(params: Params): Promise<SearchResult> {
-  const { query, paths = [], maxResults = 100 } = params;
-
-  // Check CodexLens availability
-  const readyStatus = await ensureCodexLensReady();
-  if (!readyStatus.ready) {
-    return {
-      success: false,
-      error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`,
-    };
-  }
-
-  // First, search for relevant files using text search
-  const searchPath = paths.length > 0 ? paths[0] : '.';
-
-  const textResult = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], {
-    cwd: searchPath,
-  });
-
-  if (!textResult.success) {
-    return {
-      success: false,
-      error: textResult.error,
-      metadata: {
-        mode: 'graph',
-        backend: 'codexlens',
-        count: 0,
-        query,
-      },
-    };
-  }
-
-  // Parse results and extract symbols from top files
-  let results: GraphMatch[] = [];
-  try {
-    const parsed = JSON.parse(textResult.output!);
-    const files = [...new Set((parsed.results || parsed).map((item: any) => item.path || item.file))].slice(
-      0,
-      10
-    );
-
-    // Extract symbols from files in parallel
-    const symbolPromises = files.map((file) =>
-      executeCodexLens(['symbol', file as string, '--json'], { cwd: searchPath }).then((result) => ({
-        file,
-        result,
-      }))
-    );
-
-    const symbolResults = await Promise.all(symbolPromises);
-
-    for (const { file, result } of symbolResults) {
-      if (result.success) {
-        try {
-          const symbols = JSON.parse(result.output!);
-          results.push({
-            file: file as string,
-            symbols: symbols.symbols || symbols,
-            relationships: [],
-          });
-        } catch {
-          // Skip files with parse errors
-        }
-      }
-    }
-  } catch {
-    return {
-      success: false,
-      error: 'Failed to parse search results',
-      metadata: {
-        mode: 'graph',
-        backend: 'codexlens',
-        count: 0,
-        query,
-      },
-    };
-  }
-
-  return {
-    success: true,
-    results,
-    metadata: {
-      mode: 'graph',
-      backend: 'codexlens',
-      count: results.length,
-      query,
-      note: 'Graph mode provides symbol extraction; full dependency graph analysis pending',
     },
   };
 }
@@ -596,36 +729,73 @@ async function executeGraphMode(params: Params): Promise<SearchResult> {
 // Tool schema for MCP
 export const schema: ToolSchema = {
   name: 'smart_search',
-  description: `Intelligent code search with multiple modes.
+  description: `Intelligent code search with three optimized modes: hybrid, exact, ripgrep.
 
-Usage:
-  smart_search(query="function main", path=".") # Auto-select mode
-  smart_search(query="def init", mode="exact") # Exact match
-  smart_search(query="authentication logic", mode="semantic") # NL search
+**Quick Start:**
+  smart_search(query="authentication logic") # Auto mode (intelligent routing)
+  smart_search(action="init", path=".") # Initialize index (required for hybrid)
+  smart_search(action="status") # Check index status
 
-Modes: auto (default), exact, fuzzy, semantic, graph`,
+**Three Core Modes:**
+1. auto (default): Intelligent routing based on query and index
+   - Natural language + index → hybrid
+   - Simple query + index → exact
+   - No index → ripgrep
+
+2. hybrid: CodexLens RRF fusion (exact + fuzzy + vector)
+   - Best quality, semantic understanding
+   - Requires index with embeddings
+
+3. exact: CodexLens FTS (full-text search)
+   - Precise keyword matching
+   - Requires index
+
+4. ripgrep: Direct ripgrep execution
+   - Fast, no index required
+   - Literal string matching
+
+**Actions:**
+- search (default): Intelligent search with auto routing
+- init: Create CodexLens index (required for hybrid/exact)
+- status: Check index and embedding availability
+- search_files: Return file paths only
+
+**Workflow:**
+1. Run action="init" to create index
+2. Use auto mode - it routes to hybrid for NL queries, exact for simple queries
+3. Use ripgrep mode for fast searches without index`,
   inputSchema: {
     type: 'object',
     properties: {
+      action: {
+        type: 'string',
+        enum: ['init', 'search', 'search_files', 'status'],
+        description: 'Action to perform: init (create index), search (default), search_files (paths only), status (check index)',
+        default: 'search',
+      },
       query: {
         type: 'string',
-        description: 'Search query (file pattern, text content, or natural language)',
+        description: 'Search query (required for search/search_files actions)',
       },
       mode: {
         type: 'string',
         enum: SEARCH_MODES,
-        description: 'Search mode (default: auto)',
+        description: 'Search mode: auto (default), hybrid (best quality), exact (CodexLens FTS), ripgrep (fast, no index)',
         default: 'auto',
       },
       output_mode: {
         type: 'string',
         enum: ['full', 'files_only', 'count'],
-        description: 'Output mode: full (default), files_only (paths only), count (per-file counts)',
+        description: 'Output format: full (default), files_only (paths only), count (per-file counts)',
         default: 'full',
       },
+      path: {
+        type: 'string',
+        description: 'Directory path for init/search actions (default: current directory)',
+      },
       paths: {
         type: 'array',
-        description: 'Paths to search within (default: current directory)',
+        description: 'Multiple paths to search within (for search action)',
         items: {
          type: 'string',
        },
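For a caller, the schema defaults above mean a minimal request only needs `query`. A sketch of how the defaults compose with caller input (`applyDefaults` is a hypothetical helper; only the default values themselves come from the schema):

```typescript
// Defaults declared in the smart_search input schema.
const SCHEMA_DEFAULTS = {
  action: 'search',
  mode: 'auto',
  output_mode: 'full',
  maxResults: 100,
  includeHidden: false,
};

// Caller-supplied fields win; everything else falls back to the schema default.
function applyDefaults(params: Record<string, unknown>): Record<string, unknown> {
  return { ...SCHEMA_DEFAULTS, ...params };
}
```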
@@ -633,21 +803,31 @@ Modes: auto (default), exact, fuzzy, semantic, graph`,
       },
       contextLines: {
         type: 'number',
-        description: 'Number of context lines around matches (default: 0)',
+        description: 'Number of context lines around matches (exact mode only)',
         default: 0,
       },
       maxResults: {
         type: 'number',
-        description: 'Maximum number of results to return (default: 100)',
+        description: 'Maximum number of results (default: 100)',
+        default: 100,
+      },
+      limit: {
+        type: 'number',
+        description: 'Alias for maxResults',
         default: 100,
       },
       includeHidden: {
         type: 'boolean',
-        description: 'Include hidden files/directories (default: false)',
+        description: 'Include hidden files/directories',
         default: false,
       },
+      languages: {
+        type: 'array',
+        items: { type: 'string' },
+        description: 'Languages to index (for init action). Example: ["javascript", "typescript"]',
+      },
     },
-    required: ['query'],
+    required: [],
   },
 };
 
@@ -655,20 +835,27 @@ Modes: auto (default), exact, fuzzy, semantic, graph`,
  * Transform results based on output_mode
  */
 function transformOutput(
-  results: ExactMatch[] | SemanticMatch[] | GraphMatch[],
+  results: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown[],
   outputMode: 'full' | 'files_only' | 'count'
 ): unknown {
+  if (!Array.isArray(results)) {
+    return results;
+  }
+
   switch (outputMode) {
     case 'files_only': {
       // Extract unique file paths
-      const files = [...new Set(results.map((r) => r.file))];
+      const files = [...new Set(results.map((r: any) => r.file))].filter(Boolean);
       return { files, count: files.length };
     }
     case 'count': {
       // Count matches per file
       const counts: Record<string, number> = {};
       for (const r of results) {
-        counts[r.file] = (counts[r.file] || 0) + 1;
+        const file = (r as any).file;
+        if (file) {
+          counts[file] = (counts[file] || 0) + 1;
+        }
       }
       return {
         files: Object.entries(counts).map(([file, count]) => ({ file, count })),
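The two non-`full` output transforms are easy to verify in isolation. A minimal sketch of the same dedup and per-file counting logic (helper names are illustrative, not from the diff):

```typescript
type Row = { file: string };

// files_only: unique, truthy file paths plus a count.
function filesOnly(results: Row[]): { files: string[]; count: number } {
  const files = [...new Set(results.map((r) => r.file))].filter(Boolean);
  return { files, count: files.length };
}

// count: number of matches per file, skipping rows without a file field.
function perFileCounts(results: Row[]): { file: string; count: number }[] {
  const counts: Record<string, number> = {};
  for (const r of results) {
    if (r.file) counts[r.file] = (counts[r.file] || 0) + 1;
  }
  return Object.entries(counts).map(([file, count]) => ({ file, count }));
}
```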
@@ -688,34 +875,58 @@ export async function handler(params: Record<string, unknown>): Promise<ToolResu
     return { success: false, error: `Invalid params: ${parsed.error.message}` };
   }
 
-  const { mode, output_mode } = parsed.data;
+  const { action, mode, output_mode, limit, maxResults } = parsed.data;
 
+  // Use limit if maxResults not provided
+  if (limit && !maxResults) {
+    parsed.data.maxResults = limit;
+  }
 
   try {
     let result: SearchResult;
 
+    // Handle actions
+    switch (action) {
+      case 'init':
+        result = await executeInitAction(parsed.data);
+        break;
+
+      case 'status':
+        result = await executeStatusAction(parsed.data);
+        break;
+
+      case 'search_files':
+        // For search_files, use search mode but force files_only output
+        parsed.data.output_mode = 'files_only';
+        // Fall through to search
+
+      case 'search':
+      default:
+        // Handle search modes: auto | hybrid | exact | ripgrep
         switch (mode) {
           case 'auto':
             result = await executeAutoMode(parsed.data);
             break;
+          case 'hybrid':
+            result = await executeHybridMode(parsed.data);
+            break;
           case 'exact':
-            result = await executeExactMode(parsed.data);
+            result = await executeCodexLensExactMode(parsed.data);
             break;
-          case 'fuzzy':
-            result = await executeFuzzyMode(parsed.data);
-            break;
-          case 'semantic':
-            result = await executeSemanticMode(parsed.data);
-            break;
-          case 'graph':
-            result = await executeGraphMode(parsed.data);
+          case 'ripgrep':
+            result = await executeRipgrepMode(parsed.data);
             break;
           default:
-            throw new Error(`Unsupported mode: ${mode}`);
+            throw new Error(`Unsupported mode: ${mode}. Use: auto, hybrid, exact, or ripgrep`);
+        }
+        break;
     }
 
-    // Transform output based on output_mode
+    // Transform output based on output_mode (for search actions only)
+    if (action === 'search' || action === 'search_files') {
       if (result.success && result.results && output_mode !== 'full') {
-        result.results = transformOutput(result.results, output_mode) as typeof result.results;
+        result.results = transformOutput(result.results as any[], output_mode);
       }
+    }
 
     return result.success ? { success: true, result } : { success: false, error: result.error };
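The `search_files` action above reuses the search path through a deliberate switch fall-through after forcing `files_only` output. As a standalone illustration, here is the same routing logic transliterated to Python; `route` and its dict-based parameters are hypothetical stand-ins, not the real smart-search internals:

```python
def route(params: dict) -> str:
    # Illustrative transliteration of the TypeScript dispatch above.
    # Use limit if maxResults not provided (same normalization as the diff).
    if params.get("limit") and not params.get("maxResults"):
        params["maxResults"] = params["limit"]

    action = params.get("action", "search")
    if action == "init":
        return "init"
    if action == "status":
        return "status"
    if action == "search_files":
        # Equivalent of the switch fall-through: force files_only, then search.
        params["output_mode"] = "files_only"

    mode = params.get("mode", "auto")
    if mode not in ("auto", "hybrid", "exact", "ripgrep"):
        raise ValueError(f"Unsupported mode: {mode}. Use: auto, hybrid, exact, or ripgrep")
    return f"search:{mode}:{params.get('output_mode', 'full')}"
```

Note how `search_files` never returns early: it mutates `output_mode` and drops into the generic search branch, mirroring the fall-through in the TypeScript switch.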

--- commands.py ---

@@ -142,11 +142,11 @@ def init(
         if not no_embeddings:
             try:
                 from codexlens.semantic import SEMANTIC_AVAILABLE
-                from codexlens.cli.embedding_manager import generate_embeddings
+                from codexlens.cli.embedding_manager import generate_embeddings_recursive, get_embeddings_status
 
                 if SEMANTIC_AVAILABLE:
-                    # Find the index file
-                    index_path = Path(build_result.index_root) / "_index.db"
+                    # Use the index root directory (not the _index.db file)
+                    index_root = Path(build_result.index_root)
 
                     if not json_mode:
                         console.print("\n[bold]Generating embeddings...[/bold]")
@@ -157,8 +157,8 @@ def init(
                     if not json_mode and verbose:
                         console.print(f" {msg}")
 
-                    embed_result = generate_embeddings(
-                        index_path,
+                    embed_result = generate_embeddings_recursive(
+                        index_root,
                         model_profile=embedding_model,
                         force=False,  # Don't force regenerate during init
                         chunk_size=2000,
@@ -167,29 +167,56 @@ def init(
 
                     if embed_result["success"]:
                         embed_data = embed_result["result"]
-                        result["embeddings_generated"] = True
-                        result["embeddings_count"] = embed_data["chunks_embedded"]
+                        # Get comprehensive coverage statistics
+                        status_result = get_embeddings_status(index_root)
+                        if status_result["success"]:
+                            coverage = status_result["result"]
+                            result["embeddings"] = {
+                                "generated": True,
+                                "total_indexes": coverage["total_indexes"],
+                                "total_files": coverage["total_files"],
+                                "files_with_embeddings": coverage["files_with_embeddings"],
+                                "coverage_percent": coverage["coverage_percent"],
+                                "total_chunks": coverage["total_chunks"],
+                            }
+                        else:
+                            result["embeddings"] = {
+                                "generated": True,
+                                "total_chunks": embed_data["total_chunks_created"],
+                                "files_processed": embed_data["total_files_processed"],
+                            }
 
                         if not json_mode:
-                            console.print(f"[green]✓[/green] Generated [bold]{embed_data['chunks_embedded']}[/bold] embeddings in {embed_data['elapsed_time']:.1f}s")
+                            console.print(f"[green]✓[/green] Generated embeddings for [bold]{embed_data['total_files_processed']}[/bold] files")
+                            console.print(f" Total chunks: [bold]{embed_data['total_chunks_created']}[/bold]")
+                            console.print(f" Indexes processed: [bold]{embed_data['indexes_successful']}/{embed_data['indexes_processed']}[/bold]")
                     else:
                         if not json_mode:
                             console.print(f"[yellow]Warning:[/yellow] Embedding generation failed: {embed_result.get('error', 'Unknown error')}")
-                        result["embeddings_generated"] = False
-                        result["embeddings_error"] = embed_result.get("error")
+                        result["embeddings"] = {
+                            "generated": False,
+                            "error": embed_result.get("error"),
+                        }
                 else:
                     if not json_mode and verbose:
                         console.print("[dim]Semantic search not available. Skipping embeddings.[/dim]")
-                    result["embeddings_generated"] = False
-                    result["embeddings_error"] = "Semantic dependencies not installed"
+                    result["embeddings"] = {
+                        "generated": False,
+                        "error": "Semantic dependencies not installed",
+                    }
             except Exception as e:
                 if not json_mode and verbose:
                     console.print(f"[yellow]Warning:[/yellow] Could not generate embeddings: {e}")
-                result["embeddings_generated"] = False
-                result["embeddings_error"] = str(e)
+                result["embeddings"] = {
+                    "generated": False,
+                    "error": str(e),
+                }
         else:
-            result["embeddings_generated"] = False
-            result["embeddings_error"] = "Skipped (--no-embeddings)"
+            result["embeddings"] = {
+                "generated": False,
+                "error": "Skipped (--no-embeddings)",
+            }
 
     except StorageError as exc:
         if json_mode:
@@ -611,6 +638,24 @@ def status(
         except Exception:
             pass
 
+        # Check embeddings coverage
+        embeddings_info = None
+        has_vector_search = False
+        try:
+            from codexlens.cli.embedding_manager import get_embeddings_status
+
+            if index_root.exists():
+                embed_status = get_embeddings_status(index_root)
+                if embed_status["success"]:
+                    embeddings_info = embed_status["result"]
+                    # Enable vector search if coverage >= 50%
+                    has_vector_search = embeddings_info["coverage_percent"] >= 50.0
+        except ImportError:
+            # Embedding manager not available
+            pass
+        except Exception as e:
+            logger.debug(f"Failed to get embeddings status: {e}")
+
         stats = {
             "index_root": str(index_root),
             "registry_path": str(_get_registry_path()),
@@ -624,10 +669,14 @@ def status(
                 "exact_fts": True,  # Always available
                 "fuzzy_fts": has_dual_fts,
                 "hybrid_search": has_dual_fts,
-                "vector_search": False,  # Not yet implemented
+                "vector_search": has_vector_search,
             },
         }
 
+        # Add embeddings info if available
+        if embeddings_info:
+            stats["embeddings"] = embeddings_info
+
         if json_mode:
             print_json(success=True, result=stats)
         else:
@@ -648,7 +697,20 @@ def status(
             else:
                 console.print(f" Fuzzy FTS: ✗ (run 'migrate' to enable)")
                 console.print(f" Hybrid Search: ✗ (run 'migrate' to enable)")
-            console.print(f" Vector Search: ✗ (future)")
+            if has_vector_search:
+                console.print(f" Vector Search: ✓ (embeddings available)")
+            else:
+                console.print(f" Vector Search: ✗ (no embeddings or coverage < 50%)")
+
+            # Display embeddings statistics if available
+            if embeddings_info:
+                console.print("\n[bold]Embeddings Coverage:[/bold]")
+                console.print(f" Total Indexes: {embeddings_info['total_indexes']}")
+                console.print(f" Total Files: {embeddings_info['total_files']}")
+                console.print(f" Files with Embeddings: {embeddings_info['files_with_embeddings']}")
+                console.print(f" Coverage: {embeddings_info['coverage_percent']:.1f}%")
+                console.print(f" Total Chunks: {embeddings_info['total_chunks']}")
 
     except StorageError as exc:
         if json_mode:
@@ -1885,6 +1947,12 @@ def embeddings_generate(
         "--chunk-size",
         help="Maximum chunk size in characters.",
     ),
+    recursive: bool = typer.Option(
+        False,
+        "--recursive",
+        "-r",
+        help="Recursively process all _index.db files in directory tree.",
+    ),
    json_mode: bool = typer.Option(False, "--json", help="Output JSON response."),
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose output."),
 ) -> None:
@@ -1908,16 +1976,30 @@ def embeddings_generate(
     _configure_logging(verbose)
 
     try:
-        from codexlens.cli.embedding_manager import generate_embeddings
+        from codexlens.cli.embedding_manager import generate_embeddings, generate_embeddings_recursive
 
         # Resolve path
         target_path = path.expanduser().resolve()
 
+        # Determine if we should use recursive mode
+        use_recursive = False
+        index_path = None
+        index_root = None
+
         if target_path.is_file() and target_path.name == "_index.db":
             # Direct index file
             index_path = target_path
+            if recursive:
+                # Use parent directory for recursive processing
+                use_recursive = True
+                index_root = target_path.parent
         elif target_path.is_dir():
-            # Try to find index for this project
+            if recursive:
+                # Recursive mode: process all _index.db files in directory tree
+                use_recursive = True
+                index_root = target_path
+            else:
+                # Non-recursive: Try to find index for this project
                 registry = RegistryStore()
                 try:
                     registry.initialize()
@@ -1940,9 +2022,22 @@ def embeddings_generate(
                 console.print(f" {msg}")
 
         console.print(f"[bold]Generating embeddings[/bold]")
+        if use_recursive:
+            console.print(f"Index root: [dim]{index_root}[/dim]")
+            console.print(f"Mode: [yellow]Recursive[/yellow]")
+        else:
             console.print(f"Index: [dim]{index_path}[/dim]")
         console.print(f"Model: [cyan]{model}[/cyan]\n")
 
+        if use_recursive:
+            result = generate_embeddings_recursive(
+                index_root,
+                model_profile=model,
+                force=force,
+                chunk_size=chunk_size,
+                progress_callback=progress_update,
+            )
+        else:
             result = generate_embeddings(
                 index_path,
                 model_profile=model,
@@ -1968,6 +2063,30 @@ def embeddings_generate(
             raise typer.Exit(code=1)
 
         data = result["result"]
 
+        if use_recursive:
+            # Recursive mode output
+            console.print(f"[green]✓[/green] Recursive embeddings generation complete!")
+            console.print(f" Indexes processed: {data['indexes_processed']}")
+            console.print(f" Indexes successful: {data['indexes_successful']}")
+            if data['indexes_failed'] > 0:
+                console.print(f" [yellow]Indexes failed: {data['indexes_failed']}[/yellow]")
+            console.print(f" Total chunks created: {data['total_chunks_created']:,}")
+            console.print(f" Total files processed: {data['total_files_processed']}")
+            if data['total_files_failed'] > 0:
+                console.print(f" [yellow]Total files failed: {data['total_files_failed']}[/yellow]")
+            console.print(f" Model profile: {data['model_profile']}")
+
+            # Show details if verbose
+            if verbose and data.get('details'):
+                console.print("\n[dim]Index details:[/dim]")
+                for detail in data['details']:
+                    status_icon = "[green]✓[/green]" if detail['success'] else "[red]✗[/red]"
+                    console.print(f" {status_icon} {detail['path']}")
+                    if not detail['success'] and detail.get('error'):
+                        console.print(f" [dim]Error: {detail['error']}[/dim]")
+        else:
+            # Single index mode output
             elapsed = data["elapsed_time"]
 
             console.print(f"[green]✓[/green] Embeddings generated successfully!")

--- embedding_manager.py ---

@@ -255,6 +255,21 @@ def generate_embeddings(
     }
 
 
+def discover_all_index_dbs(index_root: Path) -> List[Path]:
+    """Recursively find all _index.db files in an index tree.
+
+    Args:
+        index_root: Root directory to scan for _index.db files
+
+    Returns:
+        Sorted list of paths to _index.db files
+    """
+    if not index_root.exists():
+        return []
+
+    return sorted(index_root.rglob("_index.db"))
+
+
 def find_all_indexes(scan_dir: Path) -> List[Path]:
     """Find all _index.db files in directory tree.
 
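The new `discover_all_index_dbs` helper is a thin wrapper over `Path.rglob`. A self-contained check of its two behaviors (sorted recursive discovery, and an empty list for a missing root), with the helper body copied from the diff and exercised against a throwaway directory tree:

```python
import tempfile
from pathlib import Path

def discover_all_index_dbs(index_root: Path):
    # Same body as the helper added above: empty on missing root, else a
    # sorted recursive glob for every _index.db under the root.
    if not index_root.exists():
        return []
    return sorted(index_root.rglob("_index.db"))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub" / "nested").mkdir(parents=True)
    (root / "_index.db").touch()
    (root / "sub" / "nested" / "_index.db").touch()
    # Finds both the root-level and the deeply nested index database.
    found = discover_all_index_dbs(root)
```

Sorting the results makes recursive processing order deterministic, which keeps the `[i/N] Processing ...` progress output stable across runs.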
@@ -270,6 +285,146 @@ def find_all_indexes(scan_dir: Path) -> List[Path]:
     return list(scan_dir.rglob("_index.db"))
 
 
+
+def generate_embeddings_recursive(
+    index_root: Path,
+    model_profile: str = "code",
+    force: bool = False,
+    chunk_size: int = 2000,
+    progress_callback: Optional[callable] = None,
+) -> Dict[str, any]:
+    """Generate embeddings for all index databases in a project recursively.
+
+    Args:
+        index_root: Root index directory containing _index.db files
+        model_profile: Model profile (fast, code, multilingual, balanced)
+        force: If True, regenerate even if embeddings exist
+        chunk_size: Maximum chunk size in characters
+        progress_callback: Optional callback for progress updates
+
+    Returns:
+        Aggregated result dictionary with generation statistics
+    """
+    # Discover all _index.db files
+    index_files = discover_all_index_dbs(index_root)
+
+    if not index_files:
+        return {
+            "success": False,
+            "error": f"No index databases found in {index_root}",
+        }
+
+    if progress_callback:
+        progress_callback(f"Found {len(index_files)} index databases to process")
+
+    # Process each index database
+    all_results = []
+    total_chunks = 0
+    total_files_processed = 0
+    total_files_failed = 0
+
+    for idx, index_path in enumerate(index_files, 1):
+        if progress_callback:
+            try:
+                rel_path = index_path.relative_to(index_root)
+            except ValueError:
+                rel_path = index_path
+            progress_callback(f"[{idx}/{len(index_files)}] Processing {rel_path}")
+
+        result = generate_embeddings(
+            index_path,
+            model_profile=model_profile,
+            force=force,
+            chunk_size=chunk_size,
+            progress_callback=None,  # Don't cascade callbacks
+        )
+
+        all_results.append({
+            "path": str(index_path),
+            "success": result["success"],
+            "result": result.get("result"),
+            "error": result.get("error"),
+        })
+
+        if result["success"]:
+            data = result["result"]
+            total_chunks += data["chunks_created"]
+            total_files_processed += data["files_processed"]
+            total_files_failed += data["files_failed"]
+
+    successful = sum(1 for r in all_results if r["success"])
+
+    return {
+        "success": successful > 0,
+        "result": {
+            "indexes_processed": len(index_files),
+            "indexes_successful": successful,
+            "indexes_failed": len(index_files) - successful,
+            "total_chunks_created": total_chunks,
+            "total_files_processed": total_files_processed,
+            "total_files_failed": total_files_failed,
+            "model_profile": model_profile,
+            "details": all_results,
+        },
+    }
+
+
+def get_embeddings_status(index_root: Path) -> Dict[str, any]:
+    """Get comprehensive embeddings coverage status for all indexes.
+
+    Args:
+        index_root: Root index directory
+
+    Returns:
+        Aggregated status with coverage statistics
+    """
+    index_files = discover_all_index_dbs(index_root)
+
+    if not index_files:
+        return {
+            "success": True,
+            "result": {
+                "total_indexes": 0,
+                "total_files": 0,
+                "files_with_embeddings": 0,
+                "files_without_embeddings": 0,
+                "total_chunks": 0,
+                "coverage_percent": 0.0,
+                "indexes_with_embeddings": 0,
+                "indexes_without_embeddings": 0,
+            },
+        }
+
+    total_files = 0
+    files_with_embeddings = 0
+    total_chunks = 0
+    indexes_with_embeddings = 0
+
+    for index_path in index_files:
+        status = check_index_embeddings(index_path)
+        if status["success"]:
+            result = status["result"]
+            total_files += result["total_files"]
+            files_with_embeddings += result["files_with_chunks"]
+            total_chunks += result["total_chunks"]
+            if result["has_embeddings"]:
+                indexes_with_embeddings += 1
+
+    return {
+        "success": True,
+        "result": {
+            "total_indexes": len(index_files),
+            "total_files": total_files,
+            "files_with_embeddings": files_with_embeddings,
+            "files_without_embeddings": total_files - files_with_embeddings,
+            "total_chunks": total_chunks,
+            "coverage_percent": round((files_with_embeddings / total_files * 100) if total_files > 0 else 0, 1),
+            "indexes_with_embeddings": indexes_with_embeddings,
+            "indexes_without_embeddings": len(index_files) - indexes_with_embeddings,
+        },
+    }
+
+
 def get_embedding_stats_summary(index_root: Path) -> Dict[str, any]:
     """Get summary statistics for all indexes in root directory.
 
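The `coverage_percent` figure that `get_embeddings_status` reports (and that smart-search compares against the 50% hybrid-routing threshold) is a plain ratio over the per-index counters. A minimal sketch of that aggregation, with a hypothetical `aggregate_coverage` helper and inline stats standing in for values read from each `_index.db`:

```python
def aggregate_coverage(per_index_stats):
    # Sum per-index counters, then derive the overall coverage ratio,
    # mirroring the arithmetic in get_embeddings_status above.
    total_files = sum(s["total_files"] for s in per_index_stats)
    files_with = sum(s["files_with_chunks"] for s in per_index_stats)
    return {
        "total_indexes": len(per_index_stats),
        "total_files": total_files,
        "files_with_embeddings": files_with,
        "coverage_percent": round((files_with / total_files * 100) if total_files > 0 else 0, 1),
    }
```

A consumer such as the smart-search router would then gate hybrid mode on `coverage_percent >= 50.0` and fall back to exact search otherwise.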