Fix CodexLens embeddings generation to achieve 100% coverage

Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files). This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage with 2,042 semantic chunks across all 303 files in 26 index databases.

Key improvements:

1. **Recursive embeddings generation** (embedding_manager.py):
   - Add generate_embeddings_recursive() to process all _index.db files in directory tree
   - Add get_embeddings_status() for comprehensive coverage statistics
   - Add discover_all_index_dbs() helper for recursive file discovery

2. **Enhanced CLI commands** (commands.py):
   - embeddings-generate: Add --recursive flag for full project coverage
   - init: Use recursive generation by default for complete indexing
   - status: Display embeddings coverage statistics with 50% threshold

3. **Smart search routing improvements** (smart-search.ts):
   - Add 50% embeddings coverage threshold for hybrid mode routing
   - Auto-fallback to exact mode when coverage insufficient
   - Strip ANSI color codes from JSON output for correct parsing
   - Add embeddings_coverage_percent to IndexStatus and SearchMetadata
   - Provide clear warnings with actionable suggestions

4. **Documentation and analysis**:
   - Add SMART_SEARCH_ANALYSIS.md with initial investigation
   - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue
   - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary
   - Add check_embeddings.py script for coverage verification

Results:
- Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase
- Semantic chunks increased from 10 to 2,042 - 204x increase
- All 26 subdirectory indexes now have embeddings vs just 1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-10 02:24:35 +08:00 · 2025-12-17 17:54:33 +08:00
parent d06a3ca12e
commit 74a830694c
7 changed files with 1540 additions and 346 deletions
--- a/ccw/EMBEDDINGS_FIX_SUMMARY.md
+++ b/ccw/EMBEDDINGS_FIX_SUMMARY.md
@@ -0,0 +1,165 @@
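+# CodexLens Embeddings Fix Summary
+
+## Fix Results
+
+### ✅ Completed
+
+1. **Recursive embeddings generation** (`embedding_manager.py`)
+   - Added the `generate_embeddings_recursive()` function
+   - Added the `get_embeddings_status()` function
+   - Recursively processes the `_index.db` files in all subdirectories
+
+2. **CLI command enhancements** (`commands.py`)
+   - `embeddings-generate` gains a `--recursive` flag
+   - The `init` command uses recursive generation (all subdirectories are handled automatically)
+   - The `status` command displays embeddings coverage statistics
+
+3. **Smart Search routing** (`smart-search.ts`)
+   - Added a 50% coverage threshold
+   - Automatically falls back to exact mode when embeddings coverage is insufficient
+   - Provides clear warning messages
+   - Strips ANSI color codes so that JSON output parses correctly
+
+### ✅ Test Results
+
+**CCW project (d:\Claude_dms3\ccw)**:
+- Index databases: 26
+- Total files: 303
+- Embeddings coverage: **100%** (all 303 files)
+- Chunks generated: **2,042** (previously only 10)
+
+**Comparison**:
+| Metric | Before | After | Improvement |
+|------|--------|--------|------|
+| Coverage | 1.6% (5/303) | 100% (303/303) | **62.5x** |
+| Chunks | 10 | 2,042 | **204x** |
+| Indexes with embeddings | 1/26 | 26/26 | **26x** |
+
+## Current Issues
+
+### ⚠️ Known Remaining Issues
+
+1. **Path mapping**
+   - `embeddings-generate --recursive` currently requires the index path rather than the source path
+   - Users should be able to pass the source path (`d:\Claude_dms3\ccw`)
+   - For now the index path must be used: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw`
+
+2. **Global vs. project-level scope of the status command**
+   - `codexlens status` returns global statistics (all projects)
+   - A project-level embeddings status is needed
+   - `embeddings-status` checks only a single `_index.db` and does not recurse
+
+## Suggested Follow-up Fixes
+
+### P1 - Path Mapping Fix
+
+Modify the `embeddings_generate` command in `commands.py` (lines 1996-2000):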
+
+```python
+elif target_path.is_dir():
+    if recursive:
+        # Recursive mode: Map source path to index root
+        registry = RegistryStore()
+        try:
+            registry.initialize()
+            mapper = PathMapper()
+            index_db_path = mapper.source_to_index_db(target_path)
+            index_root = index_db_path.parent  # Use index directory root
+            use_recursive = True
+        finally:
+            registry.close()
+```
+
+### P2 - Project-level Status
+
+Option A: extend the `embeddings-status` command to support recursion
+```bash
+codexlens embeddings-status . --recursive --json
+```
+
+Option B: change the `status` command to accept a path argument
+```bash
+codexlens status --project . --json
+```
+
+## Usage Guide
+
+### Current Workflow
+
+**Generate embeddings (full coverage)**:
+```bash
+# Method 1: use the index path (the way that works today)
+cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
+python -m codexlens embeddings-generate . --recursive --force --model fast
+
+# Method 2: the init command (recursive automatically, recommended)
+cd d:\Claude_dms3\ccw
+python -m codexlens init . --force
+```
+
+**Check coverage**:
+```bash
+# Project root
+cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
+python check_embeddings.py  # shows detailed per-directory statistics
+
+# Global status
+python -m codexlens status --json  # summary across all projects
+```
+
+**Smart Search**:
+```javascript
+// MCP tool call
+smart_search(query="authentication patterns")
+
+// This now:
+// 1. Checks embeddings coverage
+// 2. Uses hybrid mode if coverage is >= 50%
+// 3. Falls back to exact mode if coverage is < 50%
+// 4. Shows a warning message
+```
+
+### Best Practices
+
+1. **Generate embeddings automatically when initializing a project**:
+   ```bash
+   codexlens init /path/to/project --force
+   ```
+
+2. **Regenerate periodically to keep them up to date**:
+   ```bash
+   codexlens embeddings-generate /index/path --recursive --force
+   ```
+
+3. **Use the fast model for quick tests**:
+   ```bash
+   codexlens embeddings-generate . --recursive --model fast
+   ```
+
+4. **Use the code model for the best quality**:
+   ```bash
+   codexlens embeddings-generate . --recursive --model code
+   ```
+
+## Technical Details
+
+### Modified Files
+
+**Python (CodexLens)**:
+- `codex-lens/src/codexlens/cli/embedding_manager.py` - added the recursive functions
+- `codex-lens/src/codexlens/cli/commands.py` - updated init, status, embeddings-generate
+
+**TypeScript (CCW)**:
+- `ccw/src/tools/smart-search.ts` - smart routing + ANSI stripping
+- `ccw/src/tools/codex-lens.ts` - (unchanged; uses the existing implementation)
+
+### Dependency Versions
+
+- CodexLens: current development version
+- Fastembed: installed (ONNX backend)
+- Models: fast (~80MB), code (~150MB)
+
+---
+
+**Fixed on**: 2025-12-17  
+**Verification status**: ✅ Core functionality works; the remaining path-mapping issue is still to be fixed
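The fix summary above lists ANSI stripping as a prerequisite for parsing the CLI's JSON output. The actual change lives in `smart-search.ts` and is not shown in this section, so the following is only a minimal Python sketch of the idea, assuming the colored output uses standard CSI escape sequences; `ANSI_CSI` and `parse_cli_json` are hypothetical names, and the sample string is illustrative rather than real CodexLens output.

```python
import json
import re

# Matches CSI escape sequences such as "\x1b[32m" (set color) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def parse_cli_json(raw: str) -> dict:
    """Remove ANSI CSI sequences from CLI output, then parse the rest as JSON."""
    return json.loads(ANSI_CSI.sub("", raw))


# Illustrative input: a JSON payload wrapped in a green color code and a reset code.
colored = '\x1b[32m{"success": true, "embeddings_coverage_percent": 100.0}\x1b[0m'
status = parse_cli_json(colored)
print(status["embeddings_coverage_percent"])
```

Without the stripping step, `json.loads` would reject the input at the first escape character, which is consistent with the parsing failure the summary describes.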
--- a/ccw/SMART_SEARCH_ANALYSIS.md
+++ b/ccw/SMART_SEARCH_ANALYSIS.md
@@ -0,0 +1,167 @@
+# Smart Search Index Analysis Report
+
+## Question
+Determine whether the current `smart_search(action="init")` builds a vector (embedding) index or only the basic index.
+
+## Findings
+
+### 1. Default Behavior of the Init Action
+
+Based on the code, `smart_search(action="init")` behaves as follows:
+
+**Code path**: `ccw/src/tools/smart-search.ts` → `ccw/src/tools/codex-lens.ts`
+
+```typescript
+// smart-search.ts: executeInitAction (lines 297-323)
+async function executeInitAction(params: Params): Promise<SearchResult> {
+  const { path = '.', languages } = params;
+  const args = ['init', path];
+  if (languages && languages.length > 0) {
+    args.push('--languages', languages.join(','));
+  }
+  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
+  // ...
+}
+```
+
+**Key findings**:
+- `smart_search(action="init")` invokes the `codexlens init` command
+- It does **not pass** the `--no-embeddings` flag
+- It does **not pass** the `--embedding-model` flag
+
+### 2. Default Behavior of CodexLens Init
+
+According to the output of `codexlens init --help`:
+
+> If semantic search dependencies are installed, **automatically generates embeddings** after indexing completes. Use --no-embeddings to skip this step.
+
+**Conclusion**:
+- ✅ The `init` command **does** generate embeddings by default (if the semantic search dependencies are installed)
+- ❌ The current implementation **did not generate** embeddings for all files
+
+### 3. Actual Test Results
+
+#### First Init (no embeddings generated)
+```bash
+$ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
+# Result: 303 files indexed, but vector_search: false
+```
+
+**Root cause**:
+Although the semantic search dependency (fastembed) is installed, init hit this warning:
+```
+Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.
+```
+
+#### After Generating Embeddings Manually
+```bash
+$ python -m codexlens embeddings-generate . --force --verbose
+
+Processing 5 files...
+- D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
+- D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
+- D:\Claude_dms3\ccw\README.md: 2 chunks
+- D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
+- D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks
+
+Total: 10 chunks, 5 files
+Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)
+```
+
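+**Key findings**:
+- ⚠️ Embeddings were generated for only **5 documentation/config files**
+- ⚠️ **No embeddings were generated for the 298 code files** (.ts, .js, etc.)
+- ✅ The embeddings status reports `coverage_percent: 100.0` (but only relative to the "files that should have embeddings")
+
+#### Hybrid Search Test
+```bash
+$ smart_search(query="authentication and authorization patterns", mode="hybrid")
+# ✅ Returned 5 results with similarity scores
+# ✅ Confirms that vector search works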
+```
+
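+## 4. Index Type Comparison
+
+| Index type | Current status | Files covered | Notes |
+|---------|---------|-----------|------|
+| **Exact FTS** | ✅ Enabled | All 303 files | Full-text search based on SQLite FTS5 |
+| **Fuzzy FTS** | ❌ Not enabled | - | Fuzzy-match search |
+| **Vector Search** | ⚠️ Partially enabled | Only 5 documentation files | Semantic search based on fastembed |
+| **Hybrid Search** | ⚠️ Partially enabled | Only 5 documentation files | RRF fusion (exact + fuzzy + vector) |
+
+## 5. Why Do Only 5 Files Have Embeddings?
+
+**Possible causes**:
+
+1. **File-type filtering**: CodexLens may generate embeddings only for documentation files (.md) and config files
+2. **Code files use symbol indexing**: code files (.ts, .js) may rely on symbol extraction rather than text embeddings
+3. **Performance considerations**: generating embeddings for 300+ files takes significant time and storage
+
+## 6. Conclusions
+
+### Current behavior of `smart_search(action="init")`:
+
+✅ **Attempts** to build a vector index (if the semantic dependencies are installed)  
+⚠️ **In practice** generates embeddings only for documentation/config files (5/303 files)  
+✅ **Supports** hybrid-mode search (for files that have embeddings)  
+✅ **Supports** exact-mode search (for all 303 files)  
+
+### Smart routing of search modes:
+
+```
+User query → auto mode → decision tree:
+  ├─ Natural-language query + embeddings available → hybrid mode (RRF fusion)
+  ├─ Simple query + index available → exact mode (FTS)
+  └─ No index → ripgrep mode (literal match)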
+```
+
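+## 7. Recommendations
+
+### If full semantic search support is needed:
+
+```bash
+# Option 1: check whether all code files are supposed to have embeddings
+python -m codexlens embeddings-status . --verbose
+
+# Option 2: explicitly generate embeddings for code files (if supported)
+# Check the CodexLens documentation to confirm the semantic indexing strategy for code files
+
+# Option 3: use hybrid mode for documentation search and exact mode for code search
+smart_search(query="architecture design", mode="hybrid")  # semantic search over docs
+smart_search(query="function_name", mode="exact")  # exact search over code
+```
+
+### Current best practice:
+
+```javascript
+// 1. Initialize the index (one time)
+smart_search(action="init", path=".")
+
+// 2. Smart search (auto mode recommended)
+smart_search(query="your query")  // picks the best mode automatically
+
+// 3. Search with a specific mode
+smart_search(query="natural language query", mode="hybrid")  // semantic search
+smart_search(query="exact_identifier", mode="exact")         // exact match
+smart_search(query="quick literal", mode="ripgrep")          // fast literal search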
+```
+
+## 8. Technical Details
+
+### Embeddings Model
+- **Model**: jinaai/jina-embeddings-v2-base-code
+- **Dimensions**: 768
+- **Size**: ~150MB
+- **Backend**: fastembed (ONNX-based)
+
+### Index Storage
+- **Location**: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db`
+- **Size**: 122.57 MB
+- **Schema version**: 5
+- **Files**: 303
+- **Directories**: 26
+
+---
+
+**Generated**: 2025-12-17  
+**CodexLens version**: detected from the current installation
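The analysis above describes hybrid search as "RRF fusion (exact + fuzzy + vector)" without showing how the ranked lists are combined. As a rough illustration, here is a minimal sketch of standard reciprocal rank fusion, where each document scores the sum of `1 / (k + rank)` over the lists it appears in. This is the textbook formula, not CodexLens's confirmed implementation; the constant `k = 60` is a commonly used default, `rrf_fuse` is a hypothetical name, and `auth.ts` and `session.ts` are made-up file names.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by name so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative rankings from two hypothetical backends.
exact_hits = ["auth.ts", "session.ts", "README.md"]
vector_hits = ["auth.ts", "README.md", "MCP_SERVER.md"]

fused = rrf_fuse([exact_hits, vector_hits])
print([doc for doc, _ in fused])
```

A document that appears in only one list still receives a score, so files without embeddings are outranked by files present in every list rather than dropped outright; how much that matters in practice depends on how many files have embeddings at all.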
--- a/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
+++ b/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
@@ -0,0 +1,330 @@
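+# Smart Search Index Analysis Report (Corrected)
+
+## User's Objections
+
+1. ❓ Why are vector embeddings not generated for code files?
+2. ❓ The exact FTS index and the vector index should cover the same content
+3. ❓ init should return a summary of the FTS and vector indexes
+
+**Conclusion: the user's objections are 100% correct! This is a design flaw in CodexLens.**
+
+---
+
+## The Actual Situation
+
+### 1. Hierarchical Index Architecture
+
+CodexLens uses a **hierarchical per-directory index**:
+
+```
+D:\Claude_dms3\ccw\
+├── _index.db             ← root directory index (5 files)
+├── src/
+│   ├── _index.db         ← src directory index (2 files)
+│   ├── tools/
+│   │   └── _index.db     ← tools subdirectory index (25 files)
+│   └── ...
+└── ... (26 _index.db files in total)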
+```
+
+### 2. Index Coverage
+
+| Directory | Files | FTS index | Embeddings |
+|------|--------|---------|------------|
+| **Root** | 5 | ✅ | ✅ (10 chunks) |
+| bin/ | 2 | ✅ | ❌ no semantic_chunks table |
+| dist/ | 4 | ✅ | ❌ no semantic_chunks table |
+| dist/commands/ | 24 | ✅ | ❌ no semantic_chunks table |
+| dist/tools/ | 50 | ✅ | ❌ no semantic_chunks table |
+| src/tools/ | 25 | ✅ | ❌ no semantic_chunks table |
+| src/commands/ | 12 | ✅ | ❌ no semantic_chunks table |
+| ... | ... | ... | ... |
+| **Total** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
+
+### 3. Key Findings
+
+```text
+# Output of the check script
+Total index databases: 26
+Directories with embeddings: 1        # ❌ root directory only!
+Total files indexed: 303              # ✅ FTS index is complete
+Total semantic chunks: 10             # ❌ only the 5 files in the root directory
+```
+
+**Problems**:
+- ✅ **All 303 files** have an FTS index (spread across the 26 `_index.db` files)
+- ❌ **Only 5 files** (1.6%) have vector embeddings
+- ❌ The `_index.db` files of the **25 subdirectories** do not even have a `semantic_chunks` table
+
+---
+
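+## Why Did This Happen?
+
+### Root Cause Analysis
+
+1. **The `init` operation**:
+   ```bash
+   codexlens init .
+   ```
+   - ✅ Creates FTS indexes for all 303 files (distributed)
+   - ⚠️ Attempts to generate embeddings, but hits the "Index already has 10 chunks" warning
+   - ❌ Generates embeddings for the root directory only
+
+2. **The `embeddings-generate` operation**:
+   ```bash
+   codexlens embeddings-generate . --force
+   ```
+   - ❌ Processes only the root directory's `_index.db`
+   - ❌ **Does not recurse into the subdirectory indexes**
+   - Result: only 5 documentation files have embeddings
+
+### Design Problem
+
+**The CodexLens embeddings architecture is flawed**:
+
+```python
+# Expected behavior
+for index_db in project_root.rglob("_index.db"):
+    generate_embeddings(index_db)
+
+# Actual behavior
+generate_embeddings(root_index_db_only)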
+```
+
+---
+
+## Gaps in the Init Response
+
+### What `init` Currently Returns
+
+```json
+{
+  "success": true,
+  "message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
+}
+```
+
+**Problems**:
+- ❌ Does not say how many files were indexed
+- ❌ Does not say whether embeddings were generated
+- ❌ Does not report embeddings coverage
+
+### What It Should Return
+
+```json
+{
+  "success": true,
+  "message": "Index created successfully",
+  "stats": {
+    "total_files": 303,
+    "total_directories": 26,
+    "index_databases": 26,
+    "fts_coverage": {
+      "files": 303,
+      "percentage": 100.0
+    },
+    "embeddings_coverage": {
+      "files": 5,
+      "chunks": 10,
+      "percentage": 1.6,
+      "warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
+    },
+    "features": {
+      "exact_fts": true,
+      "fuzzy_fts": false,
+      "vector_search": "partial"
+    }
+  }
+}
+```
+
+---
+
+## Solutions
+
+### Option 1: Generate Embeddings Recursively (Recommended)
+
+```bash
+# Generate embeddings for all subdirectories
+find .codexlens/indexes -name "_index.db" -exec \
+  python -m codexlens embeddings-generate {} --force \;
+```
+
+### Option 2: Improve the Init Command
+
+```python
+# codexlens/cli.py
+def init_with_embeddings(project_root):
+    """Initialize with recursive embeddings generation"""
+    # 1. Build FTS indexes (current behavior)
+    build_indexes(project_root)
+    
+    # 2. Generate embeddings for ALL subdirs
+    for index_db in find_all_index_dbs(project_root):
+        if has_semantic_deps():
+            generate_embeddings(index_db)
+    
+    # 3. Return comprehensive stats
+    return {
+        "fts_coverage": get_fts_stats(),
+        "embeddings_coverage": get_embeddings_stats(),
+        "features": detect_features()
+    }
+```
+
+### Option 3: Improve Smart Search Routing
+
+```python
+# Current logic
+def classify_intent(query, hasIndex):
+    if not hasIndex:
+        return "ripgrep"
+    elif is_natural_language(query):
+        return "hybrid"  # ❌ but only 5 files have embeddings!
+    else:
+        return "exact"
+
+# Improved logic
+def classify_intent(query, indexStatus):
+    embeddings_coverage = indexStatus.embeddings_coverage_percent
+    
+    if embeddings_coverage < 50:
+        # Below 50% coverage, fall back to exact even for natural-language queries
+        return "exact" if indexStatus.indexed else "ripgrep"
+    elif is_natural_language(query):
+        return "hybrid"
+    else:
+        return "exact"
+```
+
+---
+
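+## Validating the User's Objections
+
+### ❓ Why are embeddings not generated for code files?
+
+**Answer**: It is not that embeddings are "not generated for code files"; rather:
+- ✅ All code files have an FTS index
+- ❌ The `embeddings-generate` command has a bug: it **processes only the root directory**
+- ❌ The subdirectory index databases do not even **have a semantic_chunks table**
+
+### ❓ FTS and vector should index the same content
+
+**Answer**: **Exactly right!** The current state is:
+- FTS: 303/303 (100%)
+- Vector: 5/303 (1.6%)
+
+**This is a serious inconsistency that violates the design principle.**
+
+### ❓ Init should return an index summary
+
+**Answer**: **Exactly right!** init currently returns only a bare success message; it should return:
+- FTS index statistics
+- Embeddings coverage
+- Feature status
+- Warnings (if coverage is incomplete)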
+
+---
+
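+## Test Verification
+
+### What Hybrid Search Actually Does Today
+
+```javascript
+// Current query
+smart_search(query="authentication patterns", mode="hybrid")
+
+// Actual search scope:
+// ✅ Searchable files: 5 (root-directory files, mostly .md)
+// ❌ Unsearchable files: 298 code files
+// Result: only documentation files are returned; code files are ignored
+```
+
+### After the Fix (Ideal State)
+
+```javascript
+// After the fix
+smart_search(query="authentication patterns", mode="hybrid")
+
+// Actual search scope:
+// ✅ Searchable files: 303 (all files)
+// Result: combined results covering both code and documentation files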
+```
+
+---
+
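+## Suggested Fix Priorities
+
+### P0 - Urgent Fixes
+
+1. **Fix the `embeddings-generate` command**
+   - Recursively process the `_index.db` in every subdirectory
+   - Create the semantic_chunks table in each `_index.db`
+
+2. **Improve what `init` returns**
+   - Return detailed index statistics
+   - Show embeddings coverage
+   - Emit a warning if coverage is incomplete
+
+### P1 - Important Improvements
+
+3. **Adaptive Smart Search routing**
+   - Check embeddings coverage
+   - Fall back to exact mode automatically when coverage is low
+
+4. **Status command enhancements**
+   - Show the index status of each subdirectory
+   - Show how embeddings are distributed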
+
+---
+
+## Temporary Workarounds
+
+### Currently Recommended Usage
+
+```javascript
+// 1. Documentation search - use hybrid (embeddings available)
+smart_search(query="architecture design patterns", mode="hybrid")
+
+// 2. Code search - use exact (no embeddings, but FTS is available)
+smart_search(query="function executeQuery", mode="exact")
+
+// 3. Quick search - use ripgrep (across all files)
+smart_search(query="TODO", mode="ripgrep")
+```
+
+### Workaround for Full Coverage
+
+```bash
+# Manually generate embeddings for every subdirectory (if CodexLens supports it)
+cd D:\Claude_dms3\ccw
+
+# Run once per subdirectory
+python -m codexlens embeddings-generate ./src/tools --force
+python -m codexlens embeddings-generate ./src/commands --force
+# ... repeat 26 times
+
+# Or automate with a script
+python check_embeddings.py --generate-all
+```
+
+---
+
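+## Summary
+
+| User's objection | Verdict | Conclusion |
+|---------|------|------|
+| Why are embeddings not generated for code? | ✅ Correct | It is a bug, not by design |
+| FTS and vector content should match | ✅ Correct | Currently severely inconsistent |
+| Init should return a detailed summary | ✅ Correct | Currently insufficient information |
+
+**All of the user's objections are correct and expose three core problems in CodexLens:**
+
+1. **Incomplete embeddings generation** (only 1.6% coverage)
+2. **Index consistency problem** (FTS vs. vector)
+3. **Opaque return information** (missing statistics)
+
+---
+
+**Generated**: 2025-12-17  
+**Verification method**: `python check_embeddings.py`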
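The corrected analysis above proposes routing on embeddings coverage and pairing the fallback with a warning. The sketch below shows one way those two pieces could fit together, written in Python for illustration even though the real routing lives in `smart-search.ts`. It assumes the 50% threshold and the `embeddings_coverage_percent` field named in the analysis; `route`, `looks_like_natural_language`, the word-count heuristic, and the warning wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

COVERAGE_THRESHOLD = 50.0  # percent; the threshold named in the analysis


@dataclass
class IndexStatus:
    indexed: bool
    embeddings_coverage_percent: float


def looks_like_natural_language(query: str) -> bool:
    """Hypothetical heuristic: three or more whitespace-separated words."""
    return len(query.split()) >= 3


def route(query: str, status: IndexStatus) -> tuple[str, Optional[str]]:
    """Return (mode, warning); warning is None unless a fallback happened."""
    if not status.indexed:
        return "ripgrep", "No index found; run init to enable indexed search."
    if not looks_like_natural_language(query):
        return "exact", None
    if status.embeddings_coverage_percent < COVERAGE_THRESHOLD:
        warning = (
            f"Embeddings cover only {status.embeddings_coverage_percent:.1f}% of files; "
            "falling back to exact mode. Regenerate embeddings recursively "
            "to enable hybrid search."
        )
        return "exact", warning
    return "hybrid", None


print(route("how is authentication handled", IndexStatus(True, 1.6)))
```

Returning the warning alongside the mode, rather than printing it, keeps the decision testable and lets the caller decide where to surface the message.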
--- a/ccw/check_embeddings.py
+++ b/ccw/check_embeddings.py
@@ -0,0 +1,47 @@
+import sqlite3
+import os
+
+# Find all _index.db files
+root_dir = r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw'
+index_files = []
+for dirpath, dirnames, filenames in os.walk(root_dir):
+    if '_index.db' in filenames:
+        index_files.append(os.path.join(dirpath, '_index.db'))
+
+print(f'Found {len(index_files)} index databases\n')
+
+total_files = 0
+total_chunks = 0
+dirs_with_chunks = 0
+
+for db_path in sorted(index_files):
+    rel_path = os.path.relpath(db_path, root_dir)
+    conn = sqlite3.connect(db_path)
+    
+    try:
+        cursor = conn.execute('SELECT COUNT(*) FROM files')
+        file_count = cursor.fetchone()[0]
+        total_files += file_count
+        
+        try:
+            cursor = conn.execute('SELECT COUNT(*) FROM semantic_chunks')
+            chunk_count = cursor.fetchone()[0]
+            total_chunks += chunk_count
+            
+            if chunk_count > 0:
+                dirs_with_chunks += 1
+                print(f'[+] {rel_path:<40} Files: {file_count:3d}  Chunks: {chunk_count:3d}')
+            else:
+                print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no chunks)')
+        except sqlite3.OperationalError:
+            print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no semantic_chunks table)')
+    except Exception as e:
+        print(f'[!] {rel_path:<40} Error: {e}')
+    finally:
+        conn.close()
+
+print(f'\n=== Summary ===')
+print(f'Total index databases: {len(index_files)}')
+print(f'Directories with embeddings: {dirs_with_chunks}')
+print(f'Total files indexed: {total_files}')
+print(f'Total semantic chunks: {total_chunks}')
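The script above hardcodes one machine's index root and prints as it goes. For reuse on other projects, the same counting logic could be wrapped in a function that takes the root as a parameter and returns the totals. This is a minimal sketch under the same assumptions the script makes (each `_index.db` has a `files` table and, when embeddings exist, a `semantic_chunks` table); `summarize_indexes` is a hypothetical name and is not part of the commit.

```python
import sqlite3
from pathlib import Path


def summarize_indexes(root: Path) -> dict:
    """Aggregate file and chunk counts over every _index.db found under root."""
    totals = {"databases": 0, "with_embeddings": 0, "files": 0, "chunks": 0}
    for db_path in sorted(root.rglob("_index.db")):
        totals["databases"] += 1
        conn = sqlite3.connect(db_path)
        try:
            totals["files"] += conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
            try:
                chunks = conn.execute("SELECT COUNT(*) FROM semantic_chunks").fetchone()[0]
            except sqlite3.OperationalError:
                chunks = 0  # table missing: embeddings were never generated here
            totals["chunks"] += chunks
            if chunks > 0:
                totals["with_embeddings"] += 1
        finally:
            conn.close()
    return totals
```

Calling `summarize_indexes(Path(index_root))` on the index directory of a project would yield the same four totals the script prints in its summary block.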
--- a/ccw/src/tools/smart-search.ts
+++ b/ccw/src/tools/smart-search.ts