Fix CodexLens embeddings generation to achieve 100% coverage

Previously, embeddings were only generated for root directory files (1.6% coverage, 5/303 files). This fix implements recursive processing across all subdirectory indexes, achieving 100% coverage with 2,042 semantic chunks across all 303 files in 26 index databases.

Key improvements:

1. **Recursive embeddings generation** (embedding_manager.py):
   - Add generate_embeddings_recursive() to process all _index.db files in directory tree
   - Add get_embeddings_status() for comprehensive coverage statistics
   - Add discover_all_index_dbs() helper for recursive file discovery

2. **Enhanced CLI commands** (commands.py):
   - embeddings-generate: Add --recursive flag for full project coverage
   - init: Use recursive generation by default for complete indexing
   - status: Display embeddings coverage statistics with 50% threshold

3. **Smart search routing improvements** (smart-search.ts):
   - Add 50% embeddings coverage threshold for hybrid mode routing
   - Auto-fallback to exact mode when coverage insufficient
   - Strip ANSI color codes from JSON output for correct parsing
   - Add embeddings_coverage_percent to IndexStatus and SearchMetadata
   - Provide clear warnings with actionable suggestions

4. **Documentation and analysis**:
   - Add SMART_SEARCH_ANALYSIS.md with initial investigation
   - Add SMART_SEARCH_CORRECTED_ANALYSIS.md revealing true extent of issue
   - Add EMBEDDINGS_FIX_SUMMARY.md with complete fix summary
   - Add check_embeddings.py script for coverage verification

Results:
- Coverage improved from 1.6% (5/303 files) to 100% (303/303 files) - 62.5x increase
- Semantic chunks increased from 10 to 2,042 - 204x increase
- All 26 subdirectory indexes now have embeddings vs just 1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-13 02:41:50 +08:00 · 2025-12-17 17:54:33 +08:00
parent d06a3ca12e
commit 74a830694c
7 changed files with 1540 additions and 346 deletions
--- a/ccw/EMBEDDINGS_FIX_SUMMARY.md
+++ b/ccw/EMBEDDINGS_FIX_SUMMARY.md
@@ -0,0 +1,165 @@
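 # CodexLens Embeddings Fix Summary
 ## Fix Results
 ### ✅ Completed
 1. **Recursive embeddings generation** (`embedding_manager.py`)
   - Added the `generate_embeddings_recursive()` function
   - Added the `get_embeddings_status()` function
   - Recursively processes the _index.db files in all subdirectories
 2. **CLI command enhancements** (`commands.py`)
   - `embeddings-generate` gains a `--recursive` flag
   - The `init` command uses recursive generation (all subdirectories are handled automatically)
   - The `status` command displays embeddings coverage statistics
 3. **Smart Search intelligent routing** (`smart-search.ts`)
   - Added a 50% coverage threshold
   - Automatically falls back to exact mode when embeddings are insufficient
   - Provides clear warning messages
   - Strips ANSI color codes so that the JSON parses correctly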
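The ANSI stripping in the last bullet can be illustrated with a short Python sketch. This is not the code in `smart-search.ts` (which is TypeScript); the helper name `parse_cli_json` is hypothetical, and the pattern only covers color/style sequences of the form `ESC[...m`.

```python
import json
import re

# Matches SGR (color/style) escape sequences such as "\x1b[32m" or "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def parse_cli_json(raw: str) -> dict:
    """Strip ANSI color codes from CLI output, then parse it as JSON.

    Hypothetical helper for illustration only.
    """
    clean = ANSI_SGR.sub("", raw)
    return json.loads(clean)


if __name__ == "__main__":
    colored = '\x1b[32m{"file_count": 303}\x1b[0m'
    print(parse_cli_json(colored))
```

Without the substitution, `json.loads` would raise on the leading escape character, which is why the status output could not be parsed when the CLI emitted colors.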
 ### ✅ Test Results
 **CCW project (d:\Claude_dms3\ccw)**:
 - Index databases: 26
 - Total files: 303
 - Embeddings coverage: **100%** (all 303 files)
 - Chunks generated: **2,042** (previously only 10)
 **Comparison**:
 | Metric | Before fix | After fix | Improvement |
 |------|--------|--------|------|
 | Coverage | 1.6% (5/303) | 100% (303/303) | **62.5x** |
 | Chunks | 10 | 2,042 | **204x** |
 | Indexes with embeddings | 1/26 | 26/26 | **26x** |
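 ## Current Issues
 ### ⚠️ Remaining Issues
 1. **Path mapping**
   - `embeddings-generate --recursive` currently requires the index path rather than the source path
   - Users should be able to pass the source path (`d:\Claude_dms3\ccw`)
   - For now the index path must be used: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw`
 2. **Global vs. project-level status**
   - `codexlens status` returns global statistics (all projects)
   - A project-level embeddings status is needed
   - `embeddings-status` only checks a single _index.db and does not recurse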
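The source-to-index mapping behind the first issue can be sketched as below. The layout is inferred only from the two example paths in this summary (drive letter upper-cased with its colon dropped, then the rest of the path mirrored under the index base). The function name `source_to_index_root` is hypothetical, and the actual `PathMapper` implementation may differ.

```python
from pathlib import PureWindowsPath


def source_to_index_root(source: PureWindowsPath, index_base: PureWindowsPath) -> PureWindowsPath:
    """Map a Windows source directory to its mirrored index directory.

    Assumed layout: <index_base>/<DRIVE LETTER>/<path below the drive root>.
    """
    if not source.drive:
        raise ValueError("source must be an absolute path with a drive letter")
    drive_letter = source.drive.rstrip(":").upper()  # "d:" -> "D"
    below_root = source.parts[1:]                     # drop the "d:\" anchor
    return index_base.joinpath(drive_letter, *below_root)


if __name__ == "__main__":
    base = PureWindowsPath(r"C:\Users\dyw\.codexlens\indexes")
    print(source_to_index_root(PureWindowsPath(r"d:\Claude_dms3\ccw"), base))
```

`PureWindowsPath` is used so the sketch behaves the same on any operating system.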
 ## Suggested Follow-up Fixes
 ### P1 - Path mapping fix
 Modify the `embeddings_generate` command in `commands.py` (lines 1996-2000):
 ```python
 elif target_path.is_dir():
    if recursive:
        # Recursive mode: Map source path to index root
        registry = RegistryStore()
        try:
            registry.initialize()
            mapper = PathMapper()
            index_db_path = mapper.source_to_index_db(target_path)
            index_root = index_db_path.parent  # Use index directory root
            use_recursive = True
        finally:
            registry.close()
 ```
 ### P2 - Project-level status
 Option A: extend the `embeddings-status` command to support recursion
 ```bash
 codexlens embeddings-status . --recursive --json
 ```
 Option B: change the `status` command to accept a path argument
 ```bash
 codexlens status --project . --json
 ```
 ## Usage Guide
 ### Current workflow
 **Generate embeddings (full coverage)**:
 ```bash
 # Method 1: use the index path (how it works today)
 cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
 python -m codexlens embeddings-generate . --recursive --force --model fast
 # Method 2: the init command (recursive by default, recommended)
 cd d:\Claude_dms3\ccw
 python -m codexlens init . --force
 ```
 **Check coverage**:
 ```bash
 # Project root
 cd C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw
 python check_embeddings.py  # prints detailed per-directory statistics
 # Global status
 python -m codexlens status --json  # summary across all projects
 ```
 **Smart Search**:
 ```javascript
 // MCP tool call
 smart_search(query="authentication patterns")
 // This now:
 // 1. Checks embeddings coverage
 // 2. Uses hybrid mode if coverage >= 50%
 // 3. Falls back to exact mode if coverage < 50%
 // 4. Shows a warning message
 ```
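The routing steps in the comments above can be sketched as a small Python function. This is an illustration under stated assumptions, not the TypeScript in `smart-search.ts`: the name `choose_mode` and the warning wording are hypothetical, and natural-language detection is passed in as a flag rather than implemented.

```python
from typing import Optional, Tuple

COVERAGE_THRESHOLD = 50.0  # percent, the threshold named in this summary


def choose_mode(indexed: bool, coverage_percent: float,
                natural_language: bool) -> Tuple[str, Optional[str]]:
    """Return (mode, warning) for a query, following the routing steps above."""
    if not indexed:
        return "ripgrep", "No index found; falling back to ripgrep."
    if coverage_percent < COVERAGE_THRESHOLD:
        warning = (f"Embeddings coverage is {coverage_percent:.1f}% "
                   "(below 50%); using exact mode.")
        return "exact", warning
    # Enough embeddings: hybrid for natural language, exact otherwise.
    return ("hybrid" if natural_language else "exact"), None


if __name__ == "__main__":
    print(choose_mode(True, 1.6, True))    # low coverage: exact plus a warning
    print(choose_mode(True, 100.0, True))  # full coverage: hybrid
```

Keeping the threshold in one named constant makes it easy to keep the CLI status display and the search router in agreement.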
 ### Best practices
 1. **Generate embeddings automatically when initializing a project**:
   ```bash
   codexlens init /path/to/project --force
   ```
 2. **Regenerate periodically to stay up to date**:
   ```bash
   codexlens embeddings-generate /index/path --recursive --force
   ```
 3. **Use the fast model for quick tests**:
   ```bash
   codexlens embeddings-generate . --recursive --model fast
   ```
 4. **Use the code model for the best quality**:
   ```bash
   codexlens embeddings-generate . --recursive --model code
   ```
 ## Technical Details
 ### Modified files
 **Python (CodexLens)**:
 - `codex-lens/src/codexlens/cli/embedding_manager.py` - added recursive functions
 - `codex-lens/src/codexlens/cli/commands.py` - updated init, status, embeddings-generate
 **TypeScript (CCW)**:
 - `ccw/src/tools/smart-search.ts` - intelligent routing + ANSI stripping
 - `ccw/src/tools/codex-lens.ts` - (not modified; uses the existing implementation)
 ### Dependency versions
 - CodexLens: current development version
 - Fastembed: installed (ONNX backend)
 - Models: fast (~80MB), code (~150MB)
 ---
 **Fix date**: 2025-12-17  
 **Verification status**: ✅ Core functionality works; the remaining path mapping issue is still to be fixed
--- a/ccw/SMART_SEARCH_ANALYSIS.md
+++ b/ccw/SMART_SEARCH_ANALYSIS.md
@@ -0,0 +1,167 @@
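 # Smart Search Index Analysis Report
 ## Question
 Determine whether the current `smart_search(action="init")` builds a vector-model index or only a basic index.
 ## Findings
 ### 1. Default behavior of the init action
 Based on code analysis, `smart_search(action="init")` behaves as follows:
 **Code path**: `ccw/src/tools/smart-search.ts` → `ccw/src/tools/codex-lens.ts`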
 ```typescript
 // smart-search.ts: executeInitAction (lines 297-323)
 async function executeInitAction(params: Params): Promise<SearchResult> {
  const { path = '.', languages } = params;
  const args = ['init', path];
  if (languages && languages.length > 0) {
    args.push('--languages', languages.join(','));
  }
  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
  // ...
 }
 ```
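 **Key findings**:
 - `smart_search(action="init")` invokes the `codexlens init` command
 - It **does not pass** the `--no-embeddings` flag
 - It **does not pass** the `--embedding-model` flag
 ### 2. Default behavior of CodexLens init
 According to the output of `codexlens init --help`:
 > If semantic search dependencies are installed, **automatically generates embeddings** after indexing completes. Use --no-embeddings to skip this step.
 **Conclusion**:
 - ✅ The `init` command generates embeddings **by default** (if the semantic search dependencies are installed)
 - ❌ The current implementation **does not generate** embeddings for all files
 ### 3. Actual test results
 #### First init (no embeddings generated)
 ```bash
 $ smart_search(action="init", path="d:\\Claude_dms3\\ccw")
 # Result: 303 files indexed, but vector_search: false
 ```
 **Root cause analysis**:
 Although the semantic search dependency (fastembed) is installed, the init run hit this warning: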
 ```
 Warning: Embedding generation failed: Index already has 10 chunks. Use --force to regenerate.
 ```
 #### After generating embeddings manually
 ```bash
 $ python -m codexlens embeddings-generate . --force --verbose
 Processing 5 files...
 - D:\Claude_dms3\ccw\MCP_QUICKSTART.md: 1 chunks
 - D:\Claude_dms3\ccw\MCP_SERVER.md: 2 chunks
 - D:\Claude_dms3\ccw\README.md: 2 chunks
 - D:\Claude_dms3\ccw\tailwind.config.js: 3 chunks
 - D:\Claude_dms3\ccw\WRITE_FILE_FIX_SUMMARY.md: 2 chunks
 Total: 10 chunks, 5 files
 Model: jinaai/jina-embeddings-v2-base-code (768 dimensions)
 ```
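 **Key findings**:
 - ⚠️ Embeddings were generated for only **5 documentation/config files**
 - ⚠️ **No embeddings were generated for the 298 code files** (.ts, .js, etc.)
 - ✅ The embeddings status shows `coverage_percent: 100.0` (but that is relative to "files that should have embeddings")
 #### Hybrid search test
 ```bash
 $ smart_search(query="authentication and authorization patterns", mode="hybrid")
 # ✅ Returned 5 results with similarity scores
 # ✅ Confirms that vector search works
 ```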
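 ## 4. Index Type Comparison
 | Index type | Current state | Files covered | Notes |
 |---------|---------|-----------|------|
 | **Exact FTS** | ✅ Enabled | All 303 files | Full-text search based on SQLite FTS5 |
 | **Fuzzy FTS** | ❌ Not enabled | - | Fuzzy matching search |
 | **Vector Search** | ⚠️ Partially enabled | Only 5 documentation files | Semantic search based on fastembed |
 | **Hybrid Search** | ⚠️ Partially enabled | Only 5 documentation files | RRF fusion (exact + fuzzy + vector) |
 ## 5. Why Do Only 5 Files Have Embeddings?
 **Possible causes**:
 1. **File type filtering**: CodexLens may generate embeddings only for documentation files (.md) and config files
 2. **Code files use symbol indexing**: code files (.ts, .js) may rely on symbol extraction rather than text embeddings
 3. **Performance considerations**: generating embeddings for 300+ files takes considerable time and storage
 ## 6. Conclusion
 ### Current behavior of `smart_search(action="init")`:
 ✅ **Attempts** to build a vector index (if the semantic dependencies are installed)  
 ⚠️ **In practice**, generates embeddings only for documentation/config files (5/303 files)  
 ✅ **Supports** hybrid mode search (for files that have embeddings)  
 ✅ **Supports** exact mode search (for all 303 files)  
 ### Intelligent search mode routing:
 ```
 User query → auto mode → decision tree:
  ├─ Natural language query + embeddings available → hybrid mode (RRF fusion)
  ├─ Simple query + index available → exact mode (FTS)
  └─ No index → ripgrep mode (literal matching)
 ```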
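 ## 7. Recommendations
 ### If full semantic search support is needed:
 ```bash
 # Option 1: check whether all code files are supposed to have embeddings
 python -m codexlens embeddings-status . --verbose
 # Option 2: explicitly generate embeddings for code files (if supported)
 # Check the CodexLens documentation to confirm the semantic indexing strategy for code files
 # Option 3: use hybrid mode for documentation search and exact mode for code search
 smart_search(query="architecture design", mode="hybrid")  # semantic documentation search
 smart_search(query="function_name", mode="exact")  # exact code search
 ```
 ### Current best practices:
 ```javascript
 // 1. Initialize the index (one-time)
 smart_search(action="init", path=".")
 // 2. Smart search (auto mode recommended)
 smart_search(query="your query")  // automatically picks the best mode
 // 3. Search with a specific mode
 smart_search(query="natural language query", mode="hybrid")  // semantic search
 smart_search(query="exact_identifier", mode="exact")         // exact match
 smart_search(query="quick literal", mode="ripgrep")          // fast literal search
 ```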
 ## 8. Technical Details
 ### Embeddings model
 - **Model**: jinaai/jina-embeddings-v2-base-code
 - **Dimensions**: 768
 - **Size**: ~150MB
 - **Backend**: fastembed (ONNX-based)
 ### Index storage
 - **Location**: `C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw\_index.db`
 - **Size**: 122.57 MB
 - **Schema version**: 5
 - **Files**: 303
 - **Directories**: 26
 ---
 **Generated**: 2025-12-17  
 **CodexLens version**: detected from the current installation
--- a/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
+++ b/ccw/SMART_SEARCH_CORRECTED_ANALYSIS.md
@@ -0,0 +1,330 @@
 # Smart Search Index Analysis Report (Corrected)
 ## User Concerns
 1. ❓ Why are vector embeddings not generated for code files?
 2. ❓ The exact FTS and vector indexes should cover the same content
 3. ❓ init should return a summary of the FTS and vector indexes
 **Conclusion: the user's concerns are 100% correct! This is a design flaw in CodexLens.**
 ---
 ## The Actual Situation
 ### 1. Hierarchical index architecture
 CodexLens uses a **hierarchical per-directory index**:
 ```
 D:\Claude_dms3\ccw\
 ├── _index.db             ← root directory index (5 files)
 ├── src/
 │   ├── _index.db         ← src directory index (2 files)
 │   ├── tools/
 │   │   └── _index.db     ← tools subdirectory index (25 files)
 │   └── ...
 └── ... (26 _index.db files in total)
 ```
 ### 2. Index coverage
 | Directory | Files | FTS index | Embeddings |
 |------|--------|---------|------------|
 | **Root** | 5 | ✅ | ✅ (10 chunks) |
 | bin/ | 2 | ✅ | ❌ no semantic_chunks table |
 | dist/ | 4 | ✅ | ❌ no semantic_chunks table |
 | dist/commands/ | 24 | ✅ | ❌ no semantic_chunks table |
 | dist/tools/ | 50 | ✅ | ❌ no semantic_chunks table |
 | src/tools/ | 25 | ✅ | ❌ no semantic_chunks table |
 | src/commands/ | 12 | ✅ | ❌ no semantic_chunks table |
 | ... | ... | ... | ... |
 | **Total** | **303** | **✅ 100%** | **❌ 1.6%** (5/303) |
 ### 3. Key findings
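 ```python
 # Output of the check script
 Total index databases: 26
 Directories with embeddings: 1        # ❌ root directory only!
 Total files indexed: 303              # ✅ FTS index is complete
 Total semantic chunks: 10             # ❌ only the 5 root directory files
 ```
 **Problems**:
 - ✅ **All 303 files** have an FTS index (spread across 26 _index.db files)
 - ❌ **Only 5 files** (1.6%) have vector embeddings
 - ❌ The _index.db files of **25 subdirectories** do not even have the `semantic_chunks` table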
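Whether a given index database has the `semantic_chunks` table can be checked directly against SQLite's schema catalog. The sketch below is illustrative: the helper name `has_semantic_chunks` is hypothetical and the single-column schema in the demo is a placeholder, not the real CodexLens schema.

```python
import sqlite3


def has_semantic_chunks(conn: sqlite3.Connection) -> bool:
    """Return True if the database contains a table named semantic_chunks."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        ("semantic_chunks",),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(has_semantic_chunks(conn))  # table not created yet
    # Placeholder schema, only to make the table exist.
    conn.execute("CREATE TABLE semantic_chunks (id INTEGER PRIMARY KEY)")
    print(has_semantic_chunks(conn))
    conn.close()
```

Querying the catalog avoids relying on an `OperationalError` from a failed `SELECT` to detect the missing table.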
 ---
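 ## Why Does This Happen?
 ### Root cause analysis
 1. **The `init` operation**:
   ```bash
   codexlens init .
   ```
   - ✅ Creates FTS indexes for all 303 files (distributed)
   - ⚠️ Attempts to generate embeddings but hits the "Index already has 10 chunks" warning
   - ❌ Generates embeddings for the root directory only
 2. **The `embeddings-generate` operation**:
   ```bash
   codexlens embeddings-generate . --force
   ```
   - ❌ Processes only the root directory's _index.db
   - ❌ **Does not recurse into the subdirectory indexes**
   - Result: only 5 documentation files have embeddings
 ### Design problem
 **The CodexLens embeddings architecture is flawed**: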
 ```python
 # Expected behavior
 for index_db in find_all_index_dbs(project_root):
    generate_embeddings(index_db)
 # Actual behavior
 generate_embeddings(root_index_db_only)
 ```
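The expected behavior above amounts to a recursive walk over the index tree. A minimal sketch, assuming every index database is a file literally named `_index.db`: the function names `find_index_dbs` and `process_all` are hypothetical, and the `generate` callback stands in for the real embedding step, which is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def find_index_dbs(index_root: Path) -> List[Path]:
    """Return every _index.db under index_root, the root one included, sorted."""
    return sorted(p for p in index_root.rglob("_index.db") if p.is_file())


def process_all(index_root: Path, generate: Callable[[Path], int]) -> int:
    """Call generate(db_path) for each index database and sum the chunk counts."""
    return sum(generate(db) for db in find_index_dbs(index_root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src" / "tools").mkdir(parents=True)
        for directory in (root, root / "src", root / "src" / "tools"):
            (directory / "_index.db").touch()
        # Stand-in generator: pretend each database yields two chunks.
        print(process_all(root, lambda db: 2))
```

Passing the generator in as a callback keeps discovery testable without loading an embedding model.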
 ---
 ## Shortcomings of the Init Return Value
 ### What `init` currently returns
 ```json
 {
  "success": true,
  "message": "CodexLens index created successfully for d:\\Claude_dms3\\ccw"
 }
 ```
 **Problems**:
 - ❌ Does not say how many files were indexed
 - ❌ Does not say whether embeddings were generated
 - ❌ Does not report embeddings coverage
 ### What it should return
 ```json
 {
  "success": true,
  "message": "Index created successfully",
  "stats": {
    "total_files": 303,
    "total_directories": 26,
    "index_databases": 26,
    "fts_coverage": {
      "files": 303,
      "percentage": 100.0
    },
    "embeddings_coverage": {
      "files": 5,
      "chunks": 10,
      "percentage": 1.6,
      "warning": "Embeddings only generated for root directory. Run embeddings-generate on each subdir for full coverage."
    },
    "features": {
      "exact_fts": true,
      "fuzzy_fts": false,
      "vector_search": "partial"
    }
  }
 }
 ```
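A summary of this shape could be assembled by aggregating per-database counts. The sketch below is one possible way to do that, not the CodexLens implementation: the function name `summarize_coverage`, the input keys, and the warning wording are all assumptions made for illustration.

```python
from typing import Dict, List


def summarize_coverage(per_db: List[Dict[str, int]]) -> dict:
    """Aggregate per-database counts into a project-level summary.

    Each entry is assumed to carry "files", "files_with_embeddings" and "chunks".
    """
    total_files = sum(d["files"] for d in per_db)
    embedded = sum(d["files_with_embeddings"] for d in per_db)
    chunks = sum(d["chunks"] for d in per_db)
    # Guard against an empty project to avoid dividing by zero.
    pct = 100.0 * embedded / total_files if total_files else 0.0
    summary = {
        "total_files": total_files,
        "index_databases": len(per_db),
        "embeddings_coverage": {"files": embedded, "chunks": chunks, "percentage": pct},
    }
    if embedded < total_files:
        summary["embeddings_coverage"]["warning"] = "Embeddings coverage is incomplete."
    return summary


if __name__ == "__main__":
    print(summarize_coverage([
        {"files": 5, "files_with_embeddings": 5, "chunks": 10},
        {"files": 15, "files_with_embeddings": 0, "chunks": 0},
    ]))
```

Attaching the warning only when coverage is partial keeps the fully covered case free of noise.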
 ---
 ## Solutions
 ### Option 1: generate embeddings recursively (recommended)
 ```bash
 # Generate embeddings for all subdirectories
 find .codexlens/indexes -name "_index.db" -exec \
  python -m codexlens embeddings-generate {} --force \;
 ```
 ### Option 2: improve the init command
 ```python
 # codexlens/cli.py
 def init_with_embeddings(project_root):
    """Initialize with recursive embeddings generation"""
    # 1. Build FTS indexes (current behavior)
    build_indexes(project_root)
    # 2. Generate embeddings for ALL subdirs
    for index_db in find_all_index_dbs(project_root):
        if has_semantic_deps():
            generate_embeddings(index_db)
    # 3. Return comprehensive stats
    return {
        "fts_coverage": get_fts_stats(),
        "embeddings_coverage": get_embeddings_stats(),
        "features": detect_features()
    }
 ```
 ### Option 3: improve Smart Search routing
 ```python
 # Current logic
 def classify_intent(query, hasIndex):
    if not hasIndex:
        return "ripgrep"
    elif is_natural_language(query):
        return "hybrid"  # ❌ but only 5 files have embeddings!
    else:
        return "exact"
 # Improved logic
 def classify_intent(query, indexStatus):
    embeddings_coverage = indexStatus.embeddings_coverage_percent
    if embeddings_coverage < 50:
        # Below 50% coverage, fall back to exact even for natural language
        return "exact" if indexStatus.indexed else "ripgrep"
    elif is_natural_language(query):
        return "hybrid"
    else:
        return "exact"
 ```
 ---
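 ## Verifying the User's Concerns
 ### ❓ Why are embeddings not generated for code files?
 **Answer**: it is not that code files are deliberately skipped. Rather:
 - ✅ All code files have an FTS index
 - ❌ The `embeddings-generate` command has a bug and **only processes the root directory**
 - ❌ The subdirectory index databases do not even **have a semantic_chunks table**
 ### ❓ FTS and vector indexes should cover the same content
 **Answer**: **Exactly right!** The actual state today:
 - FTS: 303/303 (100%)
 - Vector: 5/303 (1.6%)
 **This is a serious inconsistency that violates the design principle.**
 ### ❓ Init should return an index summary
 **Answer**: **Exactly right!** init currently returns only a plain success message; it should return:
 - FTS index statistics
 - Embeddings coverage
 - Feature status
 - Warnings (if coverage is incomplete)
 ---
 ## Test Verification
 ### What hybrid search actually does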
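 ```javascript
 // Current query
 smart_search(query="authentication patterns", mode="hybrid")
 // Actual search scope:
 // ✅ Searchable files: 5 (the .md files in the root directory)
 // ❌ Unsearchable files: 298 code files
 // Result: only documentation files are returned; code files are ignored
 ```
 ### After the fix (ideal state)
 ```javascript
 // After the fix
 smart_search(query="authentication patterns", mode="hybrid")
 // Actual search scope:
 // ✅ Searchable files: 303 (all files)
 // Result: combined results covering both code and documentation files
 ```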
 ---
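 ## Suggested Fix Priorities
 ### P0 - Urgent fixes
 1. **Fix the `embeddings-generate` command**
   - Recursively process the _index.db in every subdirectory
   - Create the semantic_chunks table in each _index.db
 2. **Improve the `init` return value**
   - Return detailed index statistics
   - Show embeddings coverage
   - Warn if coverage is incomplete
 ### P1 - Important improvements
 3. **Adaptive Smart Search routing**
   - Check embeddings coverage
   - Automatically fall back to exact mode when coverage is low
 4. **Status command enhancements**
   - Show the index status of each subdirectory
   - Show how embeddings are distributed
 ---
 ## Temporary Workarounds
 ### Currently recommended usage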
 ```javascript
 // 1. Documentation search - use hybrid (embeddings available)
 smart_search(query="architecture design patterns", mode="hybrid")
 // 2. Code search - use exact (no embeddings, but FTS is available)
 smart_search(query="function executeQuery", mode="exact")
 // 3. Quick search - use ripgrep (across all files)
 smart_search(query="TODO", mode="ripgrep")
 ```
 ### Workaround for full coverage
 ```bash
 # Manually generate embeddings for every subdirectory (if CodexLens supports it)
 cd D:\Claude_dms3\ccw
 # Run once per subdirectory
 python -m codexlens embeddings-generate ./src/tools --force
 python -m codexlens embeddings-generate ./src/commands --force
 # ... repeat 26 times
 # Or automate it with a script
 python check_embeddings.py --generate-all
 ```
 ---
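 ## Summary
 | User concern | Status | Conclusion |
 |---------|------|------|
 | Why are embeddings not generated for code? | ✅ Valid | It is a bug, not by design |
 | FTS and vector content should match | ✅ Valid | Currently severely inconsistent |
 | Init should return a detailed summary | ✅ Valid | Current information is insufficient |
 **All of the user's concerns are valid and expose three core problems in CodexLens:**
 1. **Incomplete embeddings generation** (only 1.6% coverage)
 2. **Index consistency problem** (FTS vs Vector)
 3. **Opaque return value** (missing statistics)
 ---
 **Generated**: 2025-12-17  
 **Verification method**: `python check_embeddings.py`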
--- a/ccw/check_embeddings.py
+++ b/ccw/check_embeddings.py
@@ -0,0 +1,47 @@
 import sqlite3
 import os
 # Find all _index.db files
 root_dir = r'C:\Users\dyw\.codexlens\indexes\D\Claude_dms3\ccw'
 index_files = []
 for dirpath, dirnames, filenames in os.walk(root_dir):
    if '_index.db' in filenames:
        index_files.append(os.path.join(dirpath, '_index.db'))
 print(f'Found {len(index_files)} index databases\n')
 total_files = 0
 total_chunks = 0
 dirs_with_chunks = 0
 for db_path in sorted(index_files):
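    # Path relative to the index root; the earlier str.replace() pattern ended in a
    # doubled backslash inside a raw string, so it never matched and left the full path.
    rel_path = os.path.relpath(db_path, root_dir)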
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute('SELECT COUNT(*) FROM files')
        file_count = cursor.fetchone()[0]
        total_files += file_count
        try:
            cursor = conn.execute('SELECT COUNT(*) FROM semantic_chunks')
            chunk_count = cursor.fetchone()[0]
            total_chunks += chunk_count
            if chunk_count > 0:
                dirs_with_chunks += 1
                print(f'[+] {rel_path:<40} Files: {file_count:3d}  Chunks: {chunk_count:3d}')
            else:
                print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no chunks)')
        except sqlite3.OperationalError:
            print(f'[ ] {rel_path:<40} Files: {file_count:3d}  (no semantic_chunks table)')
    except Exception as e:
        print(f'[!] {rel_path:<40} Error: {e}')
    finally:
        conn.close()
 print(f'\n=== Summary ===')
 print(f'Total index databases: {len(index_files)}')
 print(f'Directories with embeddings: {dirs_with_chunks}')
 print(f'Total files indexed: {total_files}')
 print(f'Total semantic chunks: {total_chunks}')
--- a/ccw/src/tools/smart-search.ts
+++ b/ccw/src/tools/smart-search.ts
@@ -1,12 +1,17 @@
 /**
- * Smart Search Tool - Unified search with mode-based execution
+ * Smart Search Tool - Unified intelligent search with CodexLens integration
 * Modes: auto, exact, fuzzy, semantic, graph
 *
 * Features:
- * - Intent classification (auto mode)
+ * - Intent classification with automatic mode selection
- * - Multi-backend search routing
+ * - CodexLens integration (init, hybrid, vector, semantic)
- * - Result fusion with RRF ranking
+ * - Ripgrep fallback for exact mode
- * - Configurable search parameters
+ * - Index status checking and warnings
 * - Multi-backend search routing with RRF ranking
 *
 * Actions:
 * - init: Initialize CodexLens index
 * - search: Intelligent search with auto mode selection
 * - status: Check index status
 */
 import { z } from 'zod';
@@ -19,19 +24,23 @@ import {
 // Define Zod schema for validation
 const ParamsSchema = z.object({
-  query: z.string().min(1, 'Query is required'),
+  action: z.enum(['init', 'search', 'search_files', 'status']).default('search'),
-  mode: z.enum(['auto', 'exact', 'fuzzy', 'semantic', 'graph']).default('auto'),
+  query: z.string().optional(),
  mode: z.enum(['auto', 'hybrid', 'exact', 'ripgrep']).default('auto'),
  output_mode: z.enum(['full', 'files_only', 'count']).default('full'),
  path: z.string().optional(),
  paths: z.array(z.string()).default([]),
  contextLines: z.number().default(0),
  maxResults: z.number().default(100),
  includeHidden: z.boolean().default(false),
  languages: z.array(z.string()).optional(),
  limit: z.number().default(100),
 });
 type Params = z.infer<typeof ParamsSchema>;
 // Search mode constants
-const SEARCH_MODES = ['auto', 'exact', 'fuzzy', 'semantic', 'graph'] as const;
+const SEARCH_MODES = ['auto', 'hybrid', 'exact', 'ripgrep'] as const;
 // Classification confidence threshold
 const CONFIDENCE_THRESHOLD = 0.7;
@@ -70,16 +79,89 @@ interface SearchMetadata {
  classified_as?: string;
  confidence?: number;
  reasoning?: string;
  embeddings_coverage_percent?: number;
  warning?: string;
  note?: string;
  index_status?: 'indexed' | 'not_indexed' | 'partial';
 }
 interface SearchResult {
  success: boolean;
-  results?: ExactMatch[] | SemanticMatch[] | GraphMatch[];
+  results?: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown;
  output?: string;
  metadata?: SearchMetadata;
  error?: string;
  status?: unknown;
  message?: string;
 }
 interface IndexStatus {
  indexed: boolean;
  has_embeddings: boolean;
  file_count?: number;
  embeddings_coverage_percent?: number;
  warning?: string;
 }
 /**
 * Check if CodexLens index exists for current directory
 * @param path - Directory path to check
 * @returns Index status
 */
 async function checkIndexStatus(path: string = '.'): Promise<IndexStatus> {
  try {
    const result = await executeCodexLens(['status', '--json'], { cwd: path });
    if (!result.success) {
      return {
        indexed: false,
        has_embeddings: false,
        warning: 'No CodexLens index found. Run smart_search(action="init") to create index for better search results.',
      };
    }
    // Parse status output
    try {
      // Strip ANSI color codes from JSON output
      const cleanOutput = (result.output || '{}').replace(/\x1b\[[0-9;]*m/g, '');
      const status = JSON.parse(cleanOutput);
      const indexed = status.indexed === true || status.file_count > 0;
      // Get embeddings coverage from comprehensive status
      const embeddingsData = status.embeddings || {};
      const embeddingsCoverage = embeddingsData.coverage_percent || 0;
      const has_embeddings = embeddingsCoverage >= 50; // Threshold: 50%
      let warning: string | undefined;
      if (!indexed) {
        warning = 'No CodexLens index found. Run smart_search(action="init") to create index for better search results.';
      } else if (embeddingsCoverage === 0) {
        warning = 'Index exists but no embeddings generated. Run: codexlens embeddings-generate --recursive';
      } else if (embeddingsCoverage < 50) {
        warning = `Embeddings coverage is ${embeddingsCoverage.toFixed(1)}% (below 50%). Hybrid search will use exact mode. Run: codexlens embeddings-generate --recursive`;
      }
      return {
        indexed,
        has_embeddings,
        file_count: status.file_count,
        embeddings_coverage_percent: embeddingsCoverage,
        warning,
      };
    } catch {
      return {
        indexed: false,
        has_embeddings: false,
        warning: 'Failed to parse index status',
      };
    }
  } catch {
    return {
      indexed: false,
      has_embeddings: false,
      warning: 'CodexLens not available',
    };
  }
 }
 /**
@@ -123,43 +205,34 @@ function detectRelationship(query: string): boolean {
 /**
 * Classify query intent and recommend search mode
 * Simple mapping: hybrid (NL + index + embeddings) | exact (index or insufficient embeddings) | ripgrep (no index)
 * @param query - Search query string
 * @param hasIndex - Whether CodexLens index exists
 * @param hasSufficientEmbeddings - Whether embeddings coverage >= 50%
 * @returns Classification result
 */
-function classifyIntent(query: string): Classification {
+function classifyIntent(query: string, hasIndex: boolean = false, hasSufficientEmbeddings: boolean = false): Classification {
-  // Initialize mode scores
+  // Detect query patterns
-  const scores: Record<string, number> = {
+  const isNaturalLanguage = detectNaturalLanguage(query);
    exact: 0,
    fuzzy: 0,
    semantic: 0,
    graph: 0,
  };
-  // Apply detection heuristics with weighted scoring
+  // Simple decision tree
-  if (detectLiteral(query)) {
+  let mode: string;
-    scores.exact += 0.8;
+  let confidence: number;
  if (!hasIndex) {
    // No index: use ripgrep
    mode = 'ripgrep';
    confidence = 1.0;
  } else if (isNaturalLanguage && hasSufficientEmbeddings) {
    // Natural language + sufficient embeddings: use hybrid
    mode = 'hybrid';
    confidence = 0.9;
  } else {
    // Simple query OR insufficient embeddings: use exact
    mode = 'exact';
    confidence = 0.8;
  }
  if (detectRegex(query)) {
    scores.fuzzy += 0.7;
  }
  if (detectNaturalLanguage(query)) {
    scores.semantic += 0.9;
  }
  if (detectFilePath(query)) {
    scores.exact += 0.6;
  }
  if (detectRelationship(query)) {
    scores.graph += 0.85;
  }
  // Find mode with highest confidence score
  const mode = Object.keys(scores).reduce((a, b) => (scores[a] > scores[b] ? a : b));
  const confidence = scores[mode];
  // Build reasoning string
  const detectedPatterns: string[] = [];
  if (detectLiteral(query)) detectedPatterns.push('literal');
@@ -168,7 +241,7 @@ function classifyIntent(query: string): Classification {
  if (detectFilePath(query)) detectedPatterns.push('file path');
  if (detectRelationship(query)) detectedPatterns.push('relationship');
-  const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')})`;
+  const reasoning = `Query classified as ${mode} (confidence: ${confidence.toFixed(2)}, detected: ${detectedPatterns.join(', ')}, index: ${hasIndex ? 'available' : 'not available'}, embeddings: ${hasSufficientEmbeddings ? 'sufficient' : 'insufficient'})`;
  return { mode, confidence, reasoning };
 }
@@ -234,105 +307,192 @@ function buildRipgrepCommand(params: {
 }
 /**
- * Mode: auto - Intent classification and mode selection
+ * Action: init - Initialize CodexLens index
 * Analyzes query to determine optimal search mode
 */
-async function executeAutoMode(params: Params): Promise<SearchResult> {
+async function executeInitAction(params: Params): Promise<SearchResult> {
-  const { query } = params;
+  const { path = '.', languages } = params;
-  // Classify intent
+  // Check CodexLens availability
-  const classification = classifyIntent(query);
+  const readyStatus = await ensureCodexLensReady();
-
+  if (!readyStatus.ready) {
  // Route to appropriate mode based on classification
  switch (classification.mode) {
    case 'exact': {
      const exactResult = await executeExactMode(params);
      return {
        ...exactResult,
        metadata: {
          ...exactResult.metadata!,
          classified_as: classification.mode,
          confidence: classification.confidence,
          reasoning: classification.reasoning,
        },
      };
    }
    case 'fuzzy':
    return {
      success: false,
-        error: 'Fuzzy mode not yet implemented',
+      error: `CodexLens not available: ${readyStatus.error}. CodexLens will be auto-installed on first use.`,
        metadata: {
          mode: 'fuzzy',
          backend: '',
          count: 0,
          query,
          classified_as: classification.mode,
          confidence: classification.confidence,
          reasoning: classification.reasoning,
        },
      };
    case 'semantic': {
      const semanticResult = await executeSemanticMode(params);
      return {
        ...semanticResult,
        metadata: {
          ...semanticResult.metadata!,
          classified_as: classification.mode,
          confidence: classification.confidence,
          reasoning: classification.reasoning,
        },
    };
  }
-    case 'graph': {
+  const args = ['init', path];
-      const graphResult = await executeGraphMode(params);
+  if (languages && languages.length > 0) {
-      return {
+    args.push('--languages', languages.join(','));
        ...graphResult,
        metadata: {
          ...graphResult.metadata!,
          classified_as: classification.mode,
          confidence: classification.confidence,
          reasoning: classification.reasoning,
        },
      };
  }
-    default: {
+  const result = await executeCodexLens(args, { cwd: path, timeout: 300000 });
-      const fallbackResult = await executeExactMode(params);
+
  return {
-        ...fallbackResult,
+    success: result.success,
-        metadata: {
+    error: result.error,
-          ...fallbackResult.metadata!,
+    message: result.success
-          classified_as: 'exact',
+      ? `CodexLens index created successfully for ${path}`
-          confidence: 0.5,
+      : undefined,
          reasoning: 'Fallback to exact mode due to unknown classification',
        },
  };
 }
  }
 }
 /**
- * Mode: exact - Precise file path and content matching
+ * Action: status - Check CodexLens index status
 * Uses ripgrep for literal string matching
 */
-async function executeExactMode(params: Params): Promise<SearchResult> {
+async function executeStatusAction(params: Params): Promise<SearchResult> {
-  const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false } = params;
+  const { path = '.' } = params;
  const indexStatus = await checkIndexStatus(path);
  // Check ripgrep availability
  if (!checkToolAvailability('rg')) {
  return {
-      success: false,
+    success: true,
-      error: 'ripgrep not available - please install ripgrep (rg) to use exact search mode',
+    status: indexStatus,
    message: indexStatus.warning || `Index status: ${indexStatus.indexed ? 'indexed' : 'not indexed'}, embeddings: ${indexStatus.has_embeddings ? 'available' : 'not available'}`,
  };
 }
-  // Build ripgrep command
+/**
 * Mode: auto - Intent classification and mode selection
 * Routes to: hybrid (NL + index) | exact (index) | ripgrep (no index)
 */
 async function executeAutoMode(params: Params): Promise<SearchResult> {
  const { query, path = '.' } = params;
  if (!query) {
    return {
      success: false,
      error: 'Query is required for search action',
    };
  }
  // Check index status
  const indexStatus = await checkIndexStatus(path);
  // Classify intent with index and embeddings awareness
  const classification = classifyIntent(
    query, 
    indexStatus.indexed, 
    indexStatus.has_embeddings  // This now considers 50% threshold
  );
  // Route to appropriate mode based on classification
  let result: SearchResult;
  switch (classification.mode) {
    case 'hybrid':
      result = await executeHybridMode(params);
      break;
    case 'exact':
      result = await executeCodexLensExactMode(params);
      break;
    case 'ripgrep':
      result = await executeRipgrepMode(params);
      break;
    default:
      // Fallback to ripgrep
      result = await executeRipgrepMode(params);
      break;
  }
  // Add classification metadata
  if (result.metadata) {
    result.metadata.classified_as = classification.mode;
    result.metadata.confidence = classification.confidence;
    result.metadata.reasoning = classification.reasoning;
    result.metadata.embeddings_coverage_percent = indexStatus.embeddings_coverage_percent;
    result.metadata.index_status = indexStatus.indexed
      ? (indexStatus.has_embeddings ? 'indexed' : 'partial')
      : 'not_indexed';
    // Add warning if needed
    if (indexStatus.warning) {
      result.metadata.warning = indexStatus.warning;
    }
  }
  return result;
 }
 /**
 * Mode: ripgrep - Fast literal string matching using ripgrep
 * No index required, fallback to CodexLens if ripgrep unavailable
 */
 async function executeRipgrepMode(params: Params): Promise<SearchResult> {
  const { query, paths = [], contextLines = 0, maxResults = 100, includeHidden = false, path = '.' } = params;
  if (!query) {
    return {
      success: false,
      error: 'Query is required for search',
    };
  }
  // Check if ripgrep is available
  const hasRipgrep = checkToolAvailability('rg');
  // If ripgrep not available, fall back to CodexLens exact mode
  if (!hasRipgrep) {
    const readyStatus = await ensureCodexLensReady();
    if (!readyStatus.ready) {
      return {
        success: false,
        error: 'Neither ripgrep nor CodexLens available. Install ripgrep (rg) or CodexLens for search functionality.',
      };
    }
    // Use CodexLens exact mode as fallback
    const args = ['search', query, '--limit', maxResults.toString(), '--mode', 'exact', '--json'];
    const result = await executeCodexLens(args, { cwd: path });
    if (!result.success) {
      return {
        success: false,
        error: result.error,
        metadata: {
          mode: 'ripgrep',
          backend: 'codexlens-fallback',
          count: 0,
          query,
        },
      };
    }
    // Parse results
    let results: SemanticMatch[] = [];
    try {
      const parsed = JSON.parse(result.output || '{}');
      const data = parsed.results || parsed;
      results = (Array.isArray(data) ? data : []).map((item: any) => ({
        file: item.path || item.file,
        score: item.score || 0,
        content: item.excerpt || item.content || '',
        symbol: item.symbol || null,
      }));
    } catch {
      // Keep empty results
    }
    return {
      success: true,
      results,
      metadata: {
        mode: 'ripgrep',
        backend: 'codexlens-fallback',
        count: results.length,
        query,
        note: 'Using CodexLens exact mode (ripgrep not available)',
      },
    };
  }
  // Use ripgrep
  const { command, args } = buildRipgrepCommand({
    query,
-    paths: paths.length > 0 ? paths : ['.'],
+    paths: paths.length > 0 ? paths : [path],
    contextLines,
    maxResults,
    includeHidden,
@@ -340,7 +500,7 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, {
-      cwd: process.cwd(),
+      cwd: path || process.cwd(),
      stdio: ['ignore', 'pipe', 'pipe'],
    });
@@ -386,7 +546,7 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
          success: true,
          results,
          metadata: {
-            mode: 'exact',
+            mode: 'ripgrep',
            backend: 'ripgrep',
            count: results.length,
            query,
@@ -412,60 +572,126 @@ async function executeExactMode(params: Params): Promise<SearchResult> {
 }
 /**
- * Mode: fuzzy - Approximate matching with tolerance
+ * Mode: exact - CodexLens exact/FTS search
- * Uses fuzzy matching algorithms for typo-tolerant search
+ * Requires index
 */
-async function executeFuzzyMode(params: Params): Promise<SearchResult> {
+async function executeCodexLensExactMode(params: Params): Promise<SearchResult> {
  const { query, path = '.', limit = 100 } = params;
  if (!query) {
    return {
      success: false,
-    error: 'Fuzzy mode not implemented - fuzzy matching engine pending',
+      error: 'Query is required for search',
    };
  }
 /**
 * Mode: semantic - Natural language understanding search
 * Uses CodexLens embeddings for semantic similarity
 */
 async function executeSemanticMode(params: Params): Promise<SearchResult> {
  const { query, paths = [], maxResults = 100 } = params;
  // Check CodexLens availability
  const readyStatus = await ensureCodexLensReady();
  if (!readyStatus.ready) {
    return {
      success: false,
-      error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`,
+      error: `CodexLens not available: ${readyStatus.error}`,
    };
  }
-  // Determine search path
+  // Check index status
-  const searchPath = paths.length > 0 ? paths[0] : '.';
+  const indexStatus = await checkIndexStatus(path);
-  // Execute CodexLens semantic search
+  const args = ['search', query, '--limit', limit.toString(), '--mode', 'exact', '--json'];
-  const result = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], {
+  const result = await executeCodexLens(args, { cwd: path });
    cwd: searchPath,
  });
  if (!result.success) {
    return {
      success: false,
      error: result.error,
      metadata: {
-        mode: 'semantic',
+        mode: 'exact',
        backend: 'codexlens',
        count: 0,
        query,
        warning: indexStatus.warning,
      },
    };
  }
-  // Parse and transform results
+  // Parse results
  let results: SemanticMatch[] = [];
  try {
-    const cleanOutput = result.output!.replace(/\r\n/g, '\n');
+    const parsed = JSON.parse(result.output || '{}');
-    const parsed = JSON.parse(cleanOutput);
+    const data = parsed.results || parsed;
-    const data = parsed.result || parsed;
+    results = (Array.isArray(data) ? data : []).map((item: any) => ({
-    results = (data.results || []).map((item: any) => ({
+      file: item.path || item.file,
      score: item.score || 0,
      content: item.excerpt || item.content || '',
      symbol: item.symbol || null,
    }));
  } catch {
    // Keep empty results
  }
  return {
    success: true,
    results,
    metadata: {
      mode: 'exact',
      backend: 'codexlens',
      count: results.length,
      query,
      warning: indexStatus.warning,
    },
  };
 }
 /**
 * Mode: hybrid - Best quality search with RRF fusion
 * Uses CodexLens hybrid mode (exact + fuzzy + vector)
 * Requires index with embeddings
 */
 async function executeHybridMode(params: Params): Promise<SearchResult> {
  const { query, path = '.', limit = 100 } = params;
  if (!query) {
    return {
      success: false,
      error: 'Query is required for search',
    };
  }
  // Check CodexLens availability
  const readyStatus = await ensureCodexLensReady();
  if (!readyStatus.ready) {
    return {
      success: false,
      error: `CodexLens not available: ${readyStatus.error}`,
    };
  }
  // Check index status
  const indexStatus = await checkIndexStatus(path);
  const args = ['search', query, '--limit', limit.toString(), '--mode', 'hybrid', '--json'];
  const result = await executeCodexLens(args, { cwd: path });
  if (!result.success) {
    return {
      success: false,
      error: result.error,
      metadata: {
        mode: 'hybrid',
        backend: 'codexlens',
        count: 0,
        query,
        warning: indexStatus.warning,
      },
    };
  }
  // Parse results
  let results: SemanticMatch[] = [];
  try {
    const parsed = JSON.parse(result.output || '{}');
    const data = parsed.results || parsed;
    results = (Array.isArray(data) ? data : []).map((item: any) => ({
      file: item.path || item.file,
      score: item.score || 0,
      content: item.excerpt || item.content || '',
@@ -477,11 +703,11 @@ async function executeSemanticMode(params: Params): Promise<SearchResult> {
      results: [],
      output: result.output,
      metadata: {
-        mode: 'semantic',
+        mode: 'hybrid',
        backend: 'codexlens',
        count: 0,
        query,
-        warning: 'Failed to parse JSON output',
+        warning: indexStatus.warning || 'Failed to parse JSON output',
      },
    };
  }
@@ -490,105 +716,12 @@ async function executeSemanticMode(params: Params): Promise<SearchResult> {
    success: true,
    results,
    metadata: {
-      mode: 'semantic',
+      mode: 'hybrid',
      backend: 'codexlens',
      count: results.length,
      query,
-    },
+      note: 'Hybrid mode uses RRF fusion (exact + fuzzy + vector) for best results',
-  };
+      warning: indexStatus.warning,
 }
 /**
 * Mode: graph - Dependency and relationship traversal
 * Uses CodexLens symbol extraction for code analysis
 */
 async function executeGraphMode(params: Params): Promise<SearchResult> {
  const { query, paths = [], maxResults = 100 } = params;
  // Check CodexLens availability
  const readyStatus = await ensureCodexLensReady();
  if (!readyStatus.ready) {
    return {
      success: false,
      error: `CodexLens not available: ${readyStatus.error}. Run 'ccw tool exec codex_lens {"action":"bootstrap"}' to install.`,
    };
  }
  // First, search for relevant files using text search
  const searchPath = paths.length > 0 ? paths[0] : '.';
  const textResult = await executeCodexLens(['search', query, '--limit', maxResults.toString(), '--json'], {
    cwd: searchPath,
  });
  if (!textResult.success) {
    return {
      success: false,
      error: textResult.error,
      metadata: {
        mode: 'graph',
        backend: 'codexlens',
        count: 0,
        query,
      },
    };
  }
  // Parse results and extract symbols from top files
  let results: GraphMatch[] = [];
  try {
    const parsed = JSON.parse(textResult.output!);
    const files = [...new Set((parsed.results || parsed).map((item: any) => item.path || item.file))].slice(
      0,
      10
    );
    // Extract symbols from files in parallel
    const symbolPromises = files.map((file) =>
      executeCodexLens(['symbol', file as string, '--json'], { cwd: searchPath }).then((result) => ({
        file,
        result,
      }))
    );
    const symbolResults = await Promise.all(symbolPromises);
    for (const { file, result } of symbolResults) {
      if (result.success) {
        try {
          const symbols = JSON.parse(result.output!);
          results.push({
            file: file as string,
            symbols: symbols.symbols || symbols,
            relationships: [],
          });
        } catch {
          // Skip files with parse errors
        }
      }
    }
  } catch {
    return {
      success: false,
      error: 'Failed to parse search results',
      metadata: {
        mode: 'graph',
        backend: 'codexlens',
        count: 0,
        query,
      },
    };
  }
  return {
    success: true,
    results,
    metadata: {
      mode: 'graph',
      backend: 'codexlens',
      count: results.length,
      query,
      note: 'Graph mode provides symbol extraction; full dependency graph analysis pending',
    },
  };
 }
@@ -596,36 +729,73 @@ async function executeGraphMode(params: Params): Promise<SearchResult> {
 // Tool schema for MCP
 export const schema: ToolSchema = {
  name: 'smart_search',
-  description: `Intelligent code search with multiple modes.
+  description: `Intelligent code search with three optimized modes: hybrid, exact, ripgrep.
-Usage:
+**Quick Start:**
-  smart_search(query="function main", path=".")           # Auto-select mode
+  smart_search(query="authentication logic")           # Auto mode (intelligent routing)
-  smart_search(query="def init", mode="exact")            # Exact match
+  smart_search(action="init", path=".")                # Initialize index (required for hybrid)
-  smart_search(query="authentication logic", mode="semantic")  # NL search
+  smart_search(action="status")                        # Check index status
-Modes: auto (default), exact, fuzzy, semantic, graph`,
+**Modes (auto routing plus three core modes):**
  1. auto (default): Intelligent routing based on query and index
     - Natural language + index → hybrid
     - Simple query + index → exact
     - No index → ripgrep
  2. hybrid: CodexLens RRF fusion (exact + fuzzy + vector)
     - Best quality, semantic understanding
     - Requires index with embeddings
  3. exact: CodexLens FTS (full-text search)
     - Precise keyword matching
     - Requires index
  4. ripgrep: Direct ripgrep execution
     - Fast, no index required
     - Literal string matching
 **Actions:**
  - search (default): Intelligent search with auto routing
  - init: Create CodexLens index (required for hybrid/exact)
  - status: Check index and embedding availability
  - search_files: Return file paths only
 **Workflow:**
  1. Run action="init" to create index
  2. Use auto mode - it routes to hybrid for NL queries, exact for simple queries
  3. Use ripgrep mode for fast searches without index`,
  inputSchema: {
    type: 'object',
    properties: {
      action: {
        type: 'string',
        enum: ['init', 'search', 'search_files', 'status'],
        description: 'Action to perform: init (create index), search (default), search_files (paths only), status (check index)',
        default: 'search',
      },
      query: {
        type: 'string',
-        description: 'Search query (file pattern, text content, or natural language)',
+        description: 'Search query (required for search/search_files actions)',
      },
      mode: {
        type: 'string',
        enum: SEARCH_MODES,
-        description: 'Search mode (default: auto)',
+        description: 'Search mode: auto (default), hybrid (best quality), exact (CodexLens FTS), ripgrep (fast, no index)',
        default: 'auto',
      },
      output_mode: {
        type: 'string',
        enum: ['full', 'files_only', 'count'],
-        description: 'Output mode: full (default), files_only (paths only), count (per-file counts)',
+        description: 'Output format: full (default), files_only (paths only), count (per-file counts)',
        default: 'full',
      },
      path: {
        type: 'string',
        description: 'Directory path for init/search actions (default: current directory)',
      },
      paths: {
        type: 'array',
-        description: 'Paths to search within (default: current directory)',
+        description: 'Multiple paths to search within (for search action)',
        items: {
          type: 'string',
        },
@@ -633,21 +803,31 @@ Modes: auto (default), exact, fuzzy, semantic, graph`,
      },
      contextLines: {
        type: 'number',
-        description: 'Number of context lines around matches (default: 0)',
+        description: 'Number of context lines around matches (ripgrep mode only)',
        default: 0,
      },
      maxResults: {
        type: 'number',
-        description: 'Maximum number of results to return (default: 100)',
+        description: 'Maximum number of results (default: 100)',
        default: 100,
      },
      limit: {
        type: 'number',
        description: 'Alias for maxResults',
        default: 100,
      },
      includeHidden: {
        type: 'boolean',
-        description: 'Include hidden files/directories (default: false)',
+        description: 'Include hidden files/directories',
        default: false,
      },
      languages: {
        type: 'array',
        items: { type: 'string' },
        description: 'Languages to index (for init action). Example: ["javascript", "typescript"]',
      },
-    required: ['query'],
+    },
    required: [],
  },
 };
@@ -655,20 +835,27 @@ Modes: auto (default), exact, fuzzy, semantic, graph`,
 * Transform results based on output_mode
 */
 function transformOutput(
-  results: ExactMatch[] | SemanticMatch[] | GraphMatch[],
+  results: ExactMatch[] | SemanticMatch[] | GraphMatch[] | unknown[],
  outputMode: 'full' | 'files_only' | 'count'
 ): unknown {
  if (!Array.isArray(results)) {
    return results;
  }
  switch (outputMode) {
    case 'files_only': {
      // Extract unique file paths
-      const files = [...new Set(results.map((r) => r.file))];
+      const files = [...new Set(results.map((r: any) => r.file))].filter(Boolean);
      return { files, count: files.length };
    }
    case 'count': {
      // Count matches per file
      const counts: Record<string, number> = {};
      for (const r of results) {
-        counts[r.file] = (counts[r.file] || 0) + 1;
+        const file = (r as any).file;
        if (file) {
          counts[file] = (counts[file] || 0) + 1;
        }
      }
      return {
        files: Object.entries(counts).map(([file, count]) => ({ file, count })),
@@ -688,34 +875,58 @@ export async function handler(params: Record<string, unknown>): Promise<ToolResu
    return { success: false, error: `Invalid params: ${parsed.error.message}` };
  }
-  const { mode, output_mode } = parsed.data;
+  const { action, mode, output_mode, limit, maxResults } = parsed.data;
  // Use limit if maxResults not provided
  if (limit && !maxResults) {
    parsed.data.maxResults = limit;
  }
  try {
    let result: SearchResult;
    // Handle actions
    switch (action) {
      case 'init':
        result = await executeInitAction(parsed.data);
        break;
      case 'status':
        result = await executeStatusAction(parsed.data);
        break;
      case 'search_files':
        // For search_files, use search mode but force files_only output
        parsed.data.output_mode = 'files_only';
        // Fall through to search
      case 'search':
      default:
        // Handle search modes: auto | hybrid | exact | ripgrep
        switch (mode) {
          case 'auto':
            result = await executeAutoMode(parsed.data);
            break;
          case 'hybrid':
            result = await executeHybridMode(parsed.data);
            break;
          case 'exact':
-        result = await executeExactMode(parsed.data);
+            result = await executeCodexLensExactMode(parsed.data);
            break;
-      case 'fuzzy':
+          case 'ripgrep':
-        result = await executeFuzzyMode(parsed.data);
+            result = await executeRipgrepMode(parsed.data);
        break;
      case 'semantic':
        result = await executeSemanticMode(parsed.data);
        break;
      case 'graph':
        result = await executeGraphMode(parsed.data);
            break;
          default:
-        throw new Error(`Unsupported mode: ${mode}`);
+            throw new Error(`Unsupported mode: ${mode}. Use: auto, hybrid, exact, or ripgrep`);
        }
        break;
    }
-    // Transform output based on output_mode
+    // Transform output based on output_mode (for search actions only)
    if (action === 'search' || action === 'search_files') {
      if (result.success && result.results && output_mode !== 'full') {
-      result.results = transformOutput(result.results, output_mode) as typeof result.results;
+        result.results = transformOutput(result.results as any[], output_mode);
      }
    }
    return result.success ? { success: true, result } : { success: false, error: result.error };
--- a/codex-lens/src/codexlens/cli/commands.py
+++ b/codex-lens/src/codexlens/cli/commands.py
@@ -142,11 +142,11 @@ def init(
        if not no_embeddings:
            try:
                from codexlens.semantic import SEMANTIC_AVAILABLE
-                from codexlens.cli.embedding_manager import generate_embeddings
+                from codexlens.cli.embedding_manager import generate_embeddings_recursive, get_embeddings_status
                if SEMANTIC_AVAILABLE:
-                    # Find the index file
+                    # Use the index root directory (not the _index.db file)
-                    index_path = Path(build_result.index_root) / "_index.db"
+                    index_root = Path(build_result.index_root)
                    if not json_mode:
                        console.print("\n[bold]Generating embeddings...[/bold]")
@@ -157,8 +157,8 @@ def init(
                        if not json_mode and verbose:
                            console.print(f"  {msg}")
-                    embed_result = generate_embeddings(
+                    embed_result = generate_embeddings_recursive(
-                        index_path,
+                        index_root,
                        model_profile=embedding_model,
                        force=False,  # Don't force regenerate during init
                        chunk_size=2000,
@@ -167,29 +167,56 @@ def init(
                    if embed_result["success"]:
                        embed_data = embed_result["result"]
-                        result["embeddings_generated"] = True
+                        
-                        result["embeddings_count"] = embed_data["chunks_embedded"]
+                        # Get comprehensive coverage statistics
                        status_result = get_embeddings_status(index_root)
                        if status_result["success"]:
                            coverage = status_result["result"]
                            result["embeddings"] = {
                                "generated": True,
                                "total_indexes": coverage["total_indexes"],
                                "total_files": coverage["total_files"],
                                "files_with_embeddings": coverage["files_with_embeddings"],
                                "coverage_percent": coverage["coverage_percent"],
                                "total_chunks": coverage["total_chunks"],
                            }
                        else:
                            result["embeddings"] = {
                                "generated": True,
                                "total_chunks": embed_data["total_chunks_created"],
                                "files_processed": embed_data["total_files_processed"],
                            }
                        if not json_mode:
-                            console.print(f"[green]✓[/green] Generated [bold]{embed_data['chunks_embedded']}[/bold] embeddings in {embed_data['elapsed_time']:.1f}s")
+                            console.print(f"[green]✓[/green] Generated embeddings for [bold]{embed_data['total_files_processed']}[/bold] files")
                            console.print(f"  Total chunks: [bold]{embed_data['total_chunks_created']}[/bold]")
                            console.print(f"  Indexes processed: [bold]{embed_data['indexes_successful']}/{embed_data['indexes_processed']}[/bold]")
                    else:
                        if not json_mode:
                            console.print(f"[yellow]Warning:[/yellow] Embedding generation failed: {embed_result.get('error', 'Unknown error')}")
-                        result["embeddings_generated"] = False
+                        result["embeddings"] = {
-                        result["embeddings_error"] = embed_result.get("error")
+                            "generated": False,
                            "error": embed_result.get("error"),
                        }
                else:
                    if not json_mode and verbose:
                        console.print("[dim]Semantic search not available. Skipping embeddings.[/dim]")
-                    result["embeddings_generated"] = False
+                    result["embeddings"] = {
-                    result["embeddings_error"] = "Semantic dependencies not installed"
+                        "generated": False,
                        "error": "Semantic dependencies not installed",
                    }
            except Exception as e:
                if not json_mode and verbose:
                    console.print(f"[yellow]Warning:[/yellow] Could not generate embeddings: {e}")
-                result["embeddings_generated"] = False
+                result["embeddings"] = {
-                result["embeddings_error"] = str(e)
+                    "generated": False,
                    "error": str(e),
                }
        else:
-            result["embeddings_generated"] = False
+            result["embeddings"] = {
-            result["embeddings_error"] = "Skipped (--no-embeddings)"
+                "generated": False,
                "error": "Skipped (--no-embeddings)",
            }
    except StorageError as exc:
        if json_mode:
@@ -611,6 +638,24 @@ def status(
                except Exception:
                    pass
        # Check embeddings coverage
        embeddings_info = None
        has_vector_search = False
        try:
            from codexlens.cli.embedding_manager import get_embeddings_status
            if index_root.exists():
                embed_status = get_embeddings_status(index_root)
                if embed_status["success"]:
                    embeddings_info = embed_status["result"]
                    # Enable vector search if coverage >= 50%
                    has_vector_search = embeddings_info["coverage_percent"] >= 50.0
        except ImportError:
            # Embedding manager not available
            pass
        except Exception as e:
            logger.debug(f"Failed to get embeddings status: {e}")
        stats = {
            "index_root": str(index_root),
            "registry_path": str(_get_registry_path()),
@@ -624,10 +669,14 @@ def status(
                "exact_fts": True,  # Always available
                "fuzzy_fts": has_dual_fts,
                "hybrid_search": has_dual_fts,
-                "vector_search": False,  # Not yet implemented
+                "vector_search": has_vector_search,
            },
        }
        # Add embeddings info if available
        if embeddings_info:
            stats["embeddings"] = embeddings_info
        if json_mode:
            print_json(success=True, result=stats)
        else:
@@ -648,7 +697,20 @@ def status(
            else:
                console.print(f"  Fuzzy FTS: ✗ (run 'migrate' to enable)")
                console.print(f"  Hybrid Search: ✗ (run 'migrate' to enable)")
-            console.print(f"  Vector Search: ✗ (future)")
+            
            if has_vector_search:
                console.print(f"  Vector Search: ✓ (embeddings available)")
            else:
                console.print(f"  Vector Search: ✗ (no embeddings or coverage < 50%)")
            # Display embeddings statistics if available
            if embeddings_info:
                console.print("\n[bold]Embeddings Coverage:[/bold]")
                console.print(f"  Total Indexes: {embeddings_info['total_indexes']}")
                console.print(f"  Total Files: {embeddings_info['total_files']}")
                console.print(f"  Files with Embeddings: {embeddings_info['files_with_embeddings']}")
                console.print(f"  Coverage: {embeddings_info['coverage_percent']:.1f}%")
                console.print(f"  Total Chunks: {embeddings_info['total_chunks']}")
    except StorageError as exc:
        if json_mode:
@@ -1885,6 +1947,12 @@ def embeddings_generate(
        "--chunk-size",
        help="Maximum chunk size in characters.",
    ),
    recursive: bool = typer.Option(
        False,
        "--recursive",
        "-r",
        help="Recursively process all _index.db files in directory tree.",
    ),
    json_mode: bool = typer.Option(False, "--json", help="Output JSON response."),
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose output."),
 ) -> None:
@@ -1908,16 +1976,30 @@ def embeddings_generate(
    _configure_logging(verbose)
    try:
-        from codexlens.cli.embedding_manager import generate_embeddings
+        from codexlens.cli.embedding_manager import generate_embeddings, generate_embeddings_recursive
        # Resolve path
        target_path = path.expanduser().resolve()
        # Determine if we should use recursive mode
        use_recursive = False
        index_path = None
        index_root = None
        if target_path.is_file() and target_path.name == "_index.db":
            # Direct index file
            index_path = target_path
            if recursive:
                # Use parent directory for recursive processing
                use_recursive = True
                index_root = target_path.parent
        elif target_path.is_dir():
-            # Try to find index for this project
+            if recursive:
                # Recursive mode: process all _index.db files in directory tree
                use_recursive = True
                index_root = target_path
            else:
                # Non-recursive: Try to find index for this project
                registry = RegistryStore()
                try:
                    registry.initialize()
@@ -1940,9 +2022,22 @@ def embeddings_generate(
                console.print(f"  {msg}")
        console.print(f"[bold]Generating embeddings[/bold]")
        if use_recursive:
            console.print(f"Index root: [dim]{index_root}[/dim]")
            console.print(f"Mode: [yellow]Recursive[/yellow]")
        else:
            console.print(f"Index: [dim]{index_path}[/dim]")
        console.print(f"Model: [cyan]{model}[/cyan]\n")
        if use_recursive:
            result = generate_embeddings_recursive(
                index_root,
                model_profile=model,
                force=force,
                chunk_size=chunk_size,
                progress_callback=progress_update,
            )
        else:
            result = generate_embeddings(
                index_path,
                model_profile=model,
@@ -1968,6 +2063,30 @@ def embeddings_generate(
                raise typer.Exit(code=1)
            data = result["result"]
            if use_recursive:
                # Recursive mode output
                console.print(f"[green]✓[/green] Recursive embeddings generation complete!")
                console.print(f"  Indexes processed: {data['indexes_processed']}")
                console.print(f"  Indexes successful: {data['indexes_successful']}")
                if data['indexes_failed'] > 0:
                    console.print(f"  [yellow]Indexes failed: {data['indexes_failed']}[/yellow]")
                console.print(f"  Total chunks created: {data['total_chunks_created']:,}")
                console.print(f"  Total files processed: {data['total_files_processed']}")
                if data['total_files_failed'] > 0:
                    console.print(f"  [yellow]Total files failed: {data['total_files_failed']}[/yellow]")
                console.print(f"  Model profile: {data['model_profile']}")
                # Show details if verbose
                if verbose and data.get('details'):
                    console.print("\n[dim]Index details:[/dim]")
                    for detail in data['details']:
                        status_icon = "[green]✓[/green]" if detail['success'] else "[red]✗[/red]"
                        console.print(f"  {status_icon} {detail['path']}")
                        if not detail['success'] and detail.get('error'):
                            console.print(f"    [dim]Error: {detail['error']}[/dim]")
            else:
                # Single index mode output
                elapsed = data["elapsed_time"]
                console.print(f"[green]✓[/green] Embeddings generated successfully!")
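The `status` hunk above enables `vector_search` only when `coverage_percent` is at least 50. The body of `get_embeddings_status()` is not fully visible in this view, so the following is only a rough sketch. It assumes coverage is the share of indexed files that have embeddings, expressed as a percentage, which matches the 5/303 and 303/303 figures in the commit message. The names `VECTOR_SEARCH_MIN_COVERAGE`, `coverage_percent`, and `vector_search_enabled` are hypothetical and do not appear in the commit.

```python
# Hypothetical constant mirroring the `>= 50.0` check in the status command.
VECTOR_SEARCH_MIN_COVERAGE = 50.0


def coverage_percent(files_with_embeddings: int, total_files: int) -> float:
    """Share of indexed files that have embeddings, as a percentage (assumed formula)."""
    if total_files <= 0:
        # Empty index: report 0% instead of dividing by zero.
        return 0.0
    return files_with_embeddings / total_files * 100.0


def vector_search_enabled(files_with_embeddings: int, total_files: int) -> bool:
    """True when coverage reaches the threshold, so exactly 50% still qualifies."""
    return coverage_percent(files_with_embeddings, total_files) >= VECTOR_SEARCH_MIN_COVERAGE
```

Under this assumption, the pre-fix state (5 of 303 files) stays below the threshold, while the post-fix state (303 of 303) clears it.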
--- a/codex-lens/src/codexlens/cli/embedding_manager.py
+++ b/codex-lens/src/codexlens/cli/embedding_manager.py
@@ -255,6 +255,21 @@ def generate_embeddings(
    }
 def discover_all_index_dbs(index_root: Path) -> List[Path]:
    """Recursively find all _index.db files in an index tree.
    Args:
        index_root: Root directory to scan for _index.db files
    Returns:
        Sorted list of paths to _index.db files
    """
    if not index_root.exists():
        return []
    return sorted(index_root.rglob("_index.db"))
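One way to check what the recursive discovery picks up is to run the same `rglob` pattern against a throwaway directory tree. The sketch below repeats `discover_all_index_dbs` so that it runs standalone. The directory layout and the helpers `build_demo_tree` and `discovered_relative_paths` are invented for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_all_index_dbs(index_root: Path) -> List[Path]:
    """Copy of the helper above, repeated so this snippet is self-contained."""
    if not index_root.exists():
        return []
    return sorted(index_root.rglob("_index.db"))


def build_demo_tree(root: Path) -> None:
    """Create a hypothetical layout: one root index plus two nested ones."""
    for rel in ("", "src", "src/utils"):
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "_index.db").touch()
    # A database with a different file name must not be picked up.
    (root / "src" / "other.db").touch()


def discovered_relative_paths() -> List[str]:
    """Build the demo tree in a temp dir and list discovered indexes relative to it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_tree(root)
        return [p.relative_to(root).as_posix() for p in discover_all_index_dbs(root)]
```

The root-level `_index.db` is returned together with the nested ones, which is the case the earlier root-only generation missed.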
 def find_all_indexes(scan_dir: Path) -> List[Path]:
    """Find all _index.db files in directory tree.
@@ -270,6 +285,146 @@ def find_all_indexes(scan_dir: Path) -> List[Path]:
    return list(scan_dir.rglob("_index.db"))
 def generate_embeddings_recursive(
    index_root: Path,
    model_profile: str = "code",
    force: bool = False,
    chunk_size: int = 2000,
    progress_callback: Optional[callable] = None,
 ) -> Dict[str, any]:
    """Generate embeddings for all index databases in a project recursively.
    Args:
        index_root: Root index directory containing _index.db files
        model_profile: Model profile (fast, code, multilingual, balanced)
        force: If True, regenerate even if embeddings exist
        chunk_size: Maximum chunk size in characters
        progress_callback: Optional callback for progress updates
    Returns:
        Aggregated result dictionary with generation statistics
    """
    # Discover all _index.db files
    index_files = discover_all_index_dbs(index_root)
    if not index_files:
        return {
            "success": False,
            "error": f"No index databases found in {index_root}",
        }
    if progress_callback:
        progress_callback(f"Found {len(index_files)} index databases to process")
    # Process each index database
    all_results = []
    total_chunks = 0
    total_files_processed = 0
    total_files_failed = 0
    for idx, index_path in enumerate(index_files, 1):
        if progress_callback:
            try:
                rel_path = index_path.relative_to(index_root)
            except ValueError:
                rel_path = index_path
            progress_callback(f"[{idx}/{len(index_files)}] Processing {rel_path}")
        result = generate_embeddings(
            index_path,
            model_profile=model_profile,
            force=force,
            chunk_size=chunk_size,
            progress_callback=None,  # Don't cascade callbacks
        )
        all_results.append({
            "path": str(index_path),
            "success": result["success"],
            "result": result.get("result"),
            "error": result.get("error"),
        })
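        # Only successful runs contribute to the aggregate totals; failed
        # indexes remain visible through their entry in all_results.
        if result["success"]:
            data = result["result"]
            total_chunks += data["chunks_created"]
            total_files_processed += data["files_processed"]
            total_files_failed += data["files_failed"]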
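    successful = sum(1 for r in all_results if r["success"])
    return {
        # The run counts as successful if at least one index was processed;
        # callers should inspect "indexes_failed" for partial failures.
        "success": successful > 0,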
        "result": {
            "indexes_processed": len(index_files),
            "indexes_successful": successful,
            "indexes_failed": len(index_files) - successful,
            "total_chunks_created": total_chunks,
            "total_files_processed": total_files_processed,
            "total_files_failed": total_files_failed,
            "model_profile": model_profile,
            "details": all_results,
        },
    }
def get_embeddings_status(index_root: Path) -> Dict[str, any]:
    """Get comprehensive embeddings coverage status for all indexes.
    Args:
        index_root: Root index directory
    Returns:
        Aggregated status with coverage statistics
    """
    index_files = discover_all_index_dbs(index_root)
    if not index_files:
        return {
            "success": True,
            "result": {
                "total_indexes": 0,
                "total_files": 0,
                "files_with_embeddings": 0,
                "files_without_embeddings": 0,
                "total_chunks": 0,
                "coverage_percent": 0.0,
                "indexes_with_embeddings": 0,
                "indexes_without_embeddings": 0,
            },
        }
    total_files = 0
    files_with_embeddings = 0
    total_chunks = 0
    indexes_with_embeddings = 0
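    for index_path in index_files:
        status = check_index_embeddings(index_path)
        # Indexes whose status check fails add nothing to the file totals and
        # are reported under "indexes_without_embeddings".
        if status["success"]: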
            result = status["result"]
            total_files += result["total_files"]
            files_with_embeddings += result["files_with_chunks"]
            total_chunks += result["total_chunks"]
            if result["has_embeddings"]:
                indexes_with_embeddings += 1
    return {
        "success": True,
        "result": {
            "total_indexes": len(index_files),
            "total_files": total_files,
            "files_with_embeddings": files_with_embeddings,
            "files_without_embeddings": total_files - files_with_embeddings,
            "total_chunks": total_chunks,
            # Guard against division by zero when no files are indexed.
            "coverage_percent": (
                round(files_with_embeddings / total_files * 100, 1)
                if total_files > 0
                else 0.0
            ),
            "indexes_with_embeddings": indexes_with_embeddings,
            "indexes_without_embeddings": len(index_files) - indexes_with_embeddings,
        },
    }
def get_embedding_stats_summary(index_root: Path) -> Dict[str, any]:
    """Get summary statistics for all indexes in root directory.