Commit Graph

32 Commits

Author SHA1 Message Date
catlog22
54fd94547c feat: Enhance embedding generation and search capabilities
- Added pre-calculation of estimated chunk count for HNSW capacity in `generate_dense_embeddings_centralized` to optimize indexing performance.
- Implemented binary vector generation with memory-mapped storage for efficient cascade search, including metadata saving.
- Introduced SPLADE sparse index generation with improved handling and metadata storage.
- Updated `ChainSearchEngine` to prefer centralized binary searcher for improved performance and added fallback to legacy binary index.
- Deprecated `BinaryANNIndex` in favor of `BinarySearcher` for better memory management and performance.
- Enhanced `SpladeEncoder` with warmup functionality to reduce latency spikes during first-time inference.
- Improved `SpladeIndex` with cache size adjustments for better query performance.
- Added methods for managing binary vectors in `VectorMetadataStore`, including batch insertion and retrieval.
- Created a new `BinarySearcher` class for efficient binary vector search using Hamming distance, supporting both memory-mapped and database loading modes.
2026-01-02 23:57:55 +08:00
catlog22
c268b531aa feat: Enhance embedding generation to track current index path and improve metadata retrieval 2026-01-02 19:18:26 +08:00
catlog22
0b6e9db8e4 feat: Add centralized vector storage and metadata management for embeddings 2026-01-02 17:18:23 +08:00
catlog22
92ed2524b7 feat: Enhance SPLADE indexing command to support multiple index databases and add chunk ID management 2026-01-02 13:25:23 +08:00
catlog22
e21d801523 feat: Add multi-type embedding backends for cascade retrieval
- Implemented BinaryEmbeddingBackend for fast coarse filtering using 256-dimensional binary vectors.
- Developed DenseEmbeddingBackend for high-precision dense vectors (2048 dimensions) for reranking.
- Created CascadeEmbeddingBackend to combine binary and dense embeddings for two-stage retrieval.
- Introduced utility functions for embedding conversion and distance computation.

chore: Migration 010 - Add multi-vector storage support

- Added 'chunks' table to support multi-vector embeddings for cascade retrieval.
- Included new columns: embedding_binary (256-dim) and embedding_dense (2048-dim) for efficient storage.
- Implemented upgrade and downgrade functions to manage schema changes and data migration.
2026-01-02 10:52:43 +08:00
catlog22
195438d26a feat(splade): add cache directory support for ONNX models and improve thread-local database connection handling 2026-01-01 22:40:00 +08:00
catlog22
5bb01755bc Implement SPLADE sparse encoder and associated database migrations
- Added `splade_encoder.py` for ONNX-optimized SPLADE encoding, including methods for encoding text and batch processing.
- Created `SPLADE_IMPLEMENTATION.md` to document the SPLADE encoder's functionality, design patterns, and integration points.
- Introduced migration script `migration_009_add_splade.py` to add SPLADE metadata and posting list tables to the database.
- Developed `splade_index.py` for managing the SPLADE inverted index, supporting efficient sparse vector retrieval.
- Added verification script `verify_watcher.py` to test FileWatcher event filtering and debouncing functionality.
2026-01-01 17:41:22 +08:00
catlog22
31a45f1f30 Add graph expansion and cross-encoder reranking features
- Implemented GraphExpander to enhance search results with related symbols using precomputed neighbors.
- Added CrossEncoderReranker for second-stage search ranking, allowing for improved result scoring.
- Created migrations to establish necessary database tables for relationships and graph neighbors.
- Developed tests for graph expansion functionality, ensuring related results are populated correctly.
- Enhanced performance benchmarks for cross-encoder reranking latency and graph expansion overhead.
- Updated schema cleanup tests to reflect changes in versioning and deprecated fields.
- Added new test cases for Treesitter parser to validate relationship extraction with alias resolution.
2025-12-31 16:58:59 +08:00
catlog22
3fdd52742b fix(storage): handle rollback failures in batch operations
Adds nested exception handling in add_files() and _migrate_fts_to_external()
to catch and log rollback failures. Uses exception chaining to preserve both
transaction and rollback errors, preventing silent database inconsistency.

Solution-ID: SOL-1735385400010
Issue-ID: ISS-1766921318981-10
Task-ID: T1
2025-12-29 19:08:49 +08:00
catlog22
5d5652c2c5 fix(sqlite-store): improve thread tracking in connection cleanup
Add fallback validation to detect dead threads missed by
threading.enumerate(), ensuring all stale connections are cleaned.

Solution-ID: SOL-1735392000002
Issue-ID: ISS-1766921318981-3
Task-ID: T2
2025-12-29 18:50:22 +08:00
catlog22
b958a1ea96 fix(sqlite-store): add periodic cleanup timer for connection pool
Implement background timer to proactively clean stale connections
every 5 minutes, preventing indefinite accumulation.

Solution-ID: SOL-1735392000002
Issue-ID: ISS-1766921318981-3
Task-ID: T1
2025-12-29 18:43:55 +08:00
catlog22
84d06f4273 fix(registry): normalize path case for comparison on Windows
Adds case normalization for path comparison on Windows to handle
case-insensitive filesystem behavior. Preserves case-sensitivity on Unix.

Fixes: ISS-1766921318981-13

Solution-ID: SOL-1735386000-13
Issue-ID: ISS-1766921318981-13
Task-ID: T1
2025-12-28 21:51:23 +08:00
catlog22
3b842ed290 feat(cli-executor): add streaming option and enhance output handling
- Introduced a `stream` parameter to control output streaming vs. caching.
- Enhanced status determination logic to prioritize valid output over exit codes.
- Updated output structure to include full stdout and stderr when not streaming.

feat(cli-history-store): extend conversation turn schema and migration

- Added `cached`, `stdout_full`, and `stderr_full` fields to the conversation turn schema.
- Implemented database migration to add new columns if they do not exist.
- Updated upsert logic to handle new fields.

feat(codex-lens): implement global symbol index for fast lookups

- Created `GlobalSymbolIndex` class to manage project-wide symbol indexing.
- Added methods for adding, updating, and deleting symbols in the global index.
- Integrated global index updates into directory indexing processes.

feat(codex-lens): optimize search functionality with global index

- Enhanced `ChainSearchEngine` to utilize the global symbol index for faster searches.
- Added configuration option to enable/disable global symbol indexing.
- Updated tests to validate global index functionality and performance.
2025-12-25 22:22:31 +08:00
catlog22
3cd842ca1a fix: ccw package.json removal - add root build script and fix cli.ts path resolution
- Fix cli.ts loadPackageInfo() to try root package.json first (../../package.json)
- Add build script and devDependencies to root package.json
- Remove ccw/package.json and ccw/package-lock.json (no longer needed)
- CodexLens: add config.json support for index_dir configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-23 10:25:15 +08:00
catlog22
e60d793c8c fix: 修复 SmartSearch 的 ripgrep limit 和 FTS 分词器问题
- Ripgrep 模式: 添加总结果数量限制,防止返回超过 2MB 数据
  - --max-count 只限制每个文件的匹配数,现在在收集结果时应用 limit
  - 达到限制时在 metadata 中添加 warning 提示

- FTS 分词器: 将点号(.)添加到 tokenchars,修复 PortRole.FLOW 等带点号标识符的精确搜索
  - 更新 dir_index.py 和 migration_004_dual_fts.py 中的 tokenize 配置
  - 需要重建索引才能生效

- Exact 模式: 添加 fuzzy 回退,当精确搜索无结果时自动尝试模糊搜索
  - 回退时在 metadata 中标注 fallback: 'fuzzy'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-22 09:50:29 +08:00
catlog22
3e9a309079 refactor: 移除图索引功能,修复内存泄露,优化嵌入生成
主要更改:

1. 移除图索引功能 (graph indexing)
   - 删除 graph_analyzer.py 及相关迁移文件
   - 移除 CLI 的 graph 命令和 --enrich 标志
   - 清理 chain_search.py 中的图查询方法 (370行)
   - 删除相关测试文件

2. 修复嵌入生成内存问题
   - 重构 generate_embeddings.py 使用流式批处理
   - 改用 embedding_manager 的内存安全实现
   - 文件从 548 行精简到 259 行 (52.7% 减少)

3. 修复内存泄露
   - chain_search.py: quick_search 使用 with 语句管理 ChainSearchEngine
   - embedding_manager.py: 使用 with 语句管理 VectorStore
   - vector_store.py: 添加暴力搜索内存警告

4. 代码清理
   - 移除 Symbol 模型的 token_count 和 symbol_type 字段
   - 清理相关测试用例

测试: 760 passed, 7 skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-21 16:22:03 +08:00
catlog22
3428642d04 feat: Optimize FTS search with batch symbol fetching and improved excerpt generation 2025-12-19 15:35:31 +08:00
catlog22
2f0cce0089 feat: Enhance CodexLens indexing and search capabilities with new CLI options and improved error handling 2025-12-19 15:10:37 +08:00
catlog22
69049e3f45 feat: Return complete method blocks in FTS search results
- Add helper methods to locate match lines and find containing symbols
- Modify search_fts, search_fts_exact, search_fts_fuzzy to return complete
  code blocks (functions/methods/classes) instead of short snippets
- Join with files table to get full content and file_id
- Query symbols table to find the smallest symbol containing the match
- Fall back to context lines when no symbol contains the match
- Add return_full_content and context_lines parameters for flexibility
- Include start_line, end_line, symbol_name, symbol_kind in SearchResult

This improves search result quality by returning semantically meaningful
code blocks rather than arbitrary 20-byte snippets.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 11:38:39 +08:00
catlog22
17af615fe2 Add help view and core memory styles
- Introduced styles for the help view including tab transitions, accordion animations, search highlighting, and responsive design.
- Implemented core memory styles with modal base styles, memory card designs, and knowledge graph visualization.
- Enhanced dark mode support across various components.
- Added loading states and empty state designs for better user experience.
2025-12-18 18:29:45 +08:00
catlog22
51a61bef31 Add parallel search mode and index progress bar
Features:
- CCW smart_search: Add 'parallel' mode that runs hybrid + exact + ripgrep
  simultaneously with RRF (Reciprocal Rank Fusion) for result merging
- Dashboard: Add real-time progress bar for CodexLens index initialization
- MCP: Return progress metadata in init action response
- Codex-lens: Auto-detect optimal worker count for parallel indexing

Changes:
- smart-search.ts: Add parallel mode, RRF fusion, progress tracking
- codex-lens.ts: Add onProgress callback support, progress parsing
- codexlens-routes.ts: Broadcast index progress via WebSocket
- codexlens-manager.js: New index progress modal with real-time updates
- notifications.js: Add WebSocket event handler registration system
- i18n.js: Add English/Chinese translations for progress UI
- index_tree.py: Workers parameter now auto-detects CPU count (max 16)
- commands.py: CLI --workers parameter supports auto-detection

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-17 23:17:15 +08:00
catlog22
df23975a0b Add comprehensive tests for schema cleanup migration and search comparison
- Implement tests for migration 005 to verify removal of deprecated fields in the database schema.
- Ensure that new databases are created with a clean schema.
- Validate that keywords are correctly extracted from the normalized file_keywords table.
- Test symbol insertion without deprecated fields and subdir operations without direct_files.
- Create a detailed search comparison test to evaluate vector search vs hybrid search performance.
- Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality.
- Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2025-12-16 19:27:05 +08:00
catlog22
3da0ef2adb Add comprehensive tests for query parsing and Reciprocal Rank Fusion
- Implemented tests for the QueryParser class, covering various identifier splitting methods (CamelCase, snake_case, kebab-case), OR expansion, and FTS5 operator preservation.
- Added parameterized tests to validate expected token outputs for different query formats.
- Created edge case tests to ensure robustness against unusual input scenarios.
- Developed tests for the Reciprocal Rank Fusion (RRF) algorithm, including score computation, weight handling, and result ranking across multiple sources.
- Included tests for normalization of BM25 scores and tagging search results with source metadata.
2025-12-16 10:20:19 +08:00
catlog22
35485bbbb1 feat: Enhance navigation and cleanup for graph explorer view
- Added a cleanup function to reset the state when navigating away from the graph explorer.
- Updated navigation logic to call the cleanup function before switching views.
- Improved internationalization by adding new translations for graph-related terms.
- Adjusted icon sizes for better UI consistency in the graph explorer.
- Implemented impact analysis button functionality in the graph explorer.
- Refactored CLI tool configuration to use updated model names.
- Enhanced CLI executor to handle prompts correctly for codex commands.
- Introduced code relationship storage for better visualization in the index tree.
- Added support for parsing Markdown and plain text files in the symbol parser.
- Updated tests to reflect changes in language detection logic.
2025-12-15 23:11:01 +08:00
catlog22
97640a517a feat(storage): implement storage manager for centralized management and cleanup
- Added a new Storage Manager component to handle storage statistics, project cleanup, and configuration for CCW centralized storage.
- Introduced functions to calculate directory sizes, get project storage stats, and clean specific or all storage.
- Enhanced SQLiteStore with a public API for executing queries securely.
- Updated tests to utilize the new execute_query method and validate storage management functionalities.
- Improved performance by implementing connection pooling with idle timeout management in SQLiteStore.
- Added new fields (token_count, symbol_type) to the symbols table and adjusted related insertions.
- Enhanced error handling and logging for storage operations.
2025-12-15 17:39:38 +08:00
catlog22
0fe16963cd Add comprehensive tests for tokenizer, performance benchmarks, and TreeSitter parser functionality
- Implemented unit tests for the Tokenizer class, covering various text inputs, edge cases, and fallback mechanisms.
- Created performance benchmarks comparing tiktoken and pure Python implementations for token counting.
- Developed extensive tests for TreeSitterSymbolParser across Python, JavaScript, and TypeScript, ensuring accurate symbol extraction and parsing.
- Added configuration documentation for MCP integration and custom prompts, enhancing usability and flexibility.
- Introduced a refactor script for GraphAnalyzer to streamline future improvements.
2025-12-15 14:36:09 +08:00
catlog22
0529b57694 Implement database migration framework and performance optimizations
- Added active memory configuration for manual interval and Gemini tool.
- Created file modification rules for handling edits and writes.
- Implemented migration manager for managing database schema migrations.
- Added migration 001 to normalize keywords into separate tables.
- Developed tests for validating performance optimizations including keyword normalization, path lookup, and symbol search.
- Created validation script to manually verify optimization implementations.
2025-12-14 18:08:32 +08:00
catlog22
79a2953862 Add comprehensive tests for vector/semantic search functionality
- Implement full coverage tests for Embedder model loading and embedding generation
- Add CRUD operations and caching tests for VectorStore
- Include cosine similarity computation tests
- Validate semantic search accuracy and relevance through various queries
- Establish performance benchmarks for embedding and search operations
- Ensure edge cases and error handling are covered
- Test thread safety and concurrent access scenarios
- Verify availability of semantic search dependencies
2025-12-14 17:17:09 +08:00
catlog22
08dc0a0348 perf(codex-lens): optimize search performance with vectorized operations
Performance Optimizations:
- VectorStore: NumPy vectorized cosine similarity (100x+ faster)
  - Cached embedding matrix with pre-computed norms
  - Lazy content loading for top-k results only
  - Thread-safe cache invalidation
- SQLite: Added PRAGMA mmap_size=30GB for memory-mapped I/O
- FTS5: unicode61 tokenizer with tokenchars='_' for code identifiers
- ChainSearch: files_only fast path skipping snippet generation
- ThreadPoolExecutor: shared pool across searches

New Components:
- DirIndexStore: single-directory index with FTS5 and symbols
- RegistryStore: global project registry with path mappings
- PathMapper: source-to-index path conversion utility
- IndexTreeBuilder: hierarchical index tree construction
- ChainSearchEngine: parallel recursive directory search

Test Coverage:
- 36 comprehensive search functionality tests
- 14 performance benchmark tests
- 296 total tests passing (100% pass rate)

Benchmark Results:
- FTS5 search: 0.23-0.26ms avg (3900-4300 ops/sec)
- Vector search: 1.05-1.54ms avg (650-955 ops/sec)
- Full semantic: 4.56-6.38ms avg per query

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 11:06:24 +08:00
catlog22
c42f91a7fe feat: Add support for Tree-Sitter parsing and enhance SQLite storage performance 2025-12-12 18:40:24 +08:00
catlog22
92d2085b64 Optimize SQLite FTS storage and pooling 2025-12-12 17:40:03 +08:00
catlog22
a393601ec5 feat(codexlens): add CodexLens code indexing platform with incremental updates
- Add CodexLens Python package with SQLite FTS5 search and tree-sitter parsing
- Implement workspace-local index storage (.codexlens/ directory)
- Add incremental update CLI command for efficient file-level index refresh
- Integrate CodexLens with CCW tools (codex_lens action: update)
- Add CodexLens Auto-Sync hook template for automatic index updates on file changes
- Add CodexLens status card in CCW Dashboard CLI Manager with install/init buttons
- Add server APIs: /api/codexlens/status, /api/codexlens/bootstrap, /api/codexlens/init

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 15:02:32 +08:00