Add comprehensive tests for tokenizer, performance benchmarks, and TreeSitter parser functionality

- Implemented unit tests for the Tokenizer class, covering various text inputs, edge cases, and fallback mechanisms. - Created performance benchmarks comparing tiktoken and pure Python implementations for token counting. - Developed extensive tests for TreeSitterSymbolParser across Python, JavaScript, and TypeScript, ensuring accurate symbol extraction and parsing. - Added configuration documentation for MCP integration and custom prompts, enhancing usability and flexibility. - Introduced a refactor script for GraphAnalyzer to streamline future improvements.
2026-02-10 02:24:35 +08:00 · 2025-12-15 14:36:09 +08:00
parent 82dcafff00
commit 0fe16963cd
49 changed files with 9307 additions and 438 deletions
--- a/codex-lens/src/codexlens/config.py
+++ b/codex-lens/src/codexlens/config.py
@@ -83,6 +83,9 @@ class Config:
    llm_timeout_ms: int = 300000
    llm_batch_size: int = 5

+    # Hybrid chunker configuration
+    hybrid_max_chunk_size: int = 2000  # Max characters per chunk before LLM refinement
+    hybrid_llm_refinement: bool = False  # Enable LLM-based semantic boundary refinement
    def __post_init__(self) -> None:
        try:
            self.data_dir = self.data_dir.expanduser().resolve()