feat: add semantic graph design for static code analysis

- Introduced a comprehensive design document for a Code Semantic Graph aimed at enhancing static analysis capabilities.
- Defined the architecture, core components, and implementation steps for analyzing function calls, data flow, and dependencies.
- Included detailed specifications for nodes and edges in the graph, along with database schema for storage.
- Outlined phases for implementation, technical challenges, success metrics, and application scenarios.
2026-03-30 20:21:09 +08:00 · 2025-12-15 09:47:18 +08:00
parent d91477ad80
commit 3ffb907a6f
17 changed files with 4557 additions and 261 deletions
--- a/codex-lens/docs/MULTILEVEL_CHUNKER_DESIGN.md
+++ b/codex-lens/docs/MULTILEVEL_CHUNKER_DESIGN.md
@@ -0,0 +1,973 @@
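+# Multi-Level Chunker Design
+
+## 1. Background and Goals
+
+### 1.1 Current Problems
+
+Both chunking strategies currently in `chunker.py` have clear weaknesses:
+
+**symbol-based strategy**:
+- ✅ Pros: preserves the logical integrity of the code; each chunk is a complete function or class
+- ❌ Cons: uneven granularity; very large functions can run to hundreds of lines, which hurts LLM processing and search precision
+
+**sliding-window strategy**:
+- ✅ Pros: uniform chunk sizes and full coverage
+- ❌ Cons: breaks logical structure and may cut a complete loop or conditional block in two
+
+### 1.2 Design Goals
+
+Implement a multi-level chunker that satisfies all of the following:
+1. **Semantic integrity**: keep the boundaries of code logic intact
+2. **Controllable granularity**: support flexible splitting from coarse (function level) to fine (logic-block level)
+3. **Hierarchy**: preserve parent-child relationships between chunks to support context retrieval
+4. **Efficient indexing**: optimize embedding and retrieval performance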
+
+## 2. Technical Architecture
+
+### 2.1 Two-Layer Chunking Architecture
+
+```
+Source Code
+    ↓
+[Layer 1: Symbol-Level Chunking]  ← uses the tree-sitter AST
+    ↓
+MacroChunks (Functions/Classes)
+    ↓
+[Layer 2: Logic-Block Chunking]   ← deep AST traversal
+    ↓
+MicroChunks (Loops/Conditionals/Blocks)
+    ↓
+Vector Embedding + Indexing
+```
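
As a rough illustration of the two layers, the sketch below swaps tree-sitter for Python's standard-library `ast` module so that it runs without extra dependencies. `SketchChunk` and `two_layer_chunks` are hypothetical names used only here; the design itself targets tree-sitter node types.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional

# Node types treated as logic blocks in this sketch (stdlib ast, not tree-sitter).
LOGIC_BLOCKS = (ast.For, ast.While, ast.If, ast.Try, ast.With)


@dataclass
class SketchChunk:
    level: int                          # 1 = macro, 2 = micro
    kind: str                           # e.g. "FunctionDef", "For"
    start_line: int
    end_line: int
    parent_start: Optional[int] = None  # start line of the enclosing macro chunk


def two_layer_chunks(source: str) -> List[SketchChunk]:
    """Layer 1: top-level functions/classes. Layer 2: logic blocks inside them."""
    chunks: List[SketchChunk] = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        chunks.append(SketchChunk(1, type(node).__name__, node.lineno, node.end_lineno))
        for inner in ast.walk(node):
            if isinstance(inner, LOGIC_BLOCKS):
                chunks.append(SketchChunk(2, type(inner).__name__,
                                          inner.lineno, inner.end_lineno, node.lineno))
    return chunks


SAMPLE = """def process(items):
    results = []
    for item in items:
        if item:
            results.append(item)
    return results
"""

for chunk in two_layer_chunks(SAMPLE):
    print(chunk)
```

Every micro chunk keeps a pointer to its macro chunk, which is the property the rest of the design relies on for context retrieval.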
+
+### 2.2 Core Components
+
+```python
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+# New data structures
+@dataclass
+class ChunkMetadata:
+    """Chunk metadata."""
+    chunk_id: str
+    parent_id: Optional[str]  # ID of the parent chunk
+    level: int                 # level: 1=macro, 2=micro
+    chunk_type: str           # function/class/loop/conditional/try_except
+    file_path: str
+    start_line: int
+    end_line: int
+    symbol_name: Optional[str]
+    context_summary: Optional[str] = None  # context inherited from the parent chunk
+
+@dataclass
+class HierarchicalChunk:
+    """A hierarchical code chunk."""
+    metadata: ChunkMetadata
+    content: str
+    embedding: Optional[List[float]] = None
+    children: List['HierarchicalChunk'] = field(default_factory=list)
+```
+
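+## 3. Detailed Implementation Steps
+
+### 3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)
+
+**Approach**: reuse the existing `code_extractor.py` logic and extend its metadata extraction.
+
+```python
+from tree_sitter import Parser
+
+class MacroChunker:
+    """Layer-1 chunker: extracts top-level symbols."""
+
+    def __init__(self):
+        self.parser = Parser()
+        # Load the language grammar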
+
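+    def chunk_by_symbols(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Extract top-level function and class definitions."""
+        tree = self.parser.parse(bytes(content, 'utf-8'))
+        root_node = tree.root_node
+
+        chunks = []
+        for node in root_node.children:
+            # tree-sitter-python wraps a decorated symbol in a
+            # `decorated_definition` node; unwrap it to reach the symbol.
+            if node.type == 'decorated_definition':
+                node = node.child_by_field_name('definition') or node
+            if node.type in ['function_definition', 'class_definition',
+                           'method_definition']:
+                chunk = self._create_macro_chunk(node, content, file_path)
+                chunks.append(chunk)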
+
+        return chunks
+
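+    def _create_macro_chunk(
+        self,
+        node,
+        content: str,
+        file_path: str
+    ) -> HierarchicalChunk:
+        """Create a macro chunk from an AST node."""
+        start_line = node.start_point[0] + 1
+        end_line = node.end_point[0] + 1
+
+        # Extract the symbol name. tree-sitter reports byte offsets, so
+        # slice the UTF-8 bytes rather than the str.
+        source = content.encode('utf-8')
+        name_node = node.child_by_field_name('name')
+        symbol_name = (
+            source[name_node.start_byte:name_node.end_byte].decode('utf-8')
+            if name_node else None
+        )
+
+        # Extract the full code (including docstring and decorators)
+        chunk_content = self._extract_with_context(node, content)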
+
+        metadata = ChunkMetadata(
+            chunk_id=f"{file_path}:{start_line}",
+            parent_id=None,
+            level=1,
+            chunk_type=node.type,
+            file_path=file_path,
+            start_line=start_line,
+            end_line=end_line,
+            symbol_name=symbol_name,
+        )
+
+        return HierarchicalChunk(
+            metadata=metadata,
+            content=chunk_content,
+        )
+
+    def _extract_with_context(self, node, content: str) -> str:
+        """Extract the code, including decorators and the docstring."""
+        # Walk backwards over the preceding decorators
+        start_byte = node.start_byte
+        prev_sibling = node.prev_sibling
+        while prev_sibling and prev_sibling.type == 'decorator':
+            start_byte = prev_sibling.start_byte
+            prev_sibling = prev_sibling.prev_sibling
+
+        # Slice the UTF-8 bytes: tree-sitter offsets are byte offsets
+        return content.encode('utf-8')[start_byte:node.end_byte].decode('utf-8')
+```
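
One detail worth illustrating: tree-sitter reports `start_byte` and `end_byte` as offsets into the UTF-8 encoded source, and these differ from `str` indices as soon as the file contains a non-ASCII character. A minimal demonstration follows, using a made-up source string and locating the bytes with `bytes.index` in place of a real parser:

```python
source = 'name = "café"\ndef run():\n    pass\n'
encoded = source.encode("utf-8")

# Byte offsets of the identifier `run`, as a byte-oriented parser would report them.
start_byte = encoded.index(b"run")
end_byte = start_byte + len(b"run")

# Slicing the str with byte offsets drifts by one character per extra byte.
wrong = source[start_byte:end_byte]
# Slicing the encoded bytes and decoding afterwards recovers the identifier.
right = encoded[start_byte:end_byte].decode("utf-8")

print(repr(wrong), repr(right))
```

The drift grows with every multi-byte character that precedes the node, so files with non-ASCII comments or string literals are the ones affected.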
+
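+### 3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)
+
+**Approach**: within each macro chunk, split further along its logical structure.
+
+```python
+class MicroChunker:
+    """Layer-2 chunker: extracts logic blocks."""
+
+    # Logic-block node types to split on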
+    LOGIC_BLOCK_TYPES = {
+        'for_statement',
+        'while_statement',
+        'if_statement',
+        'try_statement',
+        'with_statement',
+    }
+
+    def __init__(self):
+        self.parser = Parser()
+        # Load the language grammar (same setup as MacroChunker)
+
+    def chunk_logic_blocks(
+        self,
+        macro_chunk: HierarchicalChunk,
+        content: str,
+        max_lines: int = 50  # only macro chunks longer than this are split again
+    ) -> List[HierarchicalChunk]:
+        """Extract logic blocks inside a macro chunk."""
+
+        # Small functions do not need a second pass
+        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
+        if total_lines <= max_lines:
+            return []
+
+        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
+        root_node = tree.root_node
+
+        micro_chunks = []
+        self._traverse_logic_blocks(
+            root_node,
+            macro_chunk,
+            content,
+            micro_chunks
+        )
+
+        return micro_chunks
+
+    def _traverse_logic_blocks(
+        self,
+        node,
+        parent_chunk: HierarchicalChunk,
+        content: str,
+        result: List[HierarchicalChunk]
+    ):
+        """Recursively traverse the AST and extract logic blocks."""
+
+        if node.type in self.LOGIC_BLOCK_TYPES:
+            micro_chunk = self._create_micro_chunk(
+                node,
+                parent_chunk,
+                content
+            )
+            result.append(micro_chunk)
+            parent_chunk.children.append(micro_chunk)
+
+        # Continue with the child nodes
+        for child in node.children:
+            self._traverse_logic_blocks(child, parent_chunk, content, result)
+
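+    def _create_micro_chunk(
+        self,
+        node,
+        parent_chunk: HierarchicalChunk,
+        content: str
+    ) -> HierarchicalChunk:
+        """Create a micro chunk."""
+
+        # Compute line numbers relative to the file
+        start_line = parent_chunk.metadata.start_line + node.start_point[0]
+        end_line = parent_chunk.metadata.start_line + node.end_point[0]
+
+        # The node comes from parsing the macro chunk's own text, so its byte
+        # offsets index into parent_chunk.content rather than the whole file.
+        parent_bytes = parent_chunk.content.encode('utf-8')
+        chunk_content = parent_bytes[node.start_byte:node.end_byte].decode('utf-8')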
+
+        metadata = ChunkMetadata(
+            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
+            parent_id=parent_chunk.metadata.chunk_id,
+            level=2,
+            chunk_type=node.type,
+            file_path=parent_chunk.metadata.file_path,
+            start_line=start_line,
+            end_line=end_line,
+            symbol_name=parent_chunk.metadata.symbol_name,  # inherit the parent's symbol name
+            context_summary=None,  # filled in later by the LLM
+        )
+
+        return HierarchicalChunk(
+            metadata=metadata,
+            content=chunk_content,
+        )
+```
+
+### 3.3 Unified Interface: the Multi-Level Chunker
+
+```python
+class HierarchicalChunker:
+    """Unified interface for multi-level chunking."""
+
+    def __init__(self, config: Optional[ChunkConfig] = None):
+        self.config = config or ChunkConfig()
+        self.macro_chunker = MacroChunker()
+        self.micro_chunker = MicroChunker()
+
+    def chunk_file(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Chunk a file at multiple levels."""
+
+        # Layer 1: symbol-level chunking
+        macro_chunks = self.macro_chunker.chunk_by_symbols(
+            content, file_path, language
+        )
+
+        # Layer 2: logic-block chunking
+        all_chunks = []
+        for macro_chunk in macro_chunks:
+            all_chunks.append(macro_chunk)
+
+            # Split large functions a second time
+            micro_chunks = self.micro_chunker.chunk_logic_blocks(
+                macro_chunk, content
+            )
+            all_chunks.extend(micro_chunks)
+
+        return all_chunks
+
+    def chunk_file_with_fallback(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Chunk with a fallback strategy."""
+
+        try:
+            return self.chunk_file(content, file_path, language)
+        except Exception as e:
+            logger.warning(f"Hierarchical chunking failed: {e}, falling back to sliding window")
+            # Fall back to the sliding-window strategy
+            return self._fallback_sliding_window(content, file_path, language)
+```
+
+## 4. Data Storage Design
+
+### 4.1 Database Schema
+
+```sql
+-- chunks table: stores chunks of every level
+-- Note: SQLite only enforces ON DELETE CASCADE when each connection
+-- runs PRAGMA foreign_keys = ON.
+CREATE TABLE chunks (
+    chunk_id TEXT PRIMARY KEY,
+    parent_id TEXT,           -- parent chunk ID; NULL for top-level chunks
+    level INTEGER NOT NULL,   -- 1=macro, 2=micro
+    chunk_type TEXT NOT NULL, -- function/class/loop/if/try, etc.
+    file_path TEXT NOT NULL,
+    start_line INTEGER NOT NULL,
+    end_line INTEGER NOT NULL,
+    symbol_name TEXT,
+    content TEXT NOT NULL,
+    content_hash TEXT,        -- used to detect content changes
+
+    -- Semantic metadata (generated by the LLM)
+    summary TEXT,
+    keywords TEXT,            -- JSON array
+    purpose TEXT,
+
+    -- Vector embedding
+    embedding BLOB,           -- stores the vector
+
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+
+    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
+);
+
+-- Indexes
+CREATE INDEX idx_chunks_file_path ON chunks(file_path);
+CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
+CREATE INDEX idx_chunks_level ON chunks(level);
+CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);
+```
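
The self-referencing foreign key is what lets a macro chunk take its micro chunks with it on deletion. The sketch below exercises that behaviour on a trimmed-down copy of the table (only three columns, chosen for brevity) in an in-memory SQLite database; the chunk IDs are made-up examples in the `file:line` style used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.execute("""
    CREATE TABLE chunks (
        chunk_id  TEXT PRIMARY KEY,
        parent_id TEXT,
        level     INTEGER NOT NULL,
        FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
    )
""")
conn.executemany(
    "INSERT INTO chunks (chunk_id, parent_id, level) VALUES (?, ?, ?)",
    [
        ("a.py:1", None, 1),          # macro chunk
        ("a.py:1:L3", "a.py:1", 2),   # micro chunks pointing at it
        ("a.py:1:L9", "a.py:1", 2),
    ],
)

# Deleting the macro chunk removes its micro chunks as well.
conn.execute("DELETE FROM chunks WHERE chunk_id = ?", ("a.py:1",))
remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
print(remaining)
```

Without the `PRAGMA` line the delete would leave the two micro rows behind as orphans.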
+
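+### 4.2 Vector Index
+
+Use a hierarchical indexing strategy:
+
+```python
+import sqlite3
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple
+
+class HierarchicalVectorStore:
+    """Hierarchical vector store."""
+
+    def __init__(self, db_path: Path):
+        self.db_path = db_path
+        self.conn = sqlite3.connect(db_path)
+
+    def add_chunk(self, chunk: HierarchicalChunk):
+        """Add a chunk together with its embedding."""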
+
+        cursor = self.conn.cursor()
+        cursor.execute("""
+            INSERT INTO chunks (
+                chunk_id, parent_id, level, chunk_type,
+                file_path, start_line, end_line, symbol_name,
+                content, embedding
+            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        """, (
+            chunk.metadata.chunk_id,
+            chunk.metadata.parent_id,
+            chunk.metadata.level,
+            chunk.metadata.chunk_type,
+            chunk.metadata.file_path,
+            chunk.metadata.start_line,
+            chunk.metadata.end_line,
+            chunk.metadata.symbol_name,
+            chunk.content,
+            self._serialize_embedding(chunk.embedding),
+        ))
+
+        self.conn.commit()
+
+    def search_hierarchical(
+        self,
+        query_embedding: List[float],
+        top_k: int = 10,
+        level_weights: Optional[Dict[int, float]] = None
+    ) -> List[Tuple[HierarchicalChunk, float]]:
+        """Level-weighted retrieval."""
+
+        # Default weights: macro chunks score higher
+        if level_weights is None:
+            level_weights = {1: 1.0, 2: 0.8}
+
+        # Scan every chunk that has an embedding
+        cursor = self.conn.cursor()
+        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")
+
+        results = []
+        for row in cursor.fetchall():
+            chunk = self._row_to_chunk(row)
+            similarity = self._cosine_similarity(
+                query_embedding,
+                chunk.embedding
+            )
+
+            # Apply the weight for the chunk's level
+            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
+            results.append((chunk, weighted_score))
+
+        # Sort by score
+        results.sort(key=lambda x: x[1], reverse=True)
+        return results[:top_k]
+
+    def get_chunk_with_context(
+        self,
+        chunk_id: str
+    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
+        """Fetch a chunk together with its parent chunk (for context)."""
+
+        cursor = self.conn.cursor()
+
+        # Fetch the chunk itself
+        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
+        chunk_row = cursor.fetchone()
+        if chunk_row is None:
+            raise KeyError(f"Unknown chunk_id: {chunk_id}")
+        chunk = self._row_to_chunk(chunk_row)
+
+        # Fetch the parent chunk
+        parent = None
+        if chunk.metadata.parent_id:
+            cursor.execute(
+                "SELECT * FROM chunks WHERE chunk_id = ?",
+                (chunk.metadata.parent_id,)
+            )
+            parent_row = cursor.fetchone()
+            if parent_row:
+                parent = self._row_to_chunk(parent_row)
+
+        return chunk, parent
+```
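
`_serialize_embedding` and `_cosine_similarity` are referenced above but not defined in this document. One possible shape is sketched below as free functions, assuming little-endian float32 packing via `struct`; that storage format is an assumption for illustration, not a confirmed project choice.

```python
import math
import struct
from typing import List, Optional


def serialize_embedding(embedding: Optional[List[float]]) -> Optional[bytes]:
    """Pack floats as little-endian float32 for a BLOB column."""
    if embedding is None:
        return None
    return struct.pack(f"<{len(embedding)}f", *embedding)


def deserialize_embedding(blob: Optional[bytes]) -> Optional[List[float]]:
    """Inverse of serialize_embedding (4 bytes per value)."""
    if blob is None:
        return None
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Round-trip a stored vector and score it as a level-2 (micro) chunk.
query = [1.0, 0.0]
stored = deserialize_embedding(serialize_embedding([0.5, 0.5]))
level_weights = {1: 1.0, 2: 0.8}
score = cosine_similarity(query, stored) * level_weights[2]
print(round(score, 4))
```

Note that float32 packing loses precision for values that are not exactly representable, which is usually acceptable for similarity ranking.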
+
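+## 5. LLM Integration Strategy
+
+### 5.1 Level-by-Level Generation of Semantic Metadata
+
+```python
+class HierarchicalLLMEnhancer:
+    """Generates semantic metadata for hierarchical chunks."""
+
+    def enhance_hierarchical_chunks(
+        self,
+        chunks: List[HierarchicalChunk]
+    ) -> Dict[str, SemanticMetadata]:
+        """
+        Level-by-level processing:
+        1. Process all level=1 macro chunks first and generate detailed summaries
+        2. Then process level=2 micro chunks, using the parent chunk's summary as context
+        """
+
+        results = {}
+
+        # Pass 1: process macro chunks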
+        macro_chunks = [c for c in chunks if c.metadata.level == 1]
+        macro_metadata = self.llm_enhancer.enhance_files([
+            FileData(
+                path=c.metadata.chunk_id,
+                content=c.content,
+                language=self._detect_language(c.metadata.file_path)
+            )
+            for c in macro_chunks
+        ])
+        results.update(macro_metadata)
+
+        # Pass 2: process micro chunks (with parent context)
+        micro_chunks = [c for c in chunks if c.metadata.level == 2]
+        fence = "`" * 3  # built at runtime so this listing holds no literal code fence
+        for micro_chunk in micro_chunks:
+            parent_id = micro_chunk.metadata.parent_id
+            # SemanticMetadata is read by attribute here, as in the integration test
+            parent_meta = macro_metadata.get(parent_id)
+            parent_summary = parent_meta.summary if parent_meta else ''
+
+            # Build a prompt that carries the parent context
+            enhanced_prompt = f"""
+Parent Function: {micro_chunk.metadata.symbol_name}
+Parent Summary: {parent_summary}
+
+Code Block ({micro_chunk.metadata.chunk_type}):
+{fence}
+{micro_chunk.content}
+{fence}
+
+Generate a concise summary (1 sentence) and keywords for this specific code block.
+"""
+
+            metadata = self._call_llm_with_context(enhanced_prompt)
+            results[micro_chunk.metadata.chunk_id] = metadata
+
+        return results
+```
+
+### 5.2 Prompt Optimization
+
+Use a different prompt template for each level:
+
+**Macro Chunk Prompt (Level 1)**:
+````
+PURPOSE: Generate comprehensive semantic metadata for a complete function/class
+TASK:
+- Provide a detailed summary (2-3 sentences) covering what the code does and why
+- Extract 8-12 relevant keywords including technical terms and domain concepts
+- Identify the primary purpose/category
+MODE: analysis
+
+CODE:
+```{language}
+{content}
+```
+
+OUTPUT: JSON with summary, keywords, purpose
+````
+
+**Micro Chunk Prompt (Level 2)**:
+````
+PURPOSE: Summarize a specific logic block within a larger function
+CONTEXT:
+- Parent Function: {symbol_name}
+- Parent Purpose: {parent_summary}
+
+TASK:
+- Provide a brief summary (1 sentence) of this specific block's role in the parent function
+- Extract 3-5 keywords specific to this block's logic
+MODE: analysis
+
+CODE BLOCK ({chunk_type}):
+```{language}
+{content}
+```
+
+OUTPUT: JSON with summary, keywords
+````
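
A template like the Level 2 prompt can be filled with plain `str.format`. The sketch below is one way to do it; `MICRO_PROMPT_TEMPLATE`, `render_micro_prompt`, and the sample values are illustrative names and data, not part of the existing code base. The code fence is passed in as a placeholder so the template text itself contains no literal triple backticks.

```python
FENCE = "`" * 3  # avoids a literal code fence inside this listing

MICRO_PROMPT_TEMPLATE = (
    "PURPOSE: Summarize a specific logic block within a larger function\n"
    "CONTEXT:\n"
    "- Parent Function: {symbol_name}\n"
    "- Parent Purpose: {parent_summary}\n"
    "\n"
    "TASK:\n"
    "- Provide a brief summary (1 sentence) of this specific block's role in the parent function\n"
    "- Extract 3-5 keywords specific to this block's logic\n"
    "MODE: analysis\n"
    "\n"
    "CODE BLOCK ({chunk_type}):\n"
    "{fence}{language}\n"
    "{content}\n"
    "{fence}\n"
    "\n"
    "OUTPUT: JSON with summary, keywords\n"
)


def render_micro_prompt(symbol_name: str, parent_summary: str, chunk_type: str,
                        language: str, content: str) -> str:
    """Fill the Level 2 template; braces inside `content` are left untouched."""
    return MICRO_PROMPT_TEMPLATE.format(
        symbol_name=symbol_name,
        parent_summary=parent_summary,
        chunk_type=chunk_type,
        language=language,
        content=content,
        fence=FENCE,
    )


prompt = render_micro_prompt(
    symbol_name="process_items",
    parent_summary="Filters active items and processes each one.",
    chunk_type="for_statement",
    language="python",
    content="for item in items:\n    totals = {}",
)
print(prompt)
```

`str.format` only interprets braces in the template, so code containing `{}` can be substituted safely as a value.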
+
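+## 6. Retrieval Enhancements
+
+### 6.1 Context-Expanded Retrieval
+
+```python
+class ContextualSearchEngine:
+    """Search engine with context expansion."""
+
+    def __init__(self, vector_store, embedder):
+        self.vector_store = vector_store
+        self.embedder = embedder
+
+    def search_with_context(
+        self,
+        query: str,
+        top_k: int = 10,
+        expand_context: bool = True
+    ) -> List[SearchResult]:
+        """
+        Search and expand context automatically.
+
+        When a micro chunk matches, its parent macro chunk is returned as context.
+        """
+
+        # Embed the query
+        query_embedding = self.embedder.embed_single(query)
+
+        # Level-weighted retrieval
+        raw_results = self.vector_store.search_hierarchical(
+            query_embedding,
+            top_k=top_k
+        )
+
+        # Expand context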
+        enriched_results = []
+        for chunk, score in raw_results:
+            result = SearchResult(
+                path=chunk.metadata.file_path,
+                score=score,
+                content=chunk.content,
+                start_line=chunk.metadata.start_line,
+                end_line=chunk.metadata.end_line,
+                symbol_name=chunk.metadata.symbol_name,
+            )
+
+            # For a micro chunk, fetch the parent chunk as context
+            if expand_context and chunk.metadata.level == 2:
+                # get_chunk_with_context returns (chunk, parent)
+                _, parent_chunk = self.vector_store.get_chunk_with_context(
+                    chunk.metadata.chunk_id
+                )
+                if parent_chunk:
+                    result.metadata['parent_context'] = {
+                        'summary': parent_chunk.metadata.context_summary,
+                        'symbol_name': parent_chunk.metadata.symbol_name,
+                        'content': parent_chunk.content,
+                    }
+
+            enriched_results.append(result)
+
+        return enriched_results
+```
+
+## 7. Testing Strategy
+
+### 7.1 Unit Tests
+
+```python
+import pytest
+from codexlens.semantic.hierarchical_chunker import (
+    HierarchicalChunker, MacroChunker, MicroChunker
+)
+
+class TestMacroChunker:
+    """Tests for layer-1 chunking."""
+
+    def test_extract_functions(self):
+        """Extracts function definitions."""
+        code = '''
+def calculate_total(items):
+    """Calculate total price."""
+    total = 0
+    for item in items:
+        total += item.price
+    return total
+
+def apply_discount(total, discount):
+    """Apply discount to total."""
+    return total * (1 - discount)
+'''
+        chunker = MacroChunker()
+        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        assert len(chunks) == 2
+        assert chunks[0].metadata.symbol_name == 'calculate_total'
+        assert chunks[1].metadata.symbol_name == 'apply_discount'
+        assert chunks[0].metadata.level == 1
+
+    def test_extract_with_decorators(self):
+        """Extracts functions together with their decorators."""
+        code = '''
+@app.route('/api/users')
+@auth_required
+def get_users():
+    return User.query.all()
+'''
+        chunker = MacroChunker()
+        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        assert len(chunks) == 1
+        assert '@app.route' in chunks[0].content
+        assert '@auth_required' in chunks[0].content
+
+class TestMicroChunker:
+    """Tests for layer-2 chunking."""
+
+    def test_extract_loop_blocks(self):
+        """Extracts loop blocks."""
+        code = '''
+def process_items(items):
+    results = []
+    for item in items:
+        if item.active:
+            results.append(process(item))
+    return results
+'''
+        macro_chunker = MacroChunker()
+        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')
+
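+        micro_chunker = MicroChunker()
+        # Lower the threshold: the default max_lines=50 would skip this short function
+        micro_chunks = micro_chunker.chunk_logic_blocks(
+            macro_chunks[0], code, max_lines=0
+        )
+
+        # The for loop and the if block should be extracted
+        assert len(micro_chunks) >= 1
+        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)
+
+    def test_skip_small_functions(self):
+        """Small functions skip the second pass."""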
+        code = '''
+def small_func(x):
+    return x * 2
+'''
+        macro_chunker = MacroChunker()
+        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        micro_chunker = MicroChunker()
+        micro_chunks = micro_chunker.chunk_logic_blocks(
+            macro_chunks[0], code, max_lines=10
+        )
+
+        # A small function must not be split a second time
+        assert len(micro_chunks) == 0
+
+class TestHierarchicalChunker:
+    """Tests for the complete multi-level chunker."""
+
+    def test_full_hierarchical_chunking(self):
+        """Runs the full hierarchical chunking flow."""
+        code = '''
+def complex_function(data):
+    """A complex function with multiple logic blocks."""
+
+    # Validation
+    if not data:
+        raise ValueError("Data is empty")
+
+    # Processing
+    results = []
+    for item in data:
+        try:
+            processed = process_item(item)
+            results.append(processed)
+        except Exception as e:
+            logger.error(f"Failed to process: {e}")
+            continue
+
+    # Aggregation
+    total = sum(r.value for r in results)
+    return total
+'''
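+        chunker = HierarchicalChunker()
+        chunks = chunker.chunk_file(code, 'test.py', 'python')
+
+        # Expect 1 macro chunk and several micro chunks. This assumes the
+        # configured split threshold is below this function's length; with the
+        # default max_lines=50 no micro chunks would be produced.
+        macro_chunks = [c for c in chunks if c.metadata.level == 1]
+        micro_chunks = [c for c in chunks if c.metadata.level == 2]
+
+        assert len(macro_chunks) == 1
+        assert len(micro_chunks) > 0
+
+        # Verify the parent-child relationships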
+        for micro in micro_chunks:
+            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id
+```
+
+### 7.2 Integration Tests
+
+```python
+class TestHierarchicalIndexing:
+    """Tests the complete indexing flow."""
+
+    def test_index_and_search(self, tmp_path):
+        """Tests hierarchical indexing and retrieval."""
+
+        # 1. Chunk (sample_code is assumed to be a source string defined elsewhere)
+        chunker = HierarchicalChunker()
+        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')
+
+        # 2. LLM enhancement
+        enhancer = HierarchicalLLMEnhancer()
+        metadata = enhancer.enhance_hierarchical_chunks(chunks)
+
+        # 3. Embed
+        embedder = Embedder()
+        for chunk in chunks:
+            text = metadata[chunk.metadata.chunk_id].summary
+            chunk.embedding = embedder.embed_single(text)
+
+        # 4. Store (tmp_path is pytest's per-test temporary directory)
+        vector_store = HierarchicalVectorStore(tmp_path / 'test.db')
+        for chunk in chunks:
+            vector_store.add_chunk(chunk)
+
+        # 5. Search
+        search_engine = ContextualSearchEngine(vector_store, embedder)
+        results = search_engine.search_with_context(
+            "find loop that processes items",
+            top_k=5
+        )
+
+        # Verify the results
+        assert len(results) > 0
+        assert any(r.metadata.get('parent_context') for r in results)
+```
+
+## 8. Performance Optimization
+
+### 8.1 Batch Processing
+
+```python
+class BatchHierarchicalProcessor:
+    """Runs hierarchical chunking over many files in batches."""
+
+    def process_files_batch(
+        self,
+        file_paths: List[Path],
+        batch_size: int = 10
+    ):
+        """Process in batches to reduce LLM calls."""
+
+        all_chunks = []
+
+        # 1. Chunk every file
+        for file_path in file_paths:
+            content = file_path.read_text(encoding='utf-8')
+            chunks = self.chunker.chunk_file(
+                content, str(file_path), self._detect_language(file_path)
+            )
+            all_chunks.extend(chunks)
+
+        # 2. Batched LLM enhancement (fewer API calls)
+        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
+        for i in range(0, len(macro_chunks), batch_size):
+            batch = macro_chunks[i:i+batch_size]
+            self.enhancer.enhance_batch(batch)
+
+        # 3. Batched embedding
+        all_texts = [c.content for c in all_chunks]
+        embeddings = self.embedder.embed_batch(all_texts)
+        for chunk, embedding in zip(all_chunks, embeddings):
+            chunk.embedding = embedding
+
+        # 4. Batched storage
+        self.vector_store.add_chunks_batch(all_chunks)
+```
+
+### 8.2 Incremental Updates
+
+```python
+import hashlib
+
+class IncrementalIndexer:
+    """Incremental indexer: only processes files that changed."""
+
+    def update_file(self, file_path: Path):
+        """Incrementally update a single file."""
+
+        content = file_path.read_text(encoding='utf-8')
+        content_hash = hashlib.sha256(content.encode('utf-8')).hexdigest()
+
+        # Check whether the file changed. This compares against the hash
+        # stored on a level-1 row, so it assumes those rows record the hash
+        # of the whole file rather than of the chunk text.
+        cursor = self.conn.cursor()
+        cursor.execute("""
+            SELECT content_hash FROM chunks
+            WHERE file_path = ? AND level = 1
+            LIMIT 1
+        """, (str(file_path),))
+
+        row = cursor.fetchone()
+        if row and row[0] == content_hash:
+            logger.info(f"File {file_path} unchanged, skipping")
+            return
+
+        # Delete the old chunks
+        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))
+
+        # Re-index
+        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
+        # ... continue processing
+```
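
The change check above reads a file-level hash out of the `chunks` table. An alternative that keeps file hashes separate from chunk hashes is sketched below; the `file_hashes` table and the `needs_reindex` helper are hypothetical additions for illustration, not part of the schema in section 4.1.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical helper table: one row per indexed file.
conn.execute(
    "CREATE TABLE file_hashes (file_path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
)


def needs_reindex(conn: sqlite3.Connection, file_path: str, content: str) -> bool:
    """Return True and record the new hash if the file is new or has changed."""
    new_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM file_hashes WHERE file_path = ?", (file_path,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False
    # INSERT OR REPLACE overwrites the previous hash for this path, if any.
    conn.execute(
        "INSERT OR REPLACE INTO file_hashes (file_path, content_hash) VALUES (?, ?)",
        (file_path, new_hash),
    )
    conn.commit()
    return True


first = needs_reindex(conn, "a.py", "x = 1")   # new file
second = needs_reindex(conn, "a.py", "x = 1")  # unchanged
third = needs_reindex(conn, "a.py", "x = 2")   # edited
print(first, second, third)
```

Keeping the file hash in its own table leaves `chunks.content_hash` free to describe each chunk's own text.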
+
+## 9. Potential Problems and Solutions
+
+### 9.1 Problem: Too Many Micro Chunks in Very Large Functions
+
+**Symptom**: some legacy functions exceed 1000 lines and may produce dozens of micro chunks.
+
+**Solution**:
+```python
+class AdaptiveMicroChunker:
+    """Adaptive micro chunking: the strategy depends on function size."""
+
+    def chunk_logic_blocks(self, macro_chunk, content):
+        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
+
+        if total_lines > 500:
+            # Very large function: extract top-level logic blocks only, no recursion
+            return self._extract_top_level_blocks(macro_chunk, content)
+        elif total_lines > 100:
+            # Large function: limit recursion depth to 2 levels
+            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
+        else:
+            # Normal function: skip micro chunking entirely
+            return []
+```
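
`_extract_blocks_with_depth_limit` is not defined in this document. The sketch below shows the depth-limiting idea using the standard-library `ast` module in place of tree-sitter, so it runs standalone; `blocks_with_depth_limit` and its tuple output are illustrative choices. Depth counts nested logic blocks only, not every AST level.

```python
import ast
from typing import List, Tuple

LOGIC_BLOCKS = (ast.For, ast.While, ast.If, ast.Try, ast.With)


def blocks_with_depth_limit(func: ast.AST, max_depth: int) -> List[Tuple[str, int, int]]:
    """Collect (kind, depth, line) for logic blocks nested at most max_depth deep."""
    found: List[Tuple[str, int, int]] = []

    def visit(node: ast.AST, depth: int) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, LOGIC_BLOCKS):
                # Anything below a block that is already too deep is deeper still,
                # so the walk stops descending there.
                if depth + 1 <= max_depth:
                    found.append((type(child).__name__, depth + 1, child.lineno))
                    visit(child, depth + 1)
            else:
                visit(child, depth)

    visit(func, 0)
    return found


SAMPLE = """def f(rows):
    for row in rows:
        if row:
            while row:
                row = row[1:]
"""

func = ast.parse(SAMPLE).body[0]
print(blocks_with_depth_limit(func, max_depth=2))
```

Setting `max_depth=1` gives the "top-level blocks only" behaviour described for very large functions.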
+
+### 9.2 Problem: tree-sitter Parsing Fails
+
+**Symptom**: tree-sitter may fail to parse code that contains syntax errors.
+
+**Solution**:
+```python
+def chunk_file_with_fallback(self, content, file_path, language):
+    """Chunk with a fallback strategy."""
+
+    try:
+        # Try hierarchical chunking
+        return self.chunk_file(content, file_path, language)
+    except TreeSitterError as e:  # assumed to be a project-defined exception
+        logger.warning(f"Tree-sitter parsing failed: {e}")
+
+        # Fall back to simple regex-based symbol extraction
+        return self._fallback_regex_chunking(content, file_path)
+    except Exception as e:
+        logger.error(f"Chunking failed completely: {e}")
+
+        # Final fallback: sliding window
+        return self._fallback_sliding_window(content, file_path, language)
+```
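
The final fallback, `_fallback_sliding_window`, is referenced but not shown. A minimal line-based version might look like the following; the function name, the tuple return type, and the default `window` and `overlap` values are all assumptions for illustration.

```python
from typing import List, Tuple


def sliding_window_chunks(
    content: str, window: int = 40, overlap: int = 10
) -> List[Tuple[int, int, str]]:
    """Split text into overlapping line windows.

    Returns (start_line, end_line, text) tuples with 1-based inclusive lines.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    lines = content.splitlines()
    step = window - overlap
    chunks: List[Tuple[int, int, str]] = []
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break  # the last window already reaches the end of the file
    return chunks


text = "\n".join(f"line {i}" for i in range(1, 101))
for start_line, end_line, _ in sliding_window_chunks(text):
    print(start_line, end_line)
```

In the real chunker each tuple would be wrapped in a level-1 `HierarchicalChunk` with no `symbol_name`, so downstream code can treat fallback output like any other chunk.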
+
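+### 9.3 Problem: Vector Storage Footprint
+
+**Symptom**: storing a vector for every chunk can take a lot of space.
+
+**Solution**:
+- **Selective vectorization**: generate vectors only for macro chunks and important micro chunks
+- **Vector compression**: use PCA to reduce dimensionality, or quantization to shrink each stored value
+- **Separate storage**: keep vectors in a dedicated vector index (such as Faiss) and only metadata in SQLite
+
+```python
+class SelectiveVectorization:
+    """Selective vectorization: reduces storage overhead."""
+
+    VECTORIZE_CHUNK_TYPES = {
+        'function_definition',   # always vectorized
+        'class_definition',      # always vectorized
+        'for_statement',         # loop blocks
+        'try_statement',         # exception handling
+        # 'if_statement' is usually not vectorized on its own; it relies on the parent chunk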
+    }
+
+    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
+        """Decide whether to generate a vector for a chunk."""
+
+        # Level 1 is always vectorized
+        if chunk.metadata.level == 1:
+            return True
+
+        # Level 2 depends on type and size
+        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
+            return False
+
+        # Blocks that are too small (<5 lines) are not vectorized
+        lines = chunk.metadata.end_line - chunk.metadata.start_line
+        if lines < 5:
+            return False
+
+        return True
+```
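
For the vector-compression option, the sketch below shows symmetric int8 quantization with the standard library only. The 768-dimension size and the synthetic vector are assumptions chosen for the arithmetic; a production system would more likely rely on NumPy or the vector index's built-in quantizers.

```python
import struct
from typing import List, Tuple


def quantize_int8(vector: List[float]) -> Tuple[bytes, float]:
    """Symmetric int8 quantization: returns (packed bytes, scale)."""
    peak = max((abs(v) for v in vector), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = [round(v / scale) for v in vector]
    return struct.pack(f"{len(codes)}b", *codes), scale


def dequantize_int8(blob: bytes, scale: float) -> List[float]:
    """Recover approximate floats from the packed int8 codes."""
    return [code * scale for code in struct.unpack(f"{len(blob)}b", blob)]


dim = 768  # assumed embedding size
vector = [((i % 7) - 3) / 3.0 for i in range(dim)]  # synthetic values in [-1, 1]

blob, scale = quantize_int8(vector)
float32_bytes = dim * 4  # size of the same vector stored as float32
restored = dequantize_int8(blob, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(float32_bytes, len(blob), max_error)
```

One byte per dimension instead of four cuts the per-vector size to a quarter, at the cost of a rounding error bounded by half the scale; the scale has to be stored next to the blob.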
+
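+## 10. Implementation Roadmap
+
+### Phase 1: Core Architecture (2-3 weeks)
+- [x] Design the data structures (HierarchicalChunk, ChunkMetadata)
+- [ ] Implement MacroChunker (reusing the existing code_extractor)
+- [ ] Implement a basic MicroChunker
+- [ ] Database schema design and migration
+- [ ] Unit tests
+
+### Phase 2: LLM Integration (1-2 weeks)
+- [ ] Implement HierarchicalLLMEnhancer
+- [ ] Design per-level prompt templates
+- [ ] Batch processing optimization
+- [ ] Integration tests
+
+### Phase 3: Vectorization and Retrieval (1-2 weeks)
+- [ ] Implement HierarchicalVectorStore
+- [ ] Implement ContextualSearchEngine
+- [ ] Context expansion logic
+- [ ] Retrieval performance tests
+
+### Phase 4: Optimization and Polish (2 weeks)
+- [ ] Performance optimization (batch processing, incremental updates)
+- [ ] Harden the fallback strategies
+- [ ] Selective vectorization
+- [ ] Comprehensive testing and documentation
+
+### Phase 5: Production Deployment (1 week)
+- [ ] CLI integration
+- [ ] Expose configuration options
+- [ ] Production environment testing
+- [ ] Release
+
+**Total estimated time**: 7-10 weeks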
+
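+## 11. Success Metrics
+
+1. **Coverage**: at least 95% of code is chunked correctly
+2. **Accuracy**: hierarchy relationships are more than 98% accurate
+3. **Retrieval quality**: retrieval relevance improves by 30%+ over single-level chunking
+4. **Performance**: single-file chunking under 100 ms; batch throughput above 100 files/minute
+5. **Storage efficiency**: space usage drops by 40%+ compared with vectorizing every chunk
+
+## 12. References
+
+- [Tree-sitter Documentation](https://tree-sitter.github.io/)
+- [AST-based Code Analysis](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
+- [Hierarchical Text Segmentation](https://arxiv.org/abs/2104.08836)
+- Existing code: `src/codexlens/semantic/chunker.py`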