
Multilevel Chunker Design Proposal

1. Background and Goals

1.1 Current Problems

The two chunking strategies in the current chunker.py have clear shortcomings:

symbol-based strategy

  • Pros: preserves the logical integrity of the code; each chunk is a complete function/class
  • Cons: uneven granularity; very large functions can span hundreds of lines, which hurts LLM processing and search precision

sliding-window strategy

  • Pros: uniform chunk sizes and complete coverage
  • Cons: breaks logical structure and may cut complete loop/conditional blocks in half

1.2 Design Goals

Implement a multilevel chunker that satisfies all of the following (a possible configuration surface is sketched after this list):

  1. Semantic integrity: preserve the boundaries of code logic
  2. Controllable granularity: support flexible splitting from coarse-grained (function level) down to fine-grained (logic-block level)
  3. Hierarchy preservation: keep parent-child relationships between chunks to support contextual retrieval
  4. Efficient indexing: optimize embedding and retrieval performance
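
The ChunkConfig object passed to HierarchicalChunker in section 3.3 is never defined in this document. A minimal sketch of what such a configuration surface might expose; all field names and defaults here are assumptions for illustration:

from dataclasses import dataclass

@dataclass
class ChunkConfig:
    """Hypothetical configuration for the multilevel chunker."""
    micro_chunk_min_lines: int = 50      # macro chunks longer than this get a second pass
    max_traversal_depth: int = 3         # depth limit for logic-block traversal
    vectorize_micro_chunks: bool = True  # whether level-2 chunks receive embeddings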

2. Technical Architecture

2.1 Two-Layer Chunking Architecture

Source Code
    ↓
[Layer 1: Symbol-Level Chunking]  ← uses tree-sitter AST
    ↓
MacroChunks (Functions/Classes)
    ↓
[Layer 2: Logic-Block Chunking]   ← deep AST traversal
    ↓
MicroChunks (Loops/Conditionals/Blocks)
    ↓
Vector Embedding + Indexing

2.2 Core Components

# New data structures
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChunkMetadata:
    """Chunk metadata."""
    chunk_id: str
    parent_id: Optional[str]       # Parent chunk ID
    level: int                     # Level: 1 = macro, 2 = micro
    chunk_type: str                # function/class/loop/conditional/try_except
    file_path: str
    start_line: int
    end_line: int
    symbol_name: Optional[str] = None
    context_summary: Optional[str] = None  # Context inherited from the parent chunk

@dataclass
class HierarchicalChunk:
    """A hierarchical code chunk."""
    metadata: ChunkMetadata
    content: str
    embedding: Optional[List[float]] = None
    children: List['HierarchicalChunk'] = field(default_factory=list)

3. Detailed Implementation Steps

3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)

Implementation approach: reuse the existing code_extractor.py logic and enrich the extracted metadata.

class MacroChunker:
    """第一层分词器:提取顶层符号"""

    def __init__(self):
        self.parser = Parser()
        # Load the language grammar (see the sketch after this class)

    def chunk_by_symbols(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """提取顶层函数和类定义"""
        tree = self.parser.parse(bytes(content, 'utf-8'))
        root_node = tree.root_node

        chunks = []
        for node in root_node.children:
            if node.type in ['function_definition', 'class_definition',
                           'method_definition']:
                chunk = self._create_macro_chunk(node, content, file_path)
                chunks.append(chunk)

        return chunks

    def _create_macro_chunk(
        self,
        node,
        content: str,
        file_path: str
    ) -> HierarchicalChunk:
        """从AST节点创建macro chunk"""
        start_line = node.start_point[0] + 1
        end_line = node.end_point[0] + 1

        # Extract the symbol name
        name_node = node.child_by_field_name('name')
        symbol_name = content[name_node.start_byte:name_node.end_byte]

        # Extract the full code, including docstring and decorators
        chunk_content = self._extract_with_context(node, content)

        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start_line}",
            parent_id=None,
            level=1,
            chunk_type=node.type,
            file_path=file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=symbol_name,
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )

    def _extract_with_context(self, node, content: str) -> str:
        """提取代码包含装饰器和docstring"""
        # 向上查找装饰器
        start_byte = node.start_byte
        prev_sibling = node.prev_sibling
        while prev_sibling and prev_sibling.type == 'decorator':
            start_byte = prev_sibling.start_byte
            prev_sibling = prev_sibling.prev_sibling

        return content[start_byte:node.end_byte]
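
The grammar-loading step in MacroChunker.__init__ is left as a stub above. One way to fill it in, assuming the third-party tree_sitter_languages helper package is available (an assumption, not a dependency stated in this document):

from tree_sitter_languages import get_parser

class MacroChunker:
    def __init__(self, language: str = "python"):
        # get_parser returns a tree_sitter.Parser already configured
        # with the grammar for the requested language
        self.parser = get_parser(language)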

3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)

Implementation approach: further split each macro chunk along its internal logic structure.

class MicroChunker:
    """Layer-2 chunker: extracts logic blocks."""

    # Logic-block node types that trigger a split
    LOGIC_BLOCK_TYPES = {
        'for_statement',
        'while_statement',
        'if_statement',
        'try_statement',
        'with_statement',
    }

    def __init__(self):
        self.parser = Parser()
        # Load the language grammar, as in MacroChunker

    def chunk_logic_blocks(
        self,
        macro_chunk: HierarchicalChunk,
        content: str,
        max_lines: int = 50  # only macro chunks longer than this are split further
    ) -> List[HierarchicalChunk]:
        """在macro chunk内部提取逻辑块"""

        # Small functions do not need a second pass
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
        if total_lines <= max_lines:
            return []

        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        root_node = tree.root_node

        micro_chunks = []
        self._traverse_logic_blocks(
            root_node,
            macro_chunk,
            content,
            micro_chunks
        )

        return micro_chunks

    def _traverse_logic_blocks(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str,
        result: List[HierarchicalChunk]
    ):
        """递归遍历AST提取逻辑块"""

        if node.type in self.LOGIC_BLOCK_TYPES:
            micro_chunk = self._create_micro_chunk(
                node,
                parent_chunk,
                content
            )
            result.append(micro_chunk)
            parent_chunk.children.append(micro_chunk)

        # Continue with the child nodes
        for child in node.children:
            self._traverse_logic_blocks(child, parent_chunk, content, result)

    def _create_micro_chunk(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str
    ) -> HierarchicalChunk:
        """创建micro chunk"""

        # 计算相对于文件的行号
        start_line = parent_chunk.metadata.start_line + node.start_point[0]
        end_line = parent_chunk.metadata.start_line + node.end_point[0]

        chunk_content = content[node.start_byte:node.end_byte]

        metadata = ChunkMetadata(
            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
            parent_id=parent_chunk.metadata.chunk_id,
            level=2,
            chunk_type=node.type,
            file_path=parent_chunk.metadata.file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=parent_chunk.metadata.symbol_name,  # Inherit the parent symbol name
            context_summary=None,  # Filled in later by the LLM
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )

3.3 Unified Interface: The Multilevel Chunker

class HierarchicalChunker:
    """多层次分词器统一接口"""

    def __init__(self, config: ChunkConfig = None):
        self.config = config or ChunkConfig()
        self.macro_chunker = MacroChunker()
        self.micro_chunker = MicroChunker()

    def chunk_file(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """对文件进行多层次分词"""

        # 第一层:符号级分词
        macro_chunks = self.macro_chunker.chunk_by_symbols(
            content, file_path, language
        )

        # Layer 2: logic-block chunking
        all_chunks = []
        for macro_chunk in macro_chunks:
            all_chunks.append(macro_chunk)

            # Further split large functions
            micro_chunks = self.micro_chunker.chunk_logic_blocks(
                macro_chunk, content
            )
            all_chunks.extend(micro_chunks)

        return all_chunks

    def chunk_file_with_fallback(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """带降级策略的分词"""

        try:
            return self.chunk_file(content, file_path, language)
        except Exception as e:
            logger.warning(f"Hierarchical chunking failed: {e}, falling back to sliding window")
            # Fall back to the sliding-window strategy
            return self._fallback_sliding_window(content, file_path, language)
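
A minimal usage sketch of the unified interface, assuming the classes above; the input file path is hypothetical:

from pathlib import Path

chunker = HierarchicalChunker()
source_path = Path("src/example.py")  # hypothetical input file
source = source_path.read_text()

chunks = chunker.chunk_file_with_fallback(source, str(source_path), "python")
for chunk in chunks:
    indent = "  " * (chunk.metadata.level - 1)
    print(f"{indent}{chunk.metadata.chunk_type} "
          f"[{chunk.metadata.start_line}-{chunk.metadata.end_line}] "
          f"{chunk.metadata.symbol_name or ''}")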

4. Data Storage Design

4.1 Database Schema

-- chunks table: stores chunks at every level
CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY,
    parent_id TEXT,           -- Parent chunk ID; NULL means top level
    level INTEGER NOT NULL,   -- 1=macro, 2=micro
    chunk_type TEXT NOT NULL, -- function/class/loop/if/try, etc.
    file_path TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    symbol_name TEXT,
    content TEXT NOT NULL,
    content_hash TEXT,        -- Used to detect content changes

    -- Semantic metadata (generated by the LLM)
    summary TEXT,
    keywords TEXT,            -- JSON array
    purpose TEXT,

    -- Vector embedding
    embedding BLOB,           -- Serialized embedding vector

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
);

-- Index optimization
CREATE INDEX idx_chunks_file_path ON chunks(file_path);
CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
CREATE INDEX idx_chunks_level ON chunks(level);
CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);

4.2 Vector Indexing

Use a hierarchical indexing strategy:

class HierarchicalVectorStore:
    """层级化向量存储"""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)

    def add_chunk(self, chunk: HierarchicalChunk):
        """添加chunk及其向量"""

        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            chunk.metadata.chunk_id,
            chunk.metadata.parent_id,
            chunk.metadata.level,
            chunk.metadata.chunk_type,
            chunk.metadata.file_path,
            chunk.metadata.start_line,
            chunk.metadata.end_line,
            chunk.metadata.symbol_name,
            chunk.content,
            self._serialize_embedding(chunk.embedding),
        ))

        self.conn.commit()

    def search_hierarchical(
        self,
        query_embedding: List[float],
        top_k: int = 10,
        level_weights: Dict[int, float] = None
    ) -> List[Tuple[HierarchicalChunk, float]]:
        """层级化检索"""

        # Default weights: macro chunks weigh more
        if level_weights is None:
            level_weights = {1: 1.0, 2: 0.8}

        # Fetch all chunks that have embeddings
        cursor = self.conn.cursor()
        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")

        results = []
        for row in cursor.fetchall():
            chunk = self._row_to_chunk(row)
            similarity = self._cosine_similarity(
                query_embedding,
                chunk.embedding
            )

            # Apply the level-based weight
            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
            results.append((chunk, weighted_score))

        # Sort by score
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

    def get_chunk_with_context(
        self,
        chunk_id: str
    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
        """获取chunk及其父chunk提供上下文"""

        cursor = self.conn.cursor()

        # Fetch the chunk itself
        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
        chunk_row = cursor.fetchone()
        chunk = self._row_to_chunk(chunk_row)

        # Fetch the parent chunk
        parent = None
        if chunk.metadata.parent_id:
            cursor.execute(
                "SELECT * FROM chunks WHERE chunk_id = ?",
                (chunk.metadata.parent_id,)
            )
            parent_row = cursor.fetchone()
            if parent_row:
                parent = self._row_to_chunk(parent_row)

        return chunk, parent
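
The _serialize_embedding and _cosine_similarity helpers used above are not specified. A minimal numpy-based sketch, assuming embeddings are stored as float32:

import numpy as np

class HierarchicalVectorStore:  # continued: helper methods referenced above
    def _serialize_embedding(self, embedding):
        """Pack a float list into float32 bytes for the BLOB column; None stays None."""
        if embedding is None:
            return None
        return np.asarray(embedding, dtype=np.float32).tobytes()

    def _deserialize_embedding(self, blob):
        """Inverse of _serialize_embedding."""
        return None if blob is None else np.frombuffer(blob, dtype=np.float32).tolist()

    def _cosine_similarity(self, a, b):
        """Cosine similarity between two vectors."""
        a = np.asarray(a, dtype=np.float32)
        b = np.asarray(b, dtype=np.float32)
        denom = float(np.linalg.norm(a) * np.linalg.norm(b))
        return float(np.dot(a, b)) / denom if denom else 0.0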

5. LLM Integration Strategy

5.1 Layered Generation of Semantic Metadata

class HierarchicalLLMEnhancer:
    """为层级chunk生成语义元数据"""

    def enhance_hierarchical_chunks(
        self,
        chunks: List[HierarchicalChunk]
    ) -> Dict[str, SemanticMetadata]:
        """
        分层处理策略:
        1. 先处理所有level=1的macro chunks生成详细摘要
        2. 再处理level=2的micro chunks使用父chunk摘要作为上下文
        """

        results = {}

        # Pass 1: process macro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        macro_metadata = self.llm_enhancer.enhance_files([
            FileData(
                path=c.metadata.chunk_id,
                content=c.content,
                language=self._detect_language(c.metadata.file_path)
            )
            for c in macro_chunks
        ])
        results.update(macro_metadata)

        # Pass 2: process micro chunks with parent context
        micro_chunks = [c for c in chunks if c.metadata.level == 2]
        for micro_chunk in micro_chunks:
            parent_id = micro_chunk.metadata.parent_id
            parent_summary = macro_metadata.get(parent_id, {}).get('summary', '')

            # Build a prompt that includes the parent context
            enhanced_prompt = f"""
Parent Function: {micro_chunk.metadata.symbol_name}
Parent Summary: {parent_summary}

Code Block ({micro_chunk.metadata.chunk_type}):

{micro_chunk.content}


Generate a concise summary (1 sentence) and keywords for this specific code block.
"""

            metadata = self._call_llm_with_context(enhanced_prompt)
            results[micro_chunk.metadata.chunk_id] = metadata

        return results
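
The _call_llm_with_context helper is referenced above but never defined. A sketch under the assumption of a generic llm_client exposing complete(prompt) -> str (a hypothetical interface, not an API from this codebase), parsing the JSON the prompt requests:

import json

class HierarchicalLLMEnhancer:  # continued: helper assumed above
    def _call_llm_with_context(self, prompt: str) -> dict:
        """Send the context-enriched prompt and parse the JSON reply."""
        raw = self.llm_client.complete(prompt)  # hypothetical client interface
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to an empty record if the model does not return valid JSON
            return {"summary": "", "keywords": []}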

5.2 Prompt Optimization

Use a different prompt template for each level:

Macro Chunk Prompt (Level 1):

PURPOSE: Generate comprehensive semantic metadata for a complete function/class
TASK:
- Provide a detailed summary (2-3 sentences) covering what the code does and why
- Extract 8-12 relevant keywords including technical terms and domain concepts
- Identify the primary purpose/category
MODE: analysis

CODE ({language}):

{content}

OUTPUT: JSON with summary, keywords, purpose


Micro Chunk Prompt (Level 2):

PURPOSE: Summarize a specific logic block within a larger function
CONTEXT:

  • Parent Function: {symbol_name}
  • Parent Purpose: {parent_summary}

TASK:

  • Provide a brief summary (1 sentence) of this specific block's role in the parent function
  • Extract 3-5 keywords specific to this block's logic

MODE: analysis

CODE BLOCK ({chunk_type}):

{content}

OUTPUT: JSON with summary, keywords


6. Retrieval Enhancement

6.1 Context-Expanded Retrieval

class ContextualSearchEngine:
    """支持上下文扩展的检索引擎"""

    def search_with_context(
        self,
        query: str,
        top_k: int = 10,
        expand_context: bool = True
    ) -> List[SearchResult]:
        """
        检索并自动扩展上下文

        如果匹配到micro chunk自动返回其父macro chunk作为上下文
        """

        # Embed the query
        query_embedding = self.embedder.embed_single(query)

        # Hierarchical retrieval
        raw_results = self.vector_store.search_hierarchical(
            query_embedding,
            top_k=top_k
        )

        # Expand context
        enriched_results = []
        for chunk, score in raw_results:
            result = SearchResult(
                path=chunk.metadata.file_path,
                score=score,
                content=chunk.content,
                start_line=chunk.metadata.start_line,
                end_line=chunk.metadata.end_line,
                symbol_name=chunk.metadata.symbol_name,
            )

            # For micro chunks, fetch the parent chunk as context
            if expand_context and chunk.metadata.level == 2:
                _, parent_chunk = self.vector_store.get_chunk_with_context(
                    chunk.metadata.chunk_id
                )
                if parent_chunk:
                    result.metadata['parent_context'] = {
                        'summary': parent_chunk.metadata.context_summary,
                        'symbol_name': parent_chunk.metadata.symbol_name,
                        'content': parent_chunk.content,
                    }

            enriched_results.append(result)

        return enriched_results
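
SearchResult is used above but never defined in this document. Based on the fields the code accesses, a sketch could look like this (any field beyond those used above is an assumption):

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class SearchResult:
    """Result record returned by ContextualSearchEngine."""
    path: str
    score: float
    content: str
    start_line: int
    end_line: int
    symbol_name: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)  # holds parent_context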

7. Testing Strategy

7.1 Unit Tests

import pytest
from codexlens.semantic.hierarchical_chunker import (
    HierarchicalChunker, MacroChunker, MicroChunker
)

class TestMacroChunker:
    """测试第一层分词"""

    def test_extract_functions(self):
        """测试提取函数定义"""
        code = '''
def calculate_total(items):
    """Calculate total price."""
    total = 0
    for item in items:
        total += item.price
    return total

def apply_discount(total, discount):
    """Apply discount to total."""
    return total * (1 - discount)
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 2
        assert chunks[0].metadata.symbol_name == 'calculate_total'
        assert chunks[1].metadata.symbol_name == 'apply_discount'
        assert chunks[0].metadata.level == 1

    def test_extract_with_decorators(self):
        """测试提取带装饰器的函数"""
        code = '''
@app.route('/api/users')
@auth_required
def get_users():
    return User.query.all()
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 1
        assert '@app.route' in chunks[0].content
        assert '@auth_required' in chunks[0].content

class TestMicroChunker:
    """测试第二层分词"""

    def test_extract_loop_blocks(self):
        """测试提取循环块"""
        code = '''
def process_items(items):
    results = []
    for item in items:
        if item.active:
            results.append(process(item))
    return results
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=0  # force splitting for this small sample
        )

        # Should extract the for loop and the if block
        assert len(micro_chunks) >= 1
        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)

    def test_skip_small_functions(self):
        """测试小函数跳过二次划分"""
        code = '''
def small_func(x):
    return x * 2
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=10
        )

        # Small functions must not be split further
        assert len(micro_chunks) == 0

class TestHierarchicalChunker:
    """测试完整的多层次分词"""

    def test_full_hierarchical_chunking(self):
        """测试完整的层级分词流程"""
        code = '''
def complex_function(data):
    """A complex function with multiple logic blocks."""

    # Validation
    if not data:
        raise ValueError("Data is empty")

    # Processing
    results = []
    for item in data:
        try:
            processed = process_item(item)
            results.append(processed)
        except Exception as e:
            logger.error(f"Failed to process: {e}")
            continue

    # Aggregation
    total = sum(r.value for r in results)
    return total
'''
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(code, 'test.py', 'python')

        # Should produce 1 macro chunk and several micro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        micro_chunks = [c for c in chunks if c.metadata.level == 2]

        assert len(macro_chunks) == 1
        assert len(micro_chunks) > 0

        # Verify parent-child relationships
        for micro in micro_chunks:
            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id

7.2 Integration Tests

class TestHierarchicalIndexing:
    """测试完整的索引流程"""

    def test_index_and_search(self):
        """测试分层索引和检索"""

        # 1. Chunking
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')

        # 2. LLM enhancement
        enhancer = HierarchicalLLMEnhancer()
        metadata = enhancer.enhance_hierarchical_chunks(chunks)

        # 3. Embedding
        embedder = Embedder()
        for chunk in chunks:
            text = metadata[chunk.metadata.chunk_id].summary
            chunk.embedding = embedder.embed_single(text)

        # 4. Storage
        vector_store = HierarchicalVectorStore(Path('/tmp/test.db'))
        for chunk in chunks:
            vector_store.add_chunk(chunk)

        # 5. Retrieval
        search_engine = ContextualSearchEngine(vector_store, embedder)
        results = search_engine.search_with_context(
            "find loop that processes items",
            top_k=5
        )

        # Verify the results
        assert len(results) > 0
        assert any(r.metadata.get('parent_context') for r in results)

8. Performance Optimization

8.1 Batch Processing

class BatchHierarchicalProcessor:
    """批量处理多个文件的层级分词"""

    def process_files_batch(
        self,
        file_paths: List[Path],
        batch_size: int = 10
    ):
        """批量处理优化LLM调用"""

        all_chunks = []

        # 1. Chunk every file
        for file_path in file_paths:
            content = file_path.read_text()
            chunks = self.chunker.chunk_file(
                content, str(file_path), self._detect_language(file_path)
            )
            all_chunks.extend(chunks)

        # 2. Batch LLM enhancement to reduce API calls
        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
        for i in range(0, len(macro_chunks), batch_size):
            batch = macro_chunks[i:i+batch_size]
            self.enhancer.enhance_batch(batch)

        # 3. Batch embedding
        all_texts = [c.content for c in all_chunks]
        embeddings = self.embedder.embed_batch(all_texts)
        for chunk, embedding in zip(all_chunks, embeddings):
            chunk.embedding = embedding

        # 4. Batch storage
        self.vector_store.add_chunks_batch(all_chunks)
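
add_chunks_batch is called above but not defined on HierarchicalVectorStore. A minimal sketch using a single transaction and executemany:

class HierarchicalVectorStore:  # continued: batch insert assumed above
    def add_chunks_batch(self, chunks: List[HierarchicalChunk]) -> None:
        """Insert many chunks in one transaction to avoid per-row commits."""
        rows = [
            (
                c.metadata.chunk_id, c.metadata.parent_id, c.metadata.level,
                c.metadata.chunk_type, c.metadata.file_path,
                c.metadata.start_line, c.metadata.end_line,
                c.metadata.symbol_name, c.content,
                self._serialize_embedding(c.embedding),
            )
            for c in chunks
        ]
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany("""
                INSERT INTO chunks (
                    chunk_id, parent_id, level, chunk_type,
                    file_path, start_line, end_line, symbol_name,
                    content, embedding
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, rows)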

8.2 Incremental Updates

class IncrementalIndexer:
    """增量索引器:只处理变化的文件"""

    def update_file(self, file_path: Path):
        """增量更新单个文件"""

        content = file_path.read_text()
        content_hash = hashlib.sha256(content.encode()).hexdigest()

        # Check whether the file has changed
        cursor = self.conn.cursor()
        cursor.execute("""
            SELECT content_hash FROM chunks
            WHERE file_path = ? AND level = 1
            LIMIT 1
        """, (str(file_path),))

        row = cursor.fetchone()
        if row and row[0] == content_hash:
            logger.info(f"File {file_path} unchanged, skipping")
            return

        # Delete stale chunks
        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))

        # Re-index
        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
        # ... continue processing

9. Potential Problems and Solutions

9.1 Problem: Oversized Functions Produce Too Many Micro Chunks

Symptom: some legacy functions exceed 1,000 lines and may produce dozens of micro chunks.

Solution:

class AdaptiveMicroChunker:
    """自适应micro分词根据函数大小调整策略"""

    def chunk_logic_blocks(self, macro_chunk, content):
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line

        if total_lines > 500:
            # Oversized functions: extract only top-level logic blocks, no recursion (sketched below)
            return self._extract_top_level_blocks(macro_chunk, content)
        elif total_lines > 100:
            # Large functions: limit recursion depth to 2 levels
            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
        else:
            # Normal-sized functions: skip micro chunking entirely
            return []
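
A sketch of the non-recursive _extract_top_level_blocks path referenced above, assuming the same LOGIC_BLOCK_TYPES set and _create_micro_chunk helper as MicroChunker; only the direct statements of the function body are examined, nested blocks are ignored:

class AdaptiveMicroChunker:  # continued
    def _extract_top_level_blocks(self, macro_chunk, content):
        """Collect only the outermost logic blocks of a very large function."""
        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        blocks = []
        for definition in tree.root_node.children:
            body = definition.child_by_field_name('body') or definition
            for stmt in body.children:
                if stmt.type in MicroChunker.LOGIC_BLOCK_TYPES:
                    blocks.append(self._create_micro_chunk(stmt, macro_chunk, content))
        return blocks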

9.2 Problem: tree-sitter Parsing Failures

Symptom: tree-sitter may fail to parse code that contains syntax errors.

Solution:

def chunk_file_with_fallback(self, content, file_path, language):
    """带降级策略的分词"""

    try:
        # Try hierarchical chunking first
        return self.chunk_file(content, file_path, language)
    except TreeSitterError as e:
        logger.warning(f"Tree-sitter parsing failed: {e}")

        # Fall back to simple regex-based symbol extraction (sketched below)
        return self._fallback_regex_chunking(content, file_path)
    except Exception as e:
        logger.error(f"Chunking failed completely: {e}")

        # Final fallback: sliding window
        return self._fallback_sliding_window(content, file_path, language)
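
The regex-based fallback referenced above is not specified. For Python sources, a rough sketch that splits on top-level def/class lines (an illustrative heuristic only, not the project's actual fallback):

import re

def _fallback_regex_chunking(self, content: str, file_path: str) -> List[HierarchicalChunk]:
    """Very rough symbol extraction for when the AST is unavailable."""
    lines = content.splitlines()
    starts = [i for i, line in enumerate(lines)
              if re.match(r'^(def|class)\s+\w+', line)]
    chunks = []
    for idx, start in enumerate(starts):
        end = starts[idx + 1] if idx + 1 < len(starts) else len(lines)
        name_match = re.match(r'^(?:def|class)\s+(\w+)', lines[start])
        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start + 1}",
            parent_id=None,
            level=1,
            chunk_type='regex_symbol',
            file_path=file_path,
            start_line=start + 1,
            end_line=end,
            symbol_name=name_match.group(1) if name_match else None,
        )
        chunks.append(HierarchicalChunk(metadata=metadata,
                                        content='\n'.join(lines[start:end])))
    return chunks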

9.3 Problem: Vector Storage Footprint

Symptom: storing an embedding for every chunk can consume a large amount of space.

Solutions:

  • Selective vectorization: only generate embeddings for macro chunks and for important micro chunks (see SelectiveVectorization below)
  • Vector compression: reduce vector dimensionality with PCA or quantization
  • Separate storage: keep vectors in a dedicated vector database (e.g. Faiss) and store only metadata in SQLite (a sketch follows the SelectiveVectorization example)

class SelectiveVectorization:
    """选择性向量化:减少存储开销"""

    VECTORIZE_CHUNK_TYPES = {
        'function_definition',   # Always vectorized
        'class_definition',      # Always vectorized
        'for_statement',         # Loop blocks
        'try_statement',         # Exception handling
        # 'if_statement' is usually not vectorized on its own; rely on the parent chunk
    }

    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
        """判断是否需要为chunk生成向量"""

        # Level 1: always vectorize
        if chunk.metadata.level == 1:
            return True

        # Level 2: decide based on type and size
        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
            return False

        # Blocks that are too small (< 5 lines) are not vectorized
        lines = chunk.metadata.end_line - chunk.metadata.start_line
        if lines < 5:
            return False

        return True
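
For the "separate storage" option listed above, a sketch of pairing a Faiss index with the SQLite metadata table, assuming the faiss package is available and embeddings are float32 (the id-mapping scheme here is an assumption):

import faiss
import numpy as np
from typing import List, Tuple

class FaissChunkIndex:
    """Keep embeddings in Faiss; SQLite keeps only chunk metadata."""

    def __init__(self, dim: int):
        self.index = faiss.IndexFlatIP(dim)  # inner product ~ cosine once vectors are L2-normalized
        self.chunk_ids: List[str] = []       # position i in the index maps to chunk_ids[i]

    def add(self, chunk_id: str, embedding: List[float]) -> None:
        vec = np.asarray([embedding], dtype=np.float32)
        faiss.normalize_L2(vec)
        self.index.add(vec)
        self.chunk_ids.append(chunk_id)

    def search(self, query_embedding: List[float], top_k: int = 10) -> List[Tuple[str, float]]:
        vec = np.asarray([query_embedding], dtype=np.float32)
        faiss.normalize_L2(vec)
        scores, positions = self.index.search(vec, top_k)
        return [(self.chunk_ids[p], float(s))
                for p, s in zip(positions[0], scores[0]) if p != -1]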

10. Implementation Roadmap

Phase 1: Core Infrastructure (2-3 weeks)

  • Design the data structures (HierarchicalChunk, ChunkMetadata)
  • Implement MacroChunker (reusing the existing code_extractor)
  • Implement a basic MicroChunker
  • Design the database schema and write the migration
  • Unit tests

Phase 2: LLM Integration (1-2 weeks)

  • Implement HierarchicalLLMEnhancer
  • Design the layered prompt templates
  • Batch-processing optimizations
  • Integration tests

Phase 3: Embedding and Retrieval (1-2 weeks)

  • Implement HierarchicalVectorStore
  • Implement ContextualSearchEngine
  • Context-expansion logic
  • Retrieval performance testing

Phase 4: Optimization and Hardening (2 weeks)

  • Performance optimization (batch processing, incremental updates)
  • Complete the fallback strategies
  • Selective vectorization
  • Comprehensive testing and documentation

Phase 5: Production Rollout (1 week)

  • CLI integration
  • Expose configuration options
  • Production-environment testing
  • Release

Total estimated time: 7-10 weeks

11. Success Metrics

  1. Coverage: 95%+ of the code is chunked correctly
  2. Accuracy: parent-child relationship accuracy > 98%
  3. Retrieval quality: retrieval relevance improves by 30%+ over single-level chunking
  4. Performance: single-file chunking < 100 ms; batch throughput > 100 files/minute
  5. Storage efficiency: 40%+ less space than vectorizing everything

12. References