# Multi-Level Chunker Design

## 1. Background and Goals

### 1.1 Current Problems

The two chunking strategies in the current `chunker.py` have clear shortcomings:

**symbol-based strategy**
- ✅ Pros: preserves the logical integrity of the code; each chunk is a complete function/class
- ❌ Cons: uneven granularity; very large functions can reach hundreds of lines, which hurts LLM processing and search precision

**sliding-window strategy**
- ✅ Pros: uniform chunk sizes and full coverage
- ❌ Cons: breaks the logical structure; a complete loop or conditional block may be cut in half
### 1.2 Design Goals

Implement a multi-level chunker that satisfies all of the following:

1. **Semantic integrity**: preserve the boundaries of logical code units
2. **Controllable granularity**: support flexible splitting, from coarse (function level) to fine (logic-block level)
3. **Hierarchy**: keep parent-child relationships between chunks to enable contextual retrieval
4. **Efficient indexing**: optimize embedding and retrieval performance
## 2. Technical Architecture

### 2.1 Two-Layer Chunking Architecture

```
Source Code
    │
    ▼
[Layer 1: Symbol-Level Chunking]   ← tree-sitter AST
    │
    ▼
MacroChunks (Functions/Classes)
    │
    ▼
[Layer 2: Logic-Block Chunking]    ← deep AST traversal
    │
    ▼
MicroChunks (Loops/Conditionals/Blocks)
    │
    ▼
Vector Embedding + Indexing
```
### 2.2 Core Components

```python
# New data structures
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkMetadata:
    """Metadata for a single chunk."""
    chunk_id: str
    parent_id: Optional[str]                # parent chunk ID
    level: int                              # level: 1 = macro, 2 = micro
    chunk_type: str                         # function/class/loop/conditional/try_except
    file_path: str
    start_line: int
    end_line: int
    symbol_name: Optional[str]
    context_summary: Optional[str] = None   # context inherited from the parent chunk


@dataclass
class HierarchicalChunk:
    """A hierarchical code chunk."""
    metadata: ChunkMetadata
    content: str
    embedding: Optional[List[float]] = None
    children: List['HierarchicalChunk'] = field(default_factory=list)
```
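
The unified interface in section 3.3 also takes a `ChunkConfig`, which this document never spells out. A hypothetical sketch of the knobs it could expose, reusing the thresholds mentioned later (50-line split threshold, 500-line cap, 5-line vectorization minimum); field names and defaults are illustrative only:

```python
@dataclass
class ChunkConfig:
    """Hypothetical chunker configuration; field names are illustrative."""
    micro_split_threshold: int = 50   # macro chunks longer than this get a second pass
    adaptive_cap_lines: int = 500     # above this, only top-level logic blocks are extracted
    min_vectorize_lines: int = 5      # micro chunks smaller than this are not vectorized
    languages: List[str] = field(default_factory=lambda: ["python"])
```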
## 3. Implementation Details

### 3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)

**Approach**: reuse the existing `code_extractor.py` logic and enrich the extracted metadata.
```python
class MacroChunker:
    """Layer-1 chunker: extracts top-level symbols."""

    def __init__(self):
        self.parser = Parser()
        # Load the language grammar here (see the sketch after this block)

    def chunk_by_symbols(
        self,
        content: str,
        file_path: str,
        language: str,
    ) -> List[HierarchicalChunk]:
        """Extract top-level function and class definitions."""
        tree = self.parser.parse(bytes(content, 'utf-8'))
        root_node = tree.root_node

        chunks = []
        for node in root_node.children:
            if node.type in ('function_definition', 'class_definition',
                             'method_definition'):
                chunk = self._create_macro_chunk(node, content, file_path)
                chunks.append(chunk)
        return chunks

    def _create_macro_chunk(
        self,
        node,
        content: str,
        file_path: str,
    ) -> HierarchicalChunk:
        """Create a macro chunk from an AST node."""
        start_line = node.start_point[0] + 1
        end_line = node.end_point[0] + 1

        # Extract the symbol name (guard against nodes without a name field)
        name_node = node.child_by_field_name('name')
        symbol_name = (
            content[name_node.start_byte:name_node.end_byte] if name_node else None
        )

        # Extract the full code, including decorators and docstring
        chunk_content = self._extract_with_context(node, content)

        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start_line}",
            parent_id=None,
            level=1,
            chunk_type=node.type,
            file_path=file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=symbol_name,
        )
        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )

    def _extract_with_context(self, node, content: str) -> str:
        """Extract code including decorators and docstring."""
        # Walk backwards over any preceding decorator siblings
        start_byte = node.start_byte
        prev_sibling = node.prev_sibling
        while prev_sibling and prev_sibling.type == 'decorator':
            start_byte = prev_sibling.start_byte
            prev_sibling = prev_sibling.prev_sibling
        return content[start_byte:node.end_byte]
```
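
The `__init__` above only hints at grammar loading. A minimal sketch of one way to obtain per-language parsers, assuming the third-party `tree_sitter_languages` package is available; the existing `code_extractor.py` may already provide equivalent loading:

```python
from tree_sitter_languages import get_parser


class ParserCache:
    """Caches one tree-sitter parser per language (sketch)."""

    def __init__(self):
        self._parsers = {}

    def get(self, language: str):
        # Lazily create and cache a pre-configured parser for the language.
        if language not in self._parsers:
            self._parsers[language] = get_parser(language)
        return self._parsers[language]
```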
### 3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)

**Approach**: further split each macro chunk along its internal logical structure.
```python
class MicroChunker:
    """Layer-2 chunker: extracts logic blocks."""

    # Node types that become their own micro chunks
    LOGIC_BLOCK_TYPES = {
        'for_statement',
        'while_statement',
        'if_statement',
        'try_statement',
        'with_statement',
    }

    def __init__(self):
        self.parser = Parser()
        # Load the language grammar, as in MacroChunker

    def chunk_logic_blocks(
        self,
        macro_chunk: HierarchicalChunk,
        content: str,         # full file text; offsets are resolved against macro_chunk.content
        max_lines: int = 50,  # only macro chunks longer than this are split again
    ) -> List[HierarchicalChunk]:
        """Extract logic blocks inside a macro chunk."""
        # Small functions do not need a second pass
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
        if total_lines <= max_lines:
            return []

        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        root_node = tree.root_node

        micro_chunks = []
        self._traverse_logic_blocks(root_node, macro_chunk, micro_chunks)
        return micro_chunks

    def _traverse_logic_blocks(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        result: List[HierarchicalChunk],
    ):
        """Recursively walk the AST and collect logic blocks."""
        if node.type in self.LOGIC_BLOCK_TYPES:
            micro_chunk = self._create_micro_chunk(node, parent_chunk)
            result.append(micro_chunk)
            parent_chunk.children.append(micro_chunk)

        # Keep walking into child nodes
        for child in node.children:
            self._traverse_logic_blocks(child, parent_chunk, result)

    def _create_micro_chunk(
        self,
        node,
        parent_chunk: HierarchicalChunk,
    ) -> HierarchicalChunk:
        """Create a micro chunk."""
        # File-relative line numbers; node positions are relative to the
        # parent chunk's content, which starts at the parent's start_line.
        start_line = parent_chunk.metadata.start_line + node.start_point[0]
        end_line = parent_chunk.metadata.start_line + node.end_point[0]

        # Byte offsets refer to the parent chunk's content, not the whole file
        chunk_content = parent_chunk.content[node.start_byte:node.end_byte]

        metadata = ChunkMetadata(
            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
            parent_id=parent_chunk.metadata.chunk_id,
            level=2,
            chunk_type=node.type,
            file_path=parent_chunk.metadata.file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=parent_chunk.metadata.symbol_name,  # inherit the parent's symbol name
            context_summary=None,  # filled in later by the LLM
        )
        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )
```
### 3.3 Unified Interface: the Multi-Level Chunker
```python
class HierarchicalChunker:
    """Unified interface for multi-level chunking."""

    def __init__(self, config: Optional[ChunkConfig] = None):
        self.config = config or ChunkConfig()
        self.macro_chunker = MacroChunker()
        self.micro_chunker = MicroChunker()

    def chunk_file(
        self,
        content: str,
        file_path: str,
        language: str,
    ) -> List[HierarchicalChunk]:
        """Run multi-level chunking on a single file."""
        # Layer 1: symbol-level chunking
        macro_chunks = self.macro_chunker.chunk_by_symbols(
            content, file_path, language
        )

        # Layer 2: logic-block chunking
        all_chunks = []
        for macro_chunk in macro_chunks:
            all_chunks.append(macro_chunk)
            # Re-split large functions
            micro_chunks = self.micro_chunker.chunk_logic_blocks(
                macro_chunk, content
            )
            all_chunks.extend(micro_chunks)
        return all_chunks

    def chunk_file_with_fallback(
        self,
        content: str,
        file_path: str,
        language: str,
    ) -> List[HierarchicalChunk]:
        """Chunking with a fallback strategy."""
        try:
            return self.chunk_file(content, file_path, language)
        except Exception as e:
            logger.warning(
                f"Hierarchical chunking failed: {e}, falling back to sliding window"
            )
            # Fall back to the sliding-window strategy (sketch after this block)
            return self._fallback_sliding_window(content, file_path, language)
```
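
`_fallback_sliding_window` is referenced but never defined in this document. A minimal sketch of what it could look like, assuming fixed-size line windows with overlap (the 60/10 window sizes are illustrative):

```python
def _fallback_sliding_window(
    self,
    content: str,
    file_path: str,
    language: str,
    window: int = 60,
    overlap: int = 10,
) -> List[HierarchicalChunk]:
    """Fallback: fixed-size line windows with overlap (sketch)."""
    lines = content.splitlines()
    if not lines:
        return []

    chunks = []
    step = window - overlap
    for start in range(0, len(lines), step):
        end = min(start + window, len(lines))
        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start + 1}",
            parent_id=None,
            level=1,
            chunk_type="window",
            file_path=file_path,
            start_line=start + 1,
            end_line=end,
            symbol_name=None,
        )
        chunks.append(
            HierarchicalChunk(metadata=metadata, content="\n".join(lines[start:end]))
        )
        if end == len(lines):
            break
    return chunks
```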
## 4. Data Storage Design

### 4.1 Database Schema
```sql
-- chunks table: stores chunks at every level
CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY,
    parent_id TEXT,                 -- parent chunk ID (NULL for top-level chunks)
    level INTEGER NOT NULL,         -- 1 = macro, 2 = micro
    chunk_type TEXT NOT NULL,       -- function/class/loop/if/try, etc.
    file_path TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    symbol_name TEXT,
    content TEXT NOT NULL,
    content_hash TEXT,              -- used to detect content changes

    -- Semantic metadata (generated by the LLM)
    summary TEXT,
    keywords TEXT,                  -- JSON array
    purpose TEXT,

    -- Vector embedding
    embedding BLOB,                 -- serialized vector

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
);

-- Index optimization
CREATE INDEX idx_chunks_file_path ON chunks(file_path);
CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
CREATE INDEX idx_chunks_level ON chunks(level);
CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);
```
### 4.2 Vector Index

A layered indexing strategy is used:
```python
class HierarchicalVectorStore:
    """Hierarchical vector store."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)

    def add_chunk(self, chunk: HierarchicalChunk):
        """Insert a chunk together with its embedding."""
        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            chunk.metadata.chunk_id,
            chunk.metadata.parent_id,
            chunk.metadata.level,
            chunk.metadata.chunk_type,
            chunk.metadata.file_path,
            chunk.metadata.start_line,
            chunk.metadata.end_line,
            chunk.metadata.symbol_name,
            chunk.content,
            self._serialize_embedding(chunk.embedding),
        ))
        self.conn.commit()

    def search_hierarchical(
        self,
        query_embedding: List[float],
        top_k: int = 10,
        level_weights: Dict[int, float] = None,
    ) -> List[Tuple[HierarchicalChunk, float]]:
        """Hierarchical search."""
        # Default weights: macro chunks are weighted higher
        if level_weights is None:
            level_weights = {1: 1.0, 2: 0.8}

        # Scan all chunks that have an embedding
        cursor = self.conn.cursor()
        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")

        results = []
        for row in cursor.fetchall():
            chunk = self._row_to_chunk(row)
            similarity = self._cosine_similarity(query_embedding, chunk.embedding)
            # Apply the per-level weight
            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
            results.append((chunk, weighted_score))

        # Sort by score
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

    def get_chunk_with_context(
        self,
        chunk_id: str,
    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
        """Fetch a chunk plus its parent chunk (for context)."""
        cursor = self.conn.cursor()

        # The chunk itself
        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
        chunk_row = cursor.fetchone()
        chunk = self._row_to_chunk(chunk_row)

        # Its parent chunk, if any
        parent = None
        if chunk.metadata.parent_id:
            cursor.execute(
                "SELECT * FROM chunks WHERE chunk_id = ?",
                (chunk.metadata.parent_id,),
            )
            parent_row = cursor.fetchone()
            if parent_row:
                parent = self._row_to_chunk(parent_row)
        return chunk, parent
```
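
`_row_to_chunk`, `_serialize_embedding`, and `_cosine_similarity` are left unspecified above. A minimal sketch of the two numeric helpers, assuming embeddings are packed as float32 bytes with NumPy:

```python
import numpy as np

def _serialize_embedding(self, embedding: Optional[List[float]]) -> Optional[bytes]:
    """Pack an embedding into a float32 byte blob for the BLOB column."""
    if embedding is None:
        return None
    return np.asarray(embedding, dtype=np.float32).tobytes()

def _deserialize_embedding(self, blob: Optional[bytes]) -> Optional[List[float]]:
    """Unpack a float32 byte blob back into a list of floats."""
    if blob is None:
        return None
    return np.frombuffer(blob, dtype=np.float32).tolist()

def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors."""
    va = np.asarray(a, dtype=np.float32)
    vb = np.asarray(b, dtype=np.float32)
    denom = float(np.linalg.norm(va) * np.linalg.norm(vb))
    return float(va @ vb) / denom if denom else 0.0
```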
## 5. LLM Integration Strategy

### 5.1 Layered Generation of Semantic Metadata
````python
class HierarchicalLLMEnhancer:
    """Generates semantic metadata for hierarchical chunks."""

    def enhance_hierarchical_chunks(
        self,
        chunks: List[HierarchicalChunk],
    ) -> Dict[str, SemanticMetadata]:
        """
        Layered processing strategy:
        1. Process all level=1 macro chunks first and generate detailed summaries.
        2. Then process level=2 micro chunks, using the parent chunk's summary as context.
        """
        results = {}

        # Round 1: macro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        macro_metadata = self.llm_enhancer.enhance_files([
            FileData(
                path=c.metadata.chunk_id,
                content=c.content,
                language=self._detect_language(c.metadata.file_path),
            )
            for c in macro_chunks
        ])
        results.update(macro_metadata)

        # Round 2: micro chunks, with parent context
        micro_chunks = [c for c in chunks if c.metadata.level == 2]
        for micro_chunk in micro_chunks:
            parent_id = micro_chunk.metadata.parent_id
            parent_summary = macro_metadata.get(parent_id, {}).get('summary', '')

            # Build a prompt that carries the parent context
            enhanced_prompt = f"""
Parent Function: {micro_chunk.metadata.symbol_name}
Parent Summary: {parent_summary}

Code Block ({micro_chunk.metadata.chunk_type}):
```
{micro_chunk.content}
```

Generate a concise summary (1 sentence) and keywords for this specific code block.
"""
            metadata = self._call_llm_with_context(enhanced_prompt)
            results[micro_chunk.metadata.chunk_id] = metadata

        return results
````
### 5.2 Prompt Optimization

Different prompt templates are used for each level:

**Macro Chunk Prompt (Level 1)**:
````
PURPOSE: Generate comprehensive semantic metadata for a complete function/class

TASK:
- Provide a detailed summary (2-3 sentences) covering what the code does and why
- Extract 8-12 relevant keywords including technical terms and domain concepts
- Identify the primary purpose/category

MODE: analysis

CODE:
```{language}
{content}
```

OUTPUT: JSON with summary, keywords, purpose
````
**Micro Chunk Prompt (Level 2)**:
````
PURPOSE: Summarize a specific logic block within a larger function

CONTEXT:
- Parent Function: {symbol_name}
- Parent Purpose: {parent_summary}

TASK:
- Provide a brief summary (1 sentence) of this specific block's role in the parent function
- Extract 3-5 keywords specific to this block's logic

MODE: analysis

CODE BLOCK ({chunk_type}):
```{language}
{content}
```

OUTPUT: JSON with summary, keywords
````
## 6. Retrieval Enhancement

### 6.1 Context-Expanded Retrieval
```python
class ContextualSearchEngine:
    """Search engine with context expansion."""

    def search_with_context(
        self,
        query: str,
        top_k: int = 10,
        expand_context: bool = True,
    ) -> List[SearchResult]:
        """
        Search and automatically expand context.
        If a micro chunk matches, its parent macro chunk is returned as context.
        """
        # Embed the query
        query_embedding = self.embedder.embed_single(query)

        # Hierarchical retrieval
        raw_results = self.vector_store.search_hierarchical(
            query_embedding,
            top_k=top_k,
        )

        # Expand context
        enriched_results = []
        for chunk, score in raw_results:
            result = SearchResult(
                path=chunk.metadata.file_path,
                score=score,
                content=chunk.content,
                start_line=chunk.metadata.start_line,
                end_line=chunk.metadata.end_line,
                symbol_name=chunk.metadata.symbol_name,
            )

            # For micro chunks, fetch the parent chunk as context
            if expand_context and chunk.metadata.level == 2:
                _, parent_chunk = self.vector_store.get_chunk_with_context(
                    chunk.metadata.chunk_id
                )
                if parent_chunk:
                    result.metadata['parent_context'] = {
                        'summary': parent_chunk.metadata.context_summary,
                        'symbol_name': parent_chunk.metadata.symbol_name,
                        'content': parent_chunk.content,
                    }
            enriched_results.append(result)
        return enriched_results
```
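
The engine above assumes a `SearchResult` with a mutable `metadata` dict and a constructor that receives the vector store and embedder (as used by the integration test in section 7.2). If the existing `SearchResult` in the codebase differs, treat these fields as illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SearchResult:
    """Assumed result shape; fields mirror how it is used above."""
    path: str
    score: float
    content: str
    start_line: int
    end_line: int
    symbol_name: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


class ContextualSearchEngine:
    # Constructor assumed by the integration test; the method body is in the block above.
    def __init__(self, vector_store: HierarchicalVectorStore, embedder):
        self.vector_store = vector_store
        self.embedder = embedder
```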
## 7. Testing Strategy

### 7.1 Unit Tests
```python
import pytest
from codexlens.semantic.hierarchical_chunker import (
    HierarchicalChunker, MacroChunker, MicroChunker
)


class TestMacroChunker:
    """Tests for layer-1 chunking."""

    def test_extract_functions(self):
        """Extract function definitions."""
        code = '''
def calculate_total(items):
    """Calculate total price."""
    total = 0
    for item in items:
        total += item.price
    return total

def apply_discount(total, discount):
    """Apply discount to total."""
    return total * (1 - discount)
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 2
        assert chunks[0].metadata.symbol_name == 'calculate_total'
        assert chunks[1].metadata.symbol_name == 'apply_discount'
        assert chunks[0].metadata.level == 1

    def test_extract_with_decorators(self):
        """Extract a function together with its decorators."""
        code = '''
@app.route('/api/users')
@auth_required
def get_users():
    return User.query.all()
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 1
        assert '@app.route' in chunks[0].content
        assert '@auth_required' in chunks[0].content


class TestMicroChunker:
    """Tests for layer-2 chunking."""

    def test_extract_loop_blocks(self):
        """Extract loop blocks."""
        code = '''
def process_items(items):
    results = []
    for item in items:
        if item.active:
            results.append(process(item))
    return results
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        # Force a second pass even though the sample is short
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=0
        )

        # The for loop and the if block should be extracted
        assert len(micro_chunks) >= 1
        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)

    def test_skip_small_functions(self):
        """Small functions are not re-split."""
        code = '''
def small_func(x):
    return x * 2
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=10
        )

        # Small functions should not be split a second time
        assert len(micro_chunks) == 0


class TestHierarchicalChunker:
    """Tests for the full multi-level chunker."""

    def test_full_hierarchical_chunking(self):
        """End-to-end hierarchical chunking."""
        code = '''
def complex_function(data):
    """A complex function with multiple logic blocks."""
    # Validation
    if not data:
        raise ValueError("Data is empty")

    # Processing
    results = []
    for item in data:
        try:
            processed = process_item(item)
            results.append(processed)
        except Exception as e:
            logger.error(f"Failed to process: {e}")
            continue

    # Aggregation
    total = sum(r.value for r in results)
    return total
'''
        chunker = HierarchicalChunker()
        # Assumes the micro-split threshold is configured below this sample's length
        chunks = chunker.chunk_file(code, 'test.py', 'python')

        # Expect 1 macro chunk and several micro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        micro_chunks = [c for c in chunks if c.metadata.level == 2]

        assert len(macro_chunks) == 1
        assert len(micro_chunks) > 0

        # Check parent-child relationships
        for micro in micro_chunks:
            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id
```
### 7.2 Integration Tests
```python
class TestHierarchicalIndexing:
    """End-to-end indexing pipeline test."""

    def test_index_and_search(self):
        """Hierarchical indexing and retrieval."""
        # 1. Chunking
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')

        # 2. LLM enhancement
        enhancer = HierarchicalLLMEnhancer()
        metadata = enhancer.enhance_hierarchical_chunks(chunks)

        # 3. Embedding
        embedder = Embedder()
        for chunk in chunks:
            text = metadata[chunk.metadata.chunk_id].summary
            chunk.embedding = embedder.embed_single(text)

        # 4. Storage
        vector_store = HierarchicalVectorStore(Path('/tmp/test.db'))
        for chunk in chunks:
            vector_store.add_chunk(chunk)

        # 5. Retrieval
        search_engine = ContextualSearchEngine(vector_store, embedder)
        results = search_engine.search_with_context(
            "find loop that processes items",
            top_k=5,
        )

        # Verify the results
        assert len(results) > 0
        assert any(r.metadata.get('parent_context') for r in results)
```
## 8. Performance Optimization

### 8.1 Batch Processing
```python
class BatchHierarchicalProcessor:
    """Hierarchical chunking for many files in batches."""

    def process_files_batch(
        self,
        file_paths: List[Path],
        batch_size: int = 10,
    ):
        """Batch processing to reduce LLM calls."""
        all_chunks = []

        # 1. Chunk every file
        for file_path in file_paths:
            content = file_path.read_text()
            chunks = self.chunker.chunk_file(
                content, str(file_path), self._detect_language(file_path)
            )
            all_chunks.extend(chunks)

        # 2. Batched LLM enhancement (fewer API calls)
        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
        for i in range(0, len(macro_chunks), batch_size):
            batch = macro_chunks[i:i + batch_size]
            self.enhancer.enhance_batch(batch)

        # 3. Batched embedding
        all_texts = [c.content for c in all_chunks]
        embeddings = self.embedder.embed_batch(all_texts)
        for chunk, embedding in zip(all_chunks, embeddings):
            chunk.embedding = embedding

        # 4. Batched storage
        self.vector_store.add_chunks_batch(all_chunks)
```
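
`add_chunks_batch` is not part of the `HierarchicalVectorStore` shown in section 4.2; a minimal sketch, assuming it reuses the same column order as `add_chunk` and wraps the inserts in a single transaction:

```python
def add_chunks_batch(self, chunks: List[HierarchicalChunk]):
    """Insert many chunks in one transaction (sketch)."""
    rows = [
        (
            c.metadata.chunk_id,
            c.metadata.parent_id,
            c.metadata.level,
            c.metadata.chunk_type,
            c.metadata.file_path,
            c.metadata.start_line,
            c.metadata.end_line,
            c.metadata.symbol_name,
            c.content,
            self._serialize_embedding(c.embedding),
        )
        for c in chunks
    ]
    with self.conn:  # commits once for the whole batch
        self.conn.executemany("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, rows)
```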
### 8.2 Incremental Updates
```python
class IncrementalIndexer:
    """Incremental indexer: only re-processes files that changed."""

    def update_file(self, file_path: Path):
        """Incrementally update a single file."""
        content = file_path.read_text()
        content_hash = hashlib.sha256(content.encode()).hexdigest()

        # Check whether the file changed
        cursor = self.conn.cursor()
        cursor.execute("""
            SELECT content_hash FROM chunks
            WHERE file_path = ? AND level = 1
            LIMIT 1
        """, (str(file_path),))

        row = cursor.fetchone()
        if row and row[0] == content_hash:
            logger.info(f"File {file_path} unchanged, skipping")
            return

        # Delete the old chunks
        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))

        # Re-index
        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
        # ... continue processing
```
## 9. Potential Problems and Mitigations

### 9.1 Problem: Too Many Micro Chunks for Very Large Functions

**Symptom**: some legacy functions exceed 1,000 lines and can produce dozens of micro chunks.

**Mitigation**:
```python
class AdaptiveMicroChunker:
    """Adaptive micro chunking: adjusts the strategy to the function size."""

    def chunk_logic_blocks(self, macro_chunk, content):
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line

        if total_lines > 500:
            # Very large function: only extract top-level logic blocks, no recursion
            return self._extract_top_level_blocks(macro_chunk, content)
        elif total_lines > 100:
            # Large function: limit the recursion depth to 2 block levels
            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
        else:
            # Normal function: skip micro chunking entirely
            return []
```
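
Neither helper above is specified. A minimal sketch of both, assuming `AdaptiveMicroChunker` subclasses `MicroChunker` so it can reuse the parser and `_create_micro_chunk`, and counting depth in terms of nested logic blocks:

```python
def _extract_blocks_with_depth_limit(
    self,
    macro_chunk: HierarchicalChunk,
    content: str,
    max_depth: int = 2,
) -> List[HierarchicalChunk]:
    """Collect logic blocks, descending at most max_depth block levels (sketch)."""
    tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
    result: List[HierarchicalChunk] = []

    def walk(node, depth: int):
        is_block = node.type in MicroChunker.LOGIC_BLOCK_TYPES
        if is_block:
            if depth >= max_depth:
                return  # too deeply nested: stop descending here
            micro = self._create_micro_chunk(node, macro_chunk)
            result.append(micro)
            macro_chunk.children.append(micro)
        for child in node.children:
            walk(child, depth + (1 if is_block else 0))

    walk(tree.root_node, 0)
    return result

def _extract_top_level_blocks(self, macro_chunk, content):
    """Only the outermost logic blocks: equivalent to a depth limit of 1."""
    return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=1)
```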
### 9.2 Problem: tree-sitter Parsing Failures

**Symptom**: tree-sitter may fail to parse code that contains syntax errors.

**Mitigation**:
```python
def chunk_file_with_fallback(self, content, file_path, language):
    """Chunking with a layered fallback strategy."""
    try:
        # Try hierarchical chunking first
        return self.chunk_file(content, file_path, language)
    except TreeSitterError as e:
        logger.warning(f"Tree-sitter parsing failed: {e}")
        # Fall back to simple regex-based symbol extraction
        return self._fallback_regex_chunking(content, file_path)
    except Exception as e:
        logger.error(f"Chunking failed completely: {e}")
        # Final fallback: sliding window
        return self._fallback_sliding_window(content, file_path, language)
```
### 9.3 Problem: Vector Storage Footprint

**Symptom**: storing an embedding for every chunk can consume a lot of space.

**Mitigations**:

- **Selective vectorization**: only generate embeddings for macro chunks and the important micro chunks
- **Vector compression**: reduce dimensionality with PCA or quantization
- **Separate storage**: keep vectors in a dedicated vector store (e.g. Faiss) and store only metadata in SQLite
```python
class SelectiveVectorization:
    """Selective vectorization: reduces storage overhead."""

    VECTORIZE_CHUNK_TYPES = {
        'function_definition',  # always vectorized
        'class_definition',     # always vectorized
        'for_statement',        # loop blocks
        'try_statement',        # exception handling
        # 'if_statement' is usually not vectorized on its own; rely on the parent chunk
    }

    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
        """Decide whether a chunk should get its own embedding."""
        # Level 1: always vectorize
        if chunk.metadata.level == 1:
            return True

        # Level 2: decide by type and size
        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
            return False

        # Skip very small blocks (< 5 lines)
        lines = chunk.metadata.end_line - chunk.metadata.start_line
        if lines < 5:
            return False
        return True
```
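
A short usage sketch showing how this filter could slot into the batch pipeline from section 8.1 (`selector`, `embedder`, and `all_chunks` are illustrative names):

```python
selector = SelectiveVectorization()

# Only embed the chunks that pass the filter; the rest keep embedding=None
to_embed = [c for c in all_chunks if selector.should_vectorize(c)]
embeddings = embedder.embed_batch([c.content for c in to_embed])
for chunk, embedding in zip(to_embed, embeddings):
    chunk.embedding = embedding
```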
## 10. Implementation Roadmap

### Phase 1: Core Infrastructure (2-3 weeks)

- [x] Design the data structures (HierarchicalChunk, ChunkMetadata)
- [ ] Implement MacroChunker (reusing the existing code_extractor)
- [ ] Implement a basic MicroChunker
- [ ] Database schema design and migration
- [ ] Unit tests

### Phase 2: LLM Integration (1-2 weeks)

- [ ] Implement HierarchicalLLMEnhancer
- [ ] Design the per-level prompt templates
- [ ] Batch-processing optimization
- [ ] Integration tests

### Phase 3: Vectorization and Retrieval (1-2 weeks)

- [ ] Implement HierarchicalVectorStore
- [ ] Implement ContextualSearchEngine
- [ ] Context-expansion logic
- [ ] Retrieval performance tests

### Phase 4: Optimization and Hardening (2 weeks)

- [ ] Performance optimization (batch processing, incremental updates)
- [ ] Complete the fallback strategies
- [ ] Selective vectorization
- [ ] Full test coverage and documentation

### Phase 5: Production Rollout (1 week)

- [ ] CLI integration
- [ ] Expose configuration options
- [ ] Production environment testing
- [ ] Release

**Total estimated time**: 7-10 weeks
## 11. Success Metrics

1. **Coverage**: more than 95% of code is chunked correctly
2. **Accuracy**: parent-child relationships are correct for >98% of chunks
3. **Retrieval quality**: retrieval relevance improves by 30%+ over single-level chunking
4. **Performance**: <100 ms to chunk a single file; batch throughput >100 files/minute
5. **Storage efficiency**: 40%+ less space than vectorizing every chunk
## 12. References
- [Tree-sitter Documentation](https://tree-sitter.github.io/)
- [AST-based Code Analysis](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
- [Hierarchical Text Segmentation](https://arxiv.org/abs/2104.08836)
- Existing code: `src/codexlens/semantic/chunker.py`