# Multi-Level Chunker Design

## 1. Background and Goals

### 1.1 Current Problems

The two chunking strategies in the current `chunker.py` have clear shortcomings:

**symbol-based strategy**:
- ✅ Pros: preserves logical integrity; each chunk is a complete function/class
- ❌ Cons: uneven granularity; very large functions can run to hundreds of lines, hurting LLM processing and search precision

**sliding-window strategy**:
- ✅ Pros: uniform chunk sizes, complete coverage
- ❌ Cons: breaks logical structure; complete loop/conditional blocks can be cut apart

### 1.2 Design Goals

Implement a multi-level chunker that satisfies all of the following:

1. **Semantic integrity**: preserve the boundaries of logical code units
2. **Controllable granularity**: support flexible splitting from coarse (function level) to fine (logic-block level)
3. **Hierarchical relationships**: keep parent-child relations between chunks to support contextual retrieval
4. **Efficient indexing**: optimize embedding and retrieval performance
## 2. Technical Architecture

### 2.1 Two-Layer Chunking Architecture

```
Source Code
    ↓
[Layer 1: Symbol-Level Chunking]   ← uses the tree-sitter AST
    ↓
MacroChunks (Functions/Classes)
    ↓
[Layer 2: Logic-Block Chunking]    ← deep AST traversal
    ↓
MicroChunks (Loops/Conditionals/Blocks)
    ↓
Vector Embedding + Indexing
```
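For concreteness, a hypothetical end-to-end call against the unified interface sketched in Section 3.3; the driver below is illustrative, not part of the existing codebase, and `HierarchicalChunker` is defined later in this document:

```python
from pathlib import Path

# Hypothetical driver: chunk one file and print the resulting hierarchy.
source = Path("src/codexlens/semantic/chunker.py")   # any source file under the index
chunker = HierarchicalChunker()                       # defined in Section 3.3
chunks = chunker.chunk_file(source.read_text(), str(source), "python")

for chunk in chunks:
    indent = "  " * (chunk.metadata.level - 1)
    print(f"{indent}[level {chunk.metadata.level}] {chunk.metadata.chunk_type} "
          f"{chunk.metadata.symbol_name} "
          f"(lines {chunk.metadata.start_line}-{chunk.metadata.end_line})")
```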
### 2.2 Core Components

```python
# New data structures
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkMetadata:
    """Chunk metadata."""
    chunk_id: str
    parent_id: Optional[str]      # parent chunk ID
    level: int                    # hierarchy level: 1=macro, 2=micro
    chunk_type: str               # function/class/loop/conditional/try_except
    file_path: str
    start_line: int
    end_line: int
    symbol_name: Optional[str] = None
    context_summary: Optional[str] = None   # context inherited from the parent chunk


@dataclass
class HierarchicalChunk:
    """A hierarchical code chunk."""
    metadata: ChunkMetadata
    content: str
    embedding: Optional[List[float]] = None
    children: List['HierarchicalChunk'] = field(default_factory=list)
```
## 3. Detailed Implementation Steps

### 3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)

**Approach**: reuse the existing `code_extractor.py` logic and enrich the extracted metadata.

```python
from tree_sitter import Parser


class MacroChunker:
    """Layer-1 chunker: extracts top-level symbols."""

    def __init__(self):
        self.parser = Parser()
        # load the language grammar here

    def chunk_by_symbols(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Extract top-level function and class definitions."""
        tree = self.parser.parse(bytes(content, 'utf-8'))
        root_node = tree.root_node

        chunks = []
        for node in root_node.children:
            if node.type in ['function_definition', 'class_definition',
                             'method_definition']:
                chunk = self._create_macro_chunk(node, content, file_path)
                chunks.append(chunk)

        return chunks

    def _create_macro_chunk(
        self,
        node,
        content: str,
        file_path: str
    ) -> HierarchicalChunk:
        """Create a macro chunk from an AST node."""
        start_line = node.start_point[0] + 1
        end_line = node.end_point[0] + 1

        # Extract the symbol name
        name_node = node.child_by_field_name('name')
        symbol_name = content[name_node.start_byte:name_node.end_byte]

        # Extract the full code (including docstring and decorators)
        chunk_content = self._extract_with_context(node, content)

        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start_line}",
            parent_id=None,
            level=1,
            chunk_type=node.type,
            file_path=file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=symbol_name,
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )

    def _extract_with_context(self, node, content: str) -> str:
        """Extract code including decorators and docstring."""
        # Walk backwards over preceding decorator siblings
        start_byte = node.start_byte
        prev_sibling = node.prev_sibling
        while prev_sibling and prev_sibling.type == 'decorator':
            start_byte = prev_sibling.start_byte
            prev_sibling = prev_sibling.prev_sibling

        return content[start_byte:node.end_byte]
```
### 3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)

**Approach**: inside each macro chunk, split further along the logical structure.

```python
class MicroChunker:
    """Layer-2 chunker: extracts logic blocks."""

    # Logic-block node types to split on
    LOGIC_BLOCK_TYPES = {
        'for_statement',
        'while_statement',
        'if_statement',
        'try_statement',
        'with_statement',
    }

    def __init__(self):
        self.parser = Parser()
        # load the language grammar here (as in MacroChunker)

    def chunk_logic_blocks(
        self,
        macro_chunk: HierarchicalChunk,
        content: str,
        max_lines: int = 50  # only macro chunks longer than this are split again
    ) -> List[HierarchicalChunk]:
        """Extract logic blocks inside a macro chunk."""

        # Small functions need no second-level split
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
        if total_lines <= max_lines:
            return []

        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        root_node = tree.root_node

        micro_chunks = []
        self._traverse_logic_blocks(
            root_node,
            macro_chunk,
            macro_chunk.content,  # byte offsets from this parse are relative to the macro chunk text
            micro_chunks
        )

        return micro_chunks

    def _traverse_logic_blocks(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str,
        result: List[HierarchicalChunk]
    ):
        """Recursively traverse the AST and extract logic blocks."""

        if node.type in self.LOGIC_BLOCK_TYPES:
            micro_chunk = self._create_micro_chunk(
                node,
                parent_chunk,
                content
            )
            result.append(micro_chunk)
            parent_chunk.children.append(micro_chunk)

        # Keep traversing child nodes
        for child in node.children:
            self._traverse_logic_blocks(child, parent_chunk, content, result)

    def _create_micro_chunk(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str
    ) -> HierarchicalChunk:
        """Create a micro chunk."""

        # Compute line numbers relative to the file
        start_line = parent_chunk.metadata.start_line + node.start_point[0]
        end_line = parent_chunk.metadata.start_line + node.end_point[0]

        chunk_content = content[node.start_byte:node.end_byte]

        metadata = ChunkMetadata(
            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
            parent_id=parent_chunk.metadata.chunk_id,
            level=2,
            chunk_type=node.type,
            file_path=parent_chunk.metadata.file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=parent_chunk.metadata.symbol_name,  # inherit the parent's symbol name
            context_summary=None,  # filled in later by the LLM
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )
```
### 3.3 Unified Interface: The Multi-Level Chunker

```python
class HierarchicalChunker:
    """Unified interface for multi-level chunking."""

    def __init__(self, config: ChunkConfig = None):
        self.config = config or ChunkConfig()
        self.macro_chunker = MacroChunker()
        self.micro_chunker = MicroChunker()

    def chunk_file(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Run multi-level chunking over one file."""

        # Layer 1: symbol-level chunking
        macro_chunks = self.macro_chunker.chunk_by_symbols(
            content, file_path, language
        )

        # Layer 2: logic-block chunking
        all_chunks = []
        for macro_chunk in macro_chunks:
            all_chunks.append(macro_chunk)

            # Split large functions a second time
            micro_chunks = self.micro_chunker.chunk_logic_blocks(
                macro_chunk, content
            )
            all_chunks.extend(micro_chunks)

        return all_chunks

    def chunk_file_with_fallback(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Chunking with a fallback strategy."""

        try:
            return self.chunk_file(content, file_path, language)
        except Exception as e:
            logger.warning(f"Hierarchical chunking failed: {e}, falling back to sliding window")
            # Fall back to the sliding-window strategy
            return self._fallback_sliding_window(content, file_path, language)
```
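`_fallback_sliding_window` is referenced above but not specified in this design. A minimal sketch, assuming plain line-based windows with overlap (the window sizes are illustrative, not decided here):

```python
def _fallback_sliding_window(
    self,
    content: str,
    file_path: str,
    language: str,
    window_lines: int = 40,    # illustrative defaults
    overlap_lines: int = 10
) -> List[HierarchicalChunk]:
    """Fallback: fixed-size line windows with overlap, emitted as level-1 chunks."""
    lines = content.splitlines()
    step = max(window_lines - overlap_lines, 1)
    chunks: List[HierarchicalChunk] = []

    for start in range(0, len(lines), step):
        end = min(start + window_lines, len(lines))
        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start + 1}",
            parent_id=None,
            level=1,
            chunk_type="window",
            file_path=file_path,
            start_line=start + 1,
            end_line=end,
        )
        chunks.append(HierarchicalChunk(metadata=metadata,
                                        content="\n".join(lines[start:end])))
        if end >= len(lines):
            break
    return chunks
```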
## 4. Data Storage Design

### 4.1 Database Schema

```sql
-- chunks table: stores chunks at every level
CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY,
    parent_id TEXT,                  -- parent chunk ID, NULL for top-level chunks
    level INTEGER NOT NULL,          -- 1=macro, 2=micro
    chunk_type TEXT NOT NULL,        -- function/class/loop/if/try, etc.
    file_path TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    symbol_name TEXT,
    content TEXT NOT NULL,
    content_hash TEXT,               -- used to detect content changes

    -- Semantic metadata (generated by the LLM)
    summary TEXT,
    keywords TEXT,                   -- JSON array
    purpose TEXT,

    -- Vector embedding
    embedding BLOB,                  -- serialized vector

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
);

-- Index optimization
CREATE INDEX idx_chunks_file_path ON chunks(file_path);
CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
CREATE INDEX idx_chunks_level ON chunks(level);
CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);
```
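A small initialization sketch for this schema. One detail worth noting: SQLite leaves foreign-key enforcement off by default, so the `ON DELETE CASCADE` above only takes effect when `PRAGMA foreign_keys = ON` is issued on each connection. The `schema.sql` filename is an assumption:

```python
import sqlite3
from pathlib import Path

def init_chunk_db(db_path: Path, schema_path: Path = Path("schema.sql")) -> sqlite3.Connection:
    """Create the chunks schema and enable FK enforcement so cascading deletes work."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")       # off by default in SQLite
    conn.executescript(schema_path.read_text())    # the DDL from the block above
    return conn
```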
### 4.2 Vector Index

Use a hierarchical indexing strategy:

```python
import sqlite3
from pathlib import Path
from typing import Dict, List, Optional, Tuple


class HierarchicalVectorStore:
    """Hierarchical vector store."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)

    def add_chunk(self, chunk: HierarchicalChunk):
        """Insert a chunk together with its embedding."""

        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            chunk.metadata.chunk_id,
            chunk.metadata.parent_id,
            chunk.metadata.level,
            chunk.metadata.chunk_type,
            chunk.metadata.file_path,
            chunk.metadata.start_line,
            chunk.metadata.end_line,
            chunk.metadata.symbol_name,
            chunk.content,
            self._serialize_embedding(chunk.embedding),
        ))

        self.conn.commit()

    def search_hierarchical(
        self,
        query_embedding: List[float],
        top_k: int = 10,
        level_weights: Dict[int, float] = None
    ) -> List[Tuple[HierarchicalChunk, float]]:
        """Hierarchical retrieval."""

        # Default weights: macro chunks are weighted higher
        if level_weights is None:
            level_weights = {1: 1.0, 2: 0.8}

        # Fetch every chunk that has an embedding
        cursor = self.conn.cursor()
        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")

        results = []
        for row in cursor.fetchall():
            chunk = self._row_to_chunk(row)
            similarity = self._cosine_similarity(
                query_embedding,
                chunk.embedding
            )

            # Apply the level weight
            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
            results.append((chunk, weighted_score))

        # Sort by score
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

    def get_chunk_with_context(
        self,
        chunk_id: str
    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
        """Fetch a chunk together with its parent chunk (for context)."""

        cursor = self.conn.cursor()

        # Fetch the chunk itself
        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
        chunk_row = cursor.fetchone()
        chunk = self._row_to_chunk(chunk_row)

        # Fetch the parent chunk
        parent = None
        if chunk.metadata.parent_id:
            cursor.execute(
                "SELECT * FROM chunks WHERE chunk_id = ?",
                (chunk.metadata.parent_id,)
            )
            parent_row = cursor.fetchone()
            if parent_row:
                parent = self._row_to_chunk(parent_row)

        return chunk, parent
```
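The store above leans on `_serialize_embedding`, `_cosine_similarity`, and `_row_to_chunk`, which are not spelled out. A minimal sketch of the first two as methods of `HierarchicalVectorStore`, assuming embeddings are packed as float32 bytes into the `embedding` BLOB:

```python
import numpy as np

def _serialize_embedding(self, embedding: Optional[List[float]]) -> Optional[bytes]:
    """Pack the embedding as float32 bytes for the BLOB column; None stays NULL."""
    if embedding is None:
        return None
    return np.asarray(embedding, dtype=np.float32).tobytes()

def _deserialize_embedding(self, blob: Optional[bytes]) -> Optional[List[float]]:
    """Inverse of _serialize_embedding, used when rebuilding chunks from rows."""
    if blob is None:
        return None
    return np.frombuffer(blob, dtype=np.float32).tolist()

def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
    """Plain cosine similarity between two vectors."""
    va = np.asarray(a, dtype=np.float32)
    vb = np.asarray(b, dtype=np.float32)
    denom = float(np.linalg.norm(va) * np.linalg.norm(vb))
    return float(va @ vb) / denom if denom else 0.0
```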
## 5. LLM Integration Strategy

### 5.1 Layered Generation of Semantic Metadata

```python
class HierarchicalLLMEnhancer:
    """Generates semantic metadata for hierarchical chunks."""

    def enhance_hierarchical_chunks(
        self,
        chunks: List[HierarchicalChunk]
    ) -> Dict[str, SemanticMetadata]:
        """
        Layered processing strategy:
        1. Process all level-1 macro chunks first and generate detailed summaries.
        2. Then process level-2 micro chunks, using the parent chunk's summary as context.
        """

        results = {}

        # Round 1: macro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        macro_metadata = self.llm_enhancer.enhance_files([
            FileData(
                path=c.metadata.chunk_id,
                content=c.content,
                language=self._detect_language(c.metadata.file_path)
            )
            for c in macro_chunks
        ])
        results.update(macro_metadata)

        # Round 2: micro chunks (with parent context)
        micro_chunks = [c for c in chunks if c.metadata.level == 2]
        for micro_chunk in micro_chunks:
            parent_id = micro_chunk.metadata.parent_id
            parent_summary = macro_metadata.get(parent_id, {}).get('summary', '')

            # Build a prompt that carries the parent context
            enhanced_prompt = f"""
Parent Function: {micro_chunk.metadata.symbol_name}
Parent Summary: {parent_summary}

Code Block ({micro_chunk.metadata.chunk_type}):
```
{micro_chunk.content}
```

Generate a concise summary (1 sentence) and keywords for this specific code block.
"""

            metadata = self._call_llm_with_context(enhanced_prompt)
            results[micro_chunk.metadata.chunk_id] = metadata

        return results
```
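`_detect_language` and `_call_llm_with_context` are left abstract above. A minimal sketch under assumptions: the extension map is illustrative, and `self.llm_enhancer.complete()` stands in for whatever completion call the existing enhancer actually exposes:

```python
import json
from pathlib import Path

EXT_TO_LANGUAGE = {".py": "python", ".ts": "typescript", ".go": "go"}  # extend as needed

def _detect_language(self, file_path: str) -> str:
    """Map a file extension to a language tag; fall back to plain text."""
    return EXT_TO_LANGUAGE.get(Path(file_path).suffix, "text")

def _call_llm_with_context(self, prompt: str) -> dict:
    """Send the prompt to the underlying LLM client and parse a JSON object out of it."""
    raw = self.llm_enhancer.complete(prompt)   # assumed client method, not a fixed API
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to treating the whole response as the summary
        return {"summary": raw.strip(), "keywords": []}
```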
### 5.2 Prompt Optimization

Use a different prompt template for each level:

**Macro Chunk Prompt (Level 1)**:
```
PURPOSE: Generate comprehensive semantic metadata for a complete function/class
TASK:
- Provide a detailed summary (2-3 sentences) covering what the code does and why
- Extract 8-12 relevant keywords including technical terms and domain concepts
- Identify the primary purpose/category
MODE: analysis

CODE:
```{language}
{content}
```

OUTPUT: JSON with summary, keywords, purpose
```

**Micro Chunk Prompt (Level 2)**:
```
PURPOSE: Summarize a specific logic block within a larger function
CONTEXT:
- Parent Function: {symbol_name}
- Parent Purpose: {parent_summary}

TASK:
- Provide a brief summary (1 sentence) of this specific block's role in the parent function
- Extract 3-5 keywords specific to this block's logic
MODE: analysis

CODE BLOCK ({chunk_type}):
```{language}
{content}
```

OUTPUT: JSON with summary, keywords
```
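A minimal sketch of how the level-2 template could be filled from a micro chunk; the template string below is an abridged paraphrase of the block above, and the function name is illustrative:

```python
MICRO_PROMPT_TEMPLATE = """PURPOSE: Summarize a specific logic block within a larger function
CONTEXT:
- Parent Function: {symbol_name}
- Parent Purpose: {parent_summary}

TASK:
- Provide a brief summary (1 sentence) of this block's role in the parent function
- Extract 3-5 keywords specific to this block's logic
MODE: analysis

CODE BLOCK ({chunk_type}):
{content}

OUTPUT: JSON with summary, keywords
"""

def render_micro_prompt(chunk: HierarchicalChunk, parent_summary: str) -> str:
    """Fill the level-2 template from a micro chunk and its parent's summary."""
    return MICRO_PROMPT_TEMPLATE.format(
        symbol_name=chunk.metadata.symbol_name,
        parent_summary=parent_summary,
        chunk_type=chunk.metadata.chunk_type,
        content=chunk.content,
    )
```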
## 6. Retrieval Enhancement

### 6.1 Context-Expanding Retrieval

```python
class ContextualSearchEngine:
    """Search engine with context expansion."""

    def search_with_context(
        self,
        query: str,
        top_k: int = 10,
        expand_context: bool = True
    ) -> List[SearchResult]:
        """
        Search and automatically expand context.

        If a micro chunk matches, its parent macro chunk is returned as context.
        """

        # Embed the query
        query_embedding = self.embedder.embed_single(query)

        # Hierarchical retrieval
        raw_results = self.vector_store.search_hierarchical(
            query_embedding,
            top_k=top_k
        )

        # Expand context
        enriched_results = []
        for chunk, score in raw_results:
            result = SearchResult(
                path=chunk.metadata.file_path,
                score=score,
                content=chunk.content,
                start_line=chunk.metadata.start_line,
                end_line=chunk.metadata.end_line,
                symbol_name=chunk.metadata.symbol_name,
            )

            # For micro chunks, fetch the parent chunk as context
            if expand_context and chunk.metadata.level == 2:
                # get_chunk_with_context returns (chunk itself, parent)
                _, parent_chunk = self.vector_store.get_chunk_with_context(
                    chunk.metadata.chunk_id
                )
                if parent_chunk:
                    result.metadata['parent_context'] = {
                        'summary': parent_chunk.metadata.context_summary,
                        'symbol_name': parent_chunk.metadata.symbol_name,
                        'content': parent_chunk.content,
                    }

            enriched_results.append(result)

        return enriched_results
```
## 7. Testing Strategy

### 7.1 Unit Tests

```python
import pytest
from codexlens.semantic.hierarchical_chunker import (
    HierarchicalChunker, MacroChunker, MicroChunker
)

class TestMacroChunker:
    """Tests for layer-1 chunking."""

    def test_extract_functions(self):
        """Extract function definitions."""
        code = '''
def calculate_total(items):
    """Calculate total price."""
    total = 0
    for item in items:
        total += item.price
    return total

def apply_discount(total, discount):
    """Apply discount to total."""
    return total * (1 - discount)
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 2
        assert chunks[0].metadata.symbol_name == 'calculate_total'
        assert chunks[1].metadata.symbol_name == 'apply_discount'
        assert chunks[0].metadata.level == 1

    def test_extract_with_decorators(self):
        """Extract a function with decorators."""
        code = '''
@app.route('/api/users')
@auth_required
def get_users():
    return User.query.all()
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 1
        assert '@app.route' in chunks[0].content
        assert '@auth_required' in chunks[0].content

class TestMicroChunker:
    """Tests for layer-2 chunking."""

    def test_extract_loop_blocks(self):
        """Extract loop blocks."""
        code = '''
def process_items(items):
    results = []
    for item in items:
        if item.active:
            results.append(process(item))
    return results
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code
        )

        # The for loop and the if block should be extracted
        assert len(micro_chunks) >= 1
        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)

    def test_skip_small_functions(self):
        """Small functions skip the second-level split."""
        code = '''
def small_func(x):
    return x * 2
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=10
        )

        # Small functions should not be split again
        assert len(micro_chunks) == 0

class TestHierarchicalChunker:
    """Tests for the full multi-level chunking flow."""

    def test_full_hierarchical_chunking(self):
        """Run the complete hierarchical chunking pipeline."""
        code = '''
def complex_function(data):
    """A complex function with multiple logic blocks."""

    # Validation
    if not data:
        raise ValueError("Data is empty")

    # Processing
    results = []
    for item in data:
        try:
            processed = process_item(item)
            results.append(processed)
        except Exception as e:
            logger.error(f"Failed to process: {e}")
            continue

    # Aggregation
    total = sum(r.value for r in results)
    return total
'''
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(code, 'test.py', 'python')

        # Expect 1 macro chunk and several micro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        micro_chunks = [c for c in chunks if c.metadata.level == 2]

        assert len(macro_chunks) == 1
        assert len(micro_chunks) > 0

        # Verify parent-child relationships
        for micro in micro_chunks:
            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id
```
### 7.2 Integration Tests

```python
class TestHierarchicalIndexing:
    """Tests the full indexing pipeline."""

    def test_index_and_search(self):
        """Hierarchical indexing and retrieval end to end."""

        # 1. Chunking
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')

        # 2. LLM enhancement
        enhancer = HierarchicalLLMEnhancer()
        metadata = enhancer.enhance_hierarchical_chunks(chunks)

        # 3. Embedding
        embedder = Embedder()
        for chunk in chunks:
            text = metadata[chunk.metadata.chunk_id].summary
            chunk.embedding = embedder.embed_single(text)

        # 4. Storage
        vector_store = HierarchicalVectorStore(Path('/tmp/test.db'))
        for chunk in chunks:
            vector_store.add_chunk(chunk)

        # 5. Retrieval
        search_engine = ContextualSearchEngine(vector_store, embedder)
        results = search_engine.search_with_context(
            "find loop that processes items",
            top_k=5
        )

        # Verify the results
        assert len(results) > 0
        assert any(r.metadata.get('parent_context') for r in results)
```
## 8. Performance Optimization

### 8.1 Batch Processing

```python
class BatchHierarchicalProcessor:
    """Batch hierarchical chunking over multiple files."""

    def process_files_batch(
        self,
        file_paths: List[Path],
        batch_size: int = 10
    ):
        """Process files in batches to optimize LLM calls."""

        all_chunks = []

        # 1. Chunk all files
        for file_path in file_paths:
            content = file_path.read_text()
            chunks = self.chunker.chunk_file(
                content, str(file_path), self._detect_language(file_path)
            )
            all_chunks.extend(chunks)

        # 2. Batch LLM enhancement (fewer API calls)
        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
        for i in range(0, len(macro_chunks), batch_size):
            batch = macro_chunks[i:i+batch_size]
            self.enhancer.enhance_batch(batch)

        # 3. Batch embedding
        all_texts = [c.content for c in all_chunks]
        embeddings = self.embedder.embed_batch(all_texts)
        for chunk, embedding in zip(all_chunks, embeddings):
            chunk.embedding = embedding

        # 4. Batch storage
        self.vector_store.add_chunks_batch(all_chunks)
```
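`add_chunks_batch` is referenced above but not defined on the vector store in Section 4.2. A minimal sketch as a method of `HierarchicalVectorStore`, reusing the same INSERT inside a single transaction:

```python
def add_chunks_batch(self, chunks: List[HierarchicalChunk]) -> None:
    """Insert many chunks at once; one transaction, committed on success."""
    rows = [
        (
            c.metadata.chunk_id, c.metadata.parent_id, c.metadata.level,
            c.metadata.chunk_type, c.metadata.file_path,
            c.metadata.start_line, c.metadata.end_line, c.metadata.symbol_name,
            c.content, self._serialize_embedding(c.embedding),
        )
        for c in chunks
    ]
    with self.conn:  # rolls back automatically if any insert fails
        self.conn.executemany("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, rows)
```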
### 8.2 Incremental Updates

```python
class IncrementalIndexer:
    """Incremental indexer: only re-processes changed files."""

    def update_file(self, file_path: Path):
        """Incrementally update a single file."""

        content = file_path.read_text()
        content_hash = hashlib.sha256(content.encode()).hexdigest()

        # Check whether the file has changed
        cursor = self.conn.cursor()
        cursor.execute("""
            SELECT content_hash FROM chunks
            WHERE file_path = ? AND level = 1
            LIMIT 1
        """, (str(file_path),))

        row = cursor.fetchone()
        if row and row[0] == content_hash:
            logger.info(f"File {file_path} unchanged, skipping")
            return

        # Delete the old chunks
        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))

        # Re-index
        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
        # ... continue processing
```
## 9. Potential Problems and Solutions

### 9.1 Problem: Too Many Micro Chunks for Very Large Functions

**Symptom**: some legacy functions exceed 1000 lines and could produce dozens of micro chunks.

**Solution**:
```python
class AdaptiveMicroChunker:
    """Adaptive micro-chunking: adjusts the strategy to the function size."""

    def chunk_logic_blocks(self, macro_chunk, content):
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line

        if total_lines > 500:
            # Very large function: extract only top-level logic blocks, no recursion
            return self._extract_top_level_blocks(macro_chunk, content)
        elif total_lines > 100:
            # Large function: limit recursion depth to 2 levels
            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
        else:
            # Normal-sized function: skip micro chunking entirely
            return []
```
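`_extract_top_level_blocks` and `_extract_blocks_with_depth_limit` are not spelled out. A minimal sketch, assuming `AdaptiveMicroChunker` extends the `MicroChunker` from Section 3.2 so it can reuse its parser and `_create_micro_chunk`:

```python
class AdaptiveMicroChunker(MicroChunker):  # assumed to extend MicroChunker

    def _extract_blocks_with_depth_limit(
        self,
        macro_chunk: HierarchicalChunk,
        content: str,
        max_depth: int = 2
    ) -> List[HierarchicalChunk]:
        """Like the normal traversal, but stop after max_depth nested logic blocks."""
        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        result: List[HierarchicalChunk] = []

        def walk(node, depth: int):
            next_depth = depth
            if node.type in self.LOGIC_BLOCK_TYPES:
                micro = self._create_micro_chunk(node, macro_chunk, macro_chunk.content)
                result.append(micro)
                macro_chunk.children.append(micro)
                next_depth = depth + 1
                if next_depth >= max_depth:
                    return  # do not descend into deeper nested blocks
            for child in node.children:
                walk(child, next_depth)

        walk(tree.root_node, 0)
        return result

    def _extract_top_level_blocks(
        self,
        macro_chunk: HierarchicalChunk,
        content: str
    ) -> List[HierarchicalChunk]:
        """Special case: keep only the outermost logic blocks."""
        return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=1)
```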
### 9.2 Problem: tree-sitter Parse Failures

**Symptom**: tree-sitter parsing can fail on code with syntax errors.

**Solution**:
```python
def chunk_file_with_fallback(self, content, file_path, language):
    """Chunking with a fallback chain."""

    try:
        # Try hierarchical chunking first
        return self.chunk_file(content, file_path, language)
    except TreeSitterError as e:
        logger.warning(f"Tree-sitter parsing failed: {e}")

        # Fall back to simple regex-based symbol extraction
        return self._fallback_regex_chunking(content, file_path)
    except Exception as e:
        logger.error(f"Chunking failed completely: {e}")

        # Final fallback: sliding window
        return self._fallback_sliding_window(content, file_path, language)
```
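`_fallback_regex_chunking` is also left abstract. A rough sketch for Python sources; the pattern only catches top-level `def`/`class` headers and is an assumption, not the project's actual fallback:

```python
import re

def _fallback_regex_chunking(self, content: str, file_path: str) -> List[HierarchicalChunk]:
    """Very rough symbol extraction: split the file at top-level def/class headers."""
    header = re.compile(r'^(def|class)\s+(\w+)', re.MULTILINE)
    matches = list(header.finditer(content))
    chunks: List[HierarchicalChunk] = []

    for i, match in enumerate(matches):
        start = match.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        start_line = content.count('\n', 0, start) + 1
        end_line = content.count('\n', 0, end) + 1
        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start_line}",
            parent_id=None,
            level=1,
            chunk_type=match.group(1),   # 'def' or 'class'
            file_path=file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=match.group(2),
        )
        chunks.append(HierarchicalChunk(metadata=metadata, content=content[start:end]))

    return chunks
```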
### 9.3 Problem: Vector Storage Footprint

**Symptom**: storing an embedding for every chunk can consume a lot of space.

**Solutions**:
- **Selective vectorization**: only embed macro chunks and the important micro chunks
- **Vector compression**: reduce dimensionality with PCA or quantization (see the sketch after the code below)
- **Separate storage**: keep vectors in a dedicated vector database (e.g. Faiss), with SQLite holding only metadata

```python
class SelectiveVectorization:
    """Selective vectorization: reduces storage overhead."""

    VECTORIZE_CHUNK_TYPES = {
        'function_definition',   # always vectorize
        'class_definition',      # always vectorize
        'for_statement',         # loop blocks
        'try_statement',         # exception handling
        # 'if_statement' is usually not vectorized on its own; rely on the parent chunk
    }

    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
        """Decide whether a chunk needs its own embedding."""

        # Level 1 is always vectorized
        if chunk.metadata.level == 1:
            return True

        # Level 2 depends on type and size
        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
            return False

        # Very small blocks (<5 lines) are not vectorized
        lines = chunk.metadata.end_line - chunk.metadata.start_line
        if lines < 5:
            return False

        return True
```
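For the vector-compression option above, one concrete possibility is scalar (int8) quantization of float32 embeddings; PCA or a dedicated vector store such as Faiss would be alternatives. A minimal sketch:

```python
import numpy as np
from typing import List, Tuple

def quantize_embedding(embedding: List[float]) -> Tuple[bytes, float]:
    """Symmetric int8 quantization: 1 byte per dimension plus a single scale factor."""
    vec = np.asarray(embedding, dtype=np.float32)
    scale = float(np.max(np.abs(vec))) or 1.0
    quantized = np.round(vec / scale * 127).astype(np.int8)
    return quantized.tobytes(), scale

def dequantize_embedding(blob: bytes, scale: float) -> List[float]:
    """Approximate reconstruction of the original float32 vector."""
    return (np.frombuffer(blob, dtype=np.int8).astype(np.float32) * scale / 127).tolist()
```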
## 10. Implementation Roadmap

### Phase 1: Core Infrastructure (2-3 weeks)
- [x] Design the data structures (HierarchicalChunk, ChunkMetadata)
- [ ] Implement MacroChunker (reusing the existing code_extractor)
- [ ] Implement a basic MicroChunker
- [ ] Database schema design and migration
- [ ] Unit tests

### Phase 2: LLM Integration (1-2 weeks)
- [ ] Implement HierarchicalLLMEnhancer
- [ ] Design the layered prompt templates
- [ ] Batch-processing optimization
- [ ] Integration tests

### Phase 3: Embedding and Retrieval (1-2 weeks)
- [ ] Implement HierarchicalVectorStore
- [ ] Implement ContextualSearchEngine
- [ ] Context-expansion logic
- [ ] Retrieval performance tests

### Phase 4: Optimization and Hardening (2 weeks)
- [ ] Performance optimization (batch processing, incremental updates)
- [ ] Complete the fallback strategies
- [ ] Selective vectorization
- [ ] Full test pass and documentation

### Phase 5: Production Rollout (1 week)
- [ ] CLI integration
- [ ] Expose configuration options
- [ ] Production testing
- [ ] Release

**Total estimated time**: 7-10 weeks

## 11. Success Metrics

1. **Coverage**: 95%+ of code is chunked correctly
2. **Accuracy**: parent-child relations are correct for >98% of chunks
3. **Retrieval quality**: 30%+ improvement in retrieval relevance over single-level chunking
4. **Performance**: <100 ms to chunk a single file; >100 files/minute in batch processing
5. **Storage efficiency**: 40%+ less space than vectorizing everything

## 12. References

- [Tree-sitter Documentation](https://tree-sitter.github.io/)
- [AST-based Code Analysis](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
- [Hierarchical Text Segmentation](https://arxiv.org/abs/2104.08836)
- Existing code: `src/codexlens/semantic/chunker.py`