mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-02-13 02:41:50 +08:00
feat: add semantic graph design for static code analysis
- Introduced a comprehensive design document for a Code Semantic Graph aimed at enhancing static analysis capabilities. - Defined the architecture, core components, and implementation steps for analyzing function calls, data flow, and dependencies. - Included detailed specifications for nodes and edges in the graph, along with database schema for storage. - Outlined phases for implementation, technical challenges, success metrics, and application scenarios.
This commit is contained in:
972
codex-lens/docs/DOCSTRING_LLM_HYBRID_DESIGN.md
Normal file
972
codex-lens/docs/DOCSTRING_LLM_HYBRID_DESIGN.md
Normal file
@@ -0,0 +1,972 @@
|
||||
# Docstring与LLM混合策略设计方案
|
||||
|
||||
## 1. 背景与目标
|
||||
|
||||
### 1.1 当前问题
|
||||
|
||||
现有 `llm_enhancer.py` 的实现存在以下问题:
|
||||
|
||||
1. **忽略已有文档**:对所有代码无差别调用LLM,即使已有高质量的docstring
|
||||
2. **成本浪费**:重复生成已有信息,增加API调用费用和时间
|
||||
3. **信息质量不一致**:LLM生成的内容可能不如作者编写的docstring准确
|
||||
4. **缺少作者意图**:丢失了docstring中的设计决策、使用示例等关键信息
|
||||
|
||||
### 1.2 设计目标
|
||||
|
||||
实现**智能混合策略**,结合docstring和LLM的优势:
|
||||
|
||||
1. **优先使用docstring**:作为最权威的信息源
|
||||
2. **LLM作为补充**:填补docstring缺失或质量不足的部分
|
||||
3. **智能质量评估**:自动判断docstring质量,决定是否需要LLM增强
|
||||
4. **成本优化**:减少不必要的LLM调用,降低API费用
|
||||
5. **信息融合**:将docstring和LLM生成的内容有机结合
|
||||
|
||||
## 2. 技术架构
|
||||
|
||||
### 2.1 整体流程
|
||||
|
||||
```
|
||||
Code Symbol
|
||||
↓
|
||||
[Docstring Extractor] ← 提取docstring
|
||||
↓
|
||||
[Quality Evaluator] ← 评估docstring质量
|
||||
↓
|
||||
├─ High Quality → Use Docstring Directly
|
||||
│ + LLM Generate Keywords Only
|
||||
│
|
||||
├─ Medium Quality → LLM Refine & Enhance
|
||||
│ (docstring作为base)
|
||||
│
|
||||
└─ Low/No Docstring → LLM Full Generation
|
||||
(现有流程)
|
||||
↓
|
||||
[Metadata Merger] ← 合并docstring和LLM内容
|
||||
↓
|
||||
Final SemanticMetadata
|
||||
```
|
||||
|
||||
### 2.2 核心组件
|
||||
|
||||
```python
|
||||
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DocstringQuality(Enum):
    """Quality level of a docstring, used to pick an enhancement strategy."""

    MISSING = "missing"  # no docstring at all
    LOW = "low"          # low quality: <10 chars or a pure placeholder
    MEDIUM = "medium"    # medium quality: has a basic description but incomplete
    HIGH = "high"        # high quality: detailed and structured (Args/Returns/...)


@dataclass
class DocstringMetadata:
    """Structured metadata extracted from a raw docstring."""

    raw_text: str                      # the raw docstring text as found in the code
    quality: DocstringQuality          # evaluated quality level
    summary: Optional[str] = None      # extracted summary (first non-empty line)
    parameters: Optional[dict] = None  # param name -> {'type', 'description'}
    returns: Optional[str] = None      # description of the return value
    examples: Optional[str] = None     # usage-examples section, if present
    notes: Optional[str] = None        # notes / caveats section, if present
|
||||
```
|
||||
|
||||
## 3. 详细实现步骤
|
||||
|
||||
### 3.1 Docstring提取与解析
|
||||
|
||||
```python
|
||||
import re
from typing import List, Optional


class DocstringExtractor:
    """Extract raw docstrings from Python source and parse them into metadata.

    Recognizes Google-style and NumPy-style section headers; anything else is
    treated as free-form text from which only a summary is taken.
    """

    # Section headers marking a Google-style docstring.
    GOOGLE_STYLE_PATTERN = re.compile(
        r'Args:|Returns:|Raises:|Examples:|Note:',
        re.MULTILINE
    )

    # Dash-underlined section headers marking a NumPy-style docstring.
    NUMPY_STYLE_PATTERN = re.compile(
        r'Parameters\n-+|Returns\n-+|Examples\n-+',
        re.MULTILINE
    )

    def extract_from_code(self, content: str, symbol: Symbol) -> Optional[str]:
        """Return the raw docstring of *symbol*, or None when absent.

        Looks for the first triple-quoted string within the few lines that
        follow the definition line (symbol.range is assumed 1-based).
        """

        lines = content.splitlines()
        start_line = symbol.range[0] - 1  # 0-indexed

        # The docstring is conventionally the first statement after the
        # definition line, so only scan a short window.
        for i in range(start_line + 1, min(start_line + 10, len(lines))):
            line = lines[i].strip()

            # Python triple-quoted string
            if line.startswith('"""') or line.startswith("'''"):
                return self._extract_multiline_docstring(lines, i)

        return None

    def _extract_multiline_docstring(
        self,
        lines: List[str],
        start_idx: int
    ) -> str:
        """Collect docstring text from its opening quotes to the closing ones."""

        quote_char = '"""' if lines[start_idx].strip().startswith('"""') else "'''"
        docstring_lines = []

        # Single-line docstring: opening and closing quotes on the same line.
        first_line = lines[start_idx].strip()
        if first_line.count(quote_char) == 2:
            # e.g. """This is a docstring."""
            return first_line.strip(quote_char).strip()

        # Multi-line docstring: accumulate until the closing quotes.
        for i in range(start_idx, len(lines)):
            line = lines[i]

            if i == start_idx:
                # First line: drop the opening quotes.
                docstring_lines.append(line.strip().lstrip(quote_char))
            elif quote_char in line:
                # Closing line: drop the closing quotes and stop.
                docstring_lines.append(line.strip().rstrip(quote_char))
                break
            else:
                docstring_lines.append(line.strip())

        return '\n'.join(docstring_lines).strip()

    def parse_docstring(self, raw_docstring: str) -> DocstringMetadata:
        """Parse a raw docstring into DocstringMetadata (quality + sections)."""

        if not raw_docstring:
            return DocstringMetadata(
                raw_text="",
                quality=DocstringQuality.MISSING
            )

        # Rate the docstring before extracting sections.
        quality = self._evaluate_quality(raw_docstring)

        metadata = DocstringMetadata(
            raw_text=raw_docstring,
            quality=quality,
        )

        # Summary: first non-empty line (or first paragraph).
        metadata.summary = self._extract_summary(raw_docstring)

        # Structured sections for Google / NumPy styles.
        if self.GOOGLE_STYLE_PATTERN.search(raw_docstring):
            self._parse_google_style(raw_docstring, metadata)
        elif self.NUMPY_STYLE_PATTERN.search(raw_docstring):
            self._parse_numpy_style(raw_docstring, metadata)

        return metadata

    def _evaluate_quality(self, docstring: str) -> DocstringQuality:
        """Rate a docstring MISSING / LOW / MEDIUM / HIGH."""

        stripped = docstring.strip() if docstring else ""
        if not stripped:
            return DocstringQuality.MISSING

        # Placeholder check. Fix: only demote *short* docstrings -- previously
        # a detailed docstring merely containing "todo" or "..." anywhere was
        # classified LOW before the structure/length checks could run.
        placeholders = ['todo', 'fixme', 'tbd', 'placeholder', '...']
        if len(stripped) < 30 and any(p in stripped.lower() for p in placeholders):
            return DocstringQuality.LOW

        # Too short to be informative.
        if len(stripped) < 10:
            return DocstringQuality.LOW

        # Structured (Google/NumPy) docstrings rate higher.
        has_structure = (
            self.GOOGLE_STYLE_PATTERN.search(docstring) or
            self.NUMPY_STYLE_PATTERN.search(docstring)
        )

        # Enough descriptive text?
        word_count = len(docstring.split())

        if has_structure and word_count >= 20:
            return DocstringQuality.HIGH
        elif word_count >= 10:
            return DocstringQuality.MEDIUM
        else:
            return DocstringQuality.LOW

    def _extract_summary(self, docstring: str) -> str:
        """Return the first non-empty line as the summary ("" if none)."""

        lines = docstring.split('\n')
        for line in lines:
            if line.strip():
                return line.strip()

        return ""

    def _parse_google_style(self, docstring: str, metadata: DocstringMetadata):
        """Fill metadata sections from a Google-style docstring."""

        # Args section.
        args_match = re.search(r'Args:(.*?)(?=Returns:|Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
        if args_match:
            metadata.parameters = self._parse_args_section(args_match.group(1))

        # Returns section.
        returns_match = re.search(r'Returns:(.*?)(?=Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
        if returns_match:
            metadata.returns = returns_match.group(1).strip()

        # Examples section.
        examples_match = re.search(r'Examples:(.*?)(?=Note:|\Z)', docstring, re.DOTALL)
        if examples_match:
            metadata.examples = examples_match.group(1).strip()

    def _parse_numpy_style(self, docstring: str, metadata: DocstringMetadata):
        """Fill metadata sections from a NumPy-style docstring.

        Fix: parse_docstring dispatched to this method but it was never
        defined, so NumPy-style docstrings raised AttributeError.
        """

        params_match = re.search(r'Parameters\n-+\n(.*?)(?=\n\S+\n-+|\Z)', docstring, re.DOTALL)
        if params_match:
            # NOTE: _parse_args_section is tuned for Google "name (type):"
            # lines; for "name : type" NumPy lines the type lands in the
            # description field -- acceptable as a minimal parse.
            metadata.parameters = self._parse_args_section(params_match.group(1))

        returns_match = re.search(r'Returns\n-+\n(.*?)(?=\n\S+\n-+|\Z)', docstring, re.DOTALL)
        if returns_match:
            metadata.returns = returns_match.group(1).strip()

        examples_match = re.search(r'Examples\n-+\n(.*?)(?=\n\S+\n-+|\Z)', docstring, re.DOTALL)
        if examples_match:
            metadata.examples = examples_match.group(1).strip()

    def _parse_args_section(self, args_text: str) -> dict:
        """Parse an Args block into {name: {'type': ..., 'description': ...}}."""

        params = {}
        # Matches "param_name (type): description" or "param_name: description".
        pattern = re.compile(r'(\w+)\s*(?:\(([^)]+)\))?\s*:\s*(.+)')

        for line in args_text.split('\n'):
            match = pattern.search(line.strip())
            if match:
                param_name, param_type, description = match.groups()
                params[param_name] = {
                    'type': param_type,
                    'description': description.strip()
                }

        return params
|
||||
```
|
||||
|
||||
### 3.2 智能混合策略引擎
|
||||
|
||||
```python
|
||||
class HybridEnhancer:
    """Combine author-written docstrings with LLM output to build semantic metadata.

    Strategy per symbol, chosen from docstring quality:
      HIGH        -> use the docstring verbatim; LLM generates keywords only
      MEDIUM      -> LLM refines/extends the docstring
      LOW/MISSING -> full LLM generation (the pre-existing pipeline)
    """

    def __init__(
        self,
        llm_enhancer: LLMEnhancer,
        docstring_extractor: DocstringExtractor
    ):
        self.llm_enhancer = llm_enhancer
        self.docstring_extractor = docstring_extractor

    def enhance_with_strategy(
        self,
        file_data: FileData,
        symbols: List[Symbol]
    ) -> Dict[str, SemanticMetadata]:
        """Enhance every symbol, picking the strategy from its docstring quality."""

        results = {}

        for symbol in symbols:
            # 1. Extract and parse the docstring ("" when the symbol has none).
            raw_docstring = self.docstring_extractor.extract_from_code(
                file_data.content, symbol
            )
            doc_metadata = self.docstring_extractor.parse_docstring(raw_docstring or "")

            # 2. Dispatch on the evaluated quality.
            semantic_metadata = self._apply_strategy(
                file_data, symbol, doc_metadata
            )

            results[symbol.name] = semantic_metadata

        return results

    def _apply_strategy(
        self,
        file_data: FileData,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Route to one of the three strategies based on docstring quality."""

        quality = doc_metadata.quality

        if quality == DocstringQuality.HIGH:
            # High quality: use the docstring; LLM only generates keywords.
            return self._use_docstring_with_llm_keywords(symbol, doc_metadata)

        elif quality == DocstringQuality.MEDIUM:
            # Medium quality: let the LLM refine and enhance it.
            return self._refine_with_llm(file_data, symbol, doc_metadata)

        else:  # LOW or MISSING
            # Low quality or absent: generate everything with the LLM.
            return self._full_llm_generation(file_data, symbol)

    def _use_docstring_with_llm_keywords(
        self,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Strategy 1: docstring is authoritative; LLM generates keywords only."""

        # Use the docstring summary directly (truncated raw text as fallback).
        summary = doc_metadata.summary or doc_metadata.raw_text[:200]

        # Cheap LLM call limited to keyword generation.
        keywords = self._generate_keywords_only(summary, symbol.name)

        # Infer the purpose category from the docstring without an LLM call.
        purpose = self._infer_purpose_from_docstring(doc_metadata)

        return SemanticMetadata(
            summary=summary,
            keywords=keywords,
            purpose=purpose,
            file_path=symbol.file_path if hasattr(symbol, 'file_path') else None,
            symbol_name=symbol.name,
            llm_tool="hybrid_docstring_primary",
        )

    def _refine_with_llm(
        self,
        file_data: FileData,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Strategy 2: ask the LLM to refine/extend an existing docstring."""
        import json  # local import: this snippet had no json import in scope

        prompt = f"""
PURPOSE: Refine and enhance an existing docstring for better semantic search
TASK:
- Review the existing docstring
- Generate a concise summary (1-2 sentences) that captures the core purpose
- Extract 8-12 relevant keywords for search
- Identify the functional category/purpose

EXISTING DOCSTRING:
{doc_metadata.raw_text}

CODE CONTEXT:
Function: {symbol.name}
```{file_data.language}
{self._get_symbol_code(file_data.content, symbol)}
```

OUTPUT: JSON format
{{
    "summary": "refined summary based on docstring and code",
    "keywords": ["keyword1", "keyword2", ...],
    "purpose": "category"
}}
"""

        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
        if response['success']:
            # Fix: json.loads on raw LLM output can raise; treat malformed
            # output as a failure instead of crashing the whole pipeline.
            try:
                data = json.loads(self.llm_enhancer._extract_json(response['stdout']))
            except (ValueError, TypeError):
                data = None
            if data is not None:
                return SemanticMetadata(
                    summary=data.get('summary', doc_metadata.summary),
                    keywords=data.get('keywords', []),
                    purpose=data.get('purpose', 'unknown'),
                    file_path=file_data.path,
                    symbol_name=symbol.name,
                    llm_tool="hybrid_llm_refined",
                )

        # Fallback: fall back to the docstring-primary strategy.
        return self._use_docstring_with_llm_keywords(symbol, doc_metadata)

    def _full_llm_generation(
        self,
        file_data: FileData,
        symbol: Symbol
    ) -> SemanticMetadata:
        """Strategy 3: generate all metadata with the LLM (pre-existing flow)."""

        # Reuse the existing LLM enhancer on just this symbol's code.
        code_snippet = self._get_symbol_code(file_data.content, symbol)

        results = self.llm_enhancer.enhance_files([
            FileData(
                path=f"{file_data.path}:{symbol.name}",
                content=code_snippet,
                language=file_data.language
            )
        ])

        return results.get(f"{file_data.path}:{symbol.name}", SemanticMetadata(
            summary="",
            keywords=[],
            purpose="unknown",
            file_path=file_data.path,
            symbol_name=symbol.name,
            llm_tool="hybrid_llm_full",
        ))

    def _generate_keywords_only(self, summary: str, symbol_name: str) -> List[str]:
        """Cheap LLM call that only produces search keywords."""

        prompt = f"""
PURPOSE: Generate search keywords for a code function
TASK: Extract 5-8 relevant keywords from the summary

Summary: {summary}
Function Name: {symbol_name}

OUTPUT: Comma-separated keywords
"""

        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
        if response['success']:
            keywords_str = response['stdout'].strip()
            # Fix: drop empty fragments so a blank reply cannot yield [''].
            keywords = [k.strip() for k in keywords_str.split(',') if k.strip()]
            if keywords:
                return keywords

        # Fallback: heuristic extraction, no LLM needed.
        return self._extract_keywords_heuristic(summary)

    def _extract_keywords_heuristic(self, text: str) -> List[str]:
        """Heuristic keyword extraction (no LLM): lowercase words of length >= 4."""

        import re
        words = re.findall(r'\b[a-z]{4,}\b', text.lower())

        # Drop common stopwords.
        stopwords = {'this', 'that', 'with', 'from', 'have', 'will', 'your', 'their'}
        keywords = [w for w in words if w not in stopwords]

        # NOTE: set() makes the result order non-deterministic.
        return list(set(keywords))[:8]

    def _infer_purpose_from_docstring(self, doc_metadata: DocstringMetadata) -> str:
        """Infer a purpose category from the docstring summary (no LLM).

        Fix: summary may be None (DocstringMetadata.summary defaults to None),
        which previously raised AttributeError on .lower().
        """

        summary = (doc_metadata.summary or "").lower()

        # Simple keyword rules; first match wins.
        if 'authenticate' in summary or 'login' in summary:
            return 'auth'
        elif 'validate' in summary or 'check' in summary:
            return 'validation'
        elif 'parse' in summary or 'format' in summary:
            return 'data_processing'
        elif 'api' in summary or 'endpoint' in summary:
            return 'api'
        elif 'database' in summary or 'query' in summary:
            return 'data'
        elif 'test' in summary:
            return 'test'

        return 'util'

    def _get_symbol_code(self, content: str, symbol: Symbol) -> str:
        """Return the source lines spanned by symbol.range (1-based, inclusive)."""

        lines = content.splitlines()
        start, end = symbol.range
        return '\n'.join(lines[start-1:end])
|
||||
```
|
||||
|
||||
### 3.3 成本优化统计
|
||||
|
||||
```python
|
||||
@dataclass
class EnhancementStats:
    """Counters describing which enhancement strategy handled each symbol."""

    total_symbols: int = 0
    used_docstring_only: int = 0     # docstring used verbatim as the summary
    llm_keywords_only: int = 0       # LLM invoked only to generate keywords
    llm_refined: int = 0             # LLM refined an existing docstring
    llm_full_generation: int = 0     # LLM generated all metadata from code
    total_llm_calls: int = 0
    estimated_cost_savings: float = 0.0  # savings vs an all-LLM baseline
|
||||
|
||||
class CostOptimizedEnhancer(HybridEnhancer):
    """HybridEnhancer that tracks per-strategy usage and estimated cost savings."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = EnhancementStats()

    def enhance_with_strategy(
        self,
        file_data: FileData,
        symbols: List[Symbol]
    ) -> Dict[str, SemanticMetadata]:
        """Enhance symbols via the parent strategy and update self.stats.

        Counters are cumulative across calls; estimated_cost_savings is
        recomputed from the cumulative totals each time.
        """

        self.stats.total_symbols += len(symbols)
        results = super().enhance_with_strategy(file_data, symbols)

        # Tally which strategy produced each result (llm_tool is the marker
        # written by the corresponding HybridEnhancer strategy method).
        for metadata in results.values():
            if metadata.llm_tool == "hybrid_docstring_primary":
                self.stats.used_docstring_only += 1
                self.stats.llm_keywords_only += 1
                self.stats.total_llm_calls += 1
            elif metadata.llm_tool == "hybrid_llm_refined":
                self.stats.llm_refined += 1
                self.stats.total_llm_calls += 1
            elif metadata.llm_tool == "hybrid_llm_full":
                self.stats.llm_full_generation += 1
                self.stats.total_llm_calls += 1

        # Estimated savings vs an all-LLM baseline, as a fraction of baseline
        # cost. Assumed relative call costs (design doc section 7.1):
        # keywords-only = 20% (saves 0.8), refine = 60% (saves 0.4),
        # full generation = 100% (saves nothing).
        # Fix: the previous formula divided the saved units by the number of
        # full-generation calls, which is not a fraction of the baseline and
        # could exceed 1.0 or divide by a misleading denominator.
        if self.stats.total_symbols > 0:
            saved_units = (
                self.stats.llm_keywords_only * 0.8
                + self.stats.llm_refined * 0.4
            )
            self.stats.estimated_cost_savings = saved_units / self.stats.total_symbols
        else:
            self.stats.estimated_cost_savings = 0.0

        return results

    def print_stats(self):
        """Print a human-readable summary of the collected statistics."""

        total = self.stats.total_symbols
        print("=== Enhancement Statistics ===")
        print(f"Total Symbols: {total}")
        if total == 0:
            # Fix: the percentage lines below divided by zero when no
            # symbols had been processed yet.
            return
        print(f"Used Docstring (with LLM keywords): {self.stats.used_docstring_only} ({self.stats.used_docstring_only/total*100:.1f}%)")
        print(f"LLM Refined Docstring: {self.stats.llm_refined} ({self.stats.llm_refined/total*100:.1f}%)")
        print(f"LLM Full Generation: {self.stats.llm_full_generation} ({self.stats.llm_full_generation/total*100:.1f}%)")
        print(f"Total LLM Calls: {self.stats.total_llm_calls}")
        print(f"Estimated Cost Savings: {self.stats.estimated_cost_savings*100:.1f}%")
|
||||
```
|
||||
|
||||
## 4. 配置选项
|
||||
|
||||
```python
|
||||
@dataclass
class HybridEnhancementConfig:
    """Configuration for the hybrid docstring/LLM enhancement pipeline."""

    # Master switch; when False the pipeline falls back to the all-LLM mode.
    enable_hybrid: bool = True

    # Quality thresholds controlling strategy selection.
    use_docstring_threshold: DocstringQuality = DocstringQuality.HIGH
    refine_docstring_threshold: DocstringQuality = DocstringQuality.MEDIUM

    # Whether to ask the LLM for keywords even when the docstring is used as-is.
    generate_keywords_for_docstring: bool = True

    # LLM backend options.
    llm_tool: str = "gemini"
    llm_timeout: int = 300000  # milliseconds (matches timeout_ms in the YAML config)

    # Cost optimization.
    batch_size: int = 5          # symbols per batch
    skip_test_files: bool = True  # test files usually carry few docstrings

    # Debug options.
    log_strategy_decisions: bool = False  # log which strategy was chosen per symbol
|
||||
```
|
||||
|
||||
## 5. 测试策略
|
||||
|
||||
### 5.1 单元测试
|
||||
|
||||
```python
|
||||
import pytest


class TestDocstringExtractor:
    """Unit tests for docstring extraction and quality rating."""

    def test_extract_google_style(self):
        """A detailed Google-style docstring is extracted and rated HIGH."""
        code = '''
def calculate_total(items, discount=0):
    """Calculate total price with optional discount.

    This function processes a list of items and applies
    a discount if specified.

    Args:
        items (list): List of item objects with price attribute.
        discount (float): Discount percentage (0-1). Defaults to 0.

    Returns:
        float: Total price after discount.

    Examples:
        >>> calculate_total([item1, item2], discount=0.1)
        90.0
    """
    total = sum(item.price for item in items)
    return total * (1 - discount)
'''
        extractor = DocstringExtractor()
        symbol = Symbol(name='calculate_total', kind='function', range=(1, 18))
        docstring = extractor.extract_from_code(code, symbol)

        assert docstring is not None
        metadata = extractor.parse_docstring(docstring)

        assert metadata.quality == DocstringQuality.HIGH
        assert 'Calculate total price' in metadata.summary
        assert metadata.parameters is not None
        assert 'items' in metadata.parameters
        assert metadata.returns is not None
        assert metadata.examples is not None

    def test_extract_low_quality_docstring(self):
        """A bare placeholder docstring is rated LOW."""
        code = '''
def process():
    """TODO"""
    pass
'''
        extractor = DocstringExtractor()
        symbol = Symbol(name='process', kind='function', range=(1, 3))
        docstring = extractor.extract_from_code(code, symbol)

        metadata = extractor.parse_docstring(docstring)
        assert metadata.quality == DocstringQuality.LOW


class TestHybridEnhancer:
    """Unit tests for the hybrid enhancement strategies."""

    def test_high_quality_docstring_strategy(self):
        """HIGH-quality docstrings are used verbatim as the summary."""

        extractor = DocstringExtractor()
        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
        hybrid = HybridEnhancer(llm_enhancer, extractor)

        # Simulate a high-quality docstring.
        doc_metadata = DocstringMetadata(
            raw_text="Validate user credentials against database.",
            quality=DocstringQuality.HIGH,
            summary="Validate user credentials against database."
        )

        symbol = Symbol(name='validate_user', kind='function', range=(1, 10))

        result = hybrid._use_docstring_with_llm_keywords(symbol, doc_metadata)

        # The docstring summary should be used as-is.
        assert result.summary == doc_metadata.summary
        # Keywords should exist (LLM-generated or heuristic fallback).
        assert len(result.keywords) > 0

    def test_cost_optimization(self):
        """Placeholder for cost-optimization accounting tests."""

        enhancer = CostOptimizedEnhancer(
            llm_enhancer=LLMEnhancer(LLMConfig(enabled=False)),  # Mock
            docstring_extractor=DocstringExtractor()
        )

        # Scenario: 10 symbols, 5 of them with high-quality docstrings.
        # Expectation: 5 keywords-only calls, 5 full LLM generations --
        # 10 calls total, but lower cost (keywords calls are cheaper).

        # A real test needs the LLM calls mocked out.
        pass
|
||||
```
|
||||
|
||||
### 5.2 集成测试
|
||||
|
||||
```python
|
||||
class TestHybridEnhancementPipeline:
    """Integration tests for the complete hybrid enhancement flow."""

    def test_full_pipeline(self):
        """Full flow: code -> docstring extraction -> quality rating -> strategy -> enhancement."""

        code = '''
def authenticate_user(username, password):
    """Authenticate user with username and password.

    Args:
        username (str): User's username
        password (str): User's password

    Returns:
        bool: True if authenticated, False otherwise
    """
    # ... implementation
    pass


def helper_func(x):
    # No docstring
    return x * 2
'''

        file_data = FileData(path='auth.py', content=code, language='python')
        symbols = [
            Symbol(name='authenticate_user', kind='function', range=(1, 11)),
            Symbol(name='helper_func', kind='function', range=(13, 15)),
        ]

        extractor = DocstringExtractor()
        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
        hybrid = CostOptimizedEnhancer(llm_enhancer, extractor)

        results = hybrid.enhance_with_strategy(file_data, symbols)

        # authenticate_user should use its docstring.
        assert results['authenticate_user'].llm_tool == "hybrid_docstring_primary"

        # helper_func should be fully LLM-generated.
        assert results['helper_func'].llm_tool == "hybrid_llm_full"

        # Statistics.
        assert hybrid.stats.total_symbols == 2
        assert hybrid.stats.used_docstring_only >= 1
        assert hybrid.stats.llm_full_generation >= 1
|
||||
```
|
||||
|
||||
## 6. 实施路线图
|
||||
|
||||
### Phase 1: 基础设施(1周)
|
||||
- [x] 设计数据结构(DocstringMetadata, DocstringQuality)
|
||||
- [ ] 实现DocstringExtractor(提取和解析)
|
||||
- [ ] 支持Python docstring(Google/NumPy/reStructuredText风格)
|
||||
- [ ] 单元测试
|
||||
|
||||
### Phase 2: 质量评估(1周)
|
||||
- [ ] 实现质量评估算法
|
||||
- [ ] 启发式规则优化
|
||||
- [ ] 测试不同质量的docstring
|
||||
- [ ] 调整阈值参数
|
||||
|
||||
### Phase 3: 混合策略(1-2周)
|
||||
- [ ] 实现HybridEnhancer
|
||||
- [ ] 三种策略实现(docstring-only, refine, full-llm)
|
||||
- [ ] 策略选择逻辑
|
||||
- [ ] 集成测试
|
||||
|
||||
### Phase 4: 成本优化(1周)
|
||||
- [ ] 实现CostOptimizedEnhancer
|
||||
- [ ] 统计和监控
|
||||
- [ ] 批量处理优化
|
||||
- [ ] 性能测试
|
||||
|
||||
### Phase 5: 多语言支持(1-2周)
|
||||
- [ ] JavaScript/TypeScript JSDoc
|
||||
- [ ] Java Javadoc
|
||||
- [ ] 其他语言docstring格式
|
||||
|
||||
### Phase 6: 集成与部署(1周)
|
||||
- [ ] 集成到现有llm_enhancer
|
||||
- [ ] CLI选项暴露
|
||||
- [ ] 配置文件支持
|
||||
- [ ] 文档和示例
|
||||
|
||||
**总计预估时间**:6-8周
|
||||
|
||||
## 7. 性能与成本分析
|
||||
|
||||
### 7.1 预期成本节省
|
||||
|
||||
假设场景:分析1000个函数
|
||||
|
||||
| Docstring质量分布 | 占比 | LLM调用策略 | 相对成本 |
|
||||
|------------------|------|------------|---------|
|
||||
| High (有详细docstring) | 30% | 只生成keywords | 20% |
|
||||
| Medium (有基本docstring) | 40% | 精炼增强 | 60% |
|
||||
| Low/Missing | 30% | 完全生成 | 100% |
|
||||
|
||||
**总成本计算**:
|
||||
- 纯LLM模式:1000 * 100% = 1000 units
|
||||
- 混合模式:300*20% + 400*60% + 300*100% = 60 + 240 + 300 = 600 units
|
||||
- **节省**:40%
|
||||
|
||||
### 7.2 质量对比
|
||||
|
||||
| 指标 | 纯LLM模式 | 混合模式 |
|
||||
|------|----------|---------|
|
||||
| 准确性 | 中(可能有幻觉) | **高**(docstring权威) |
|
||||
| 一致性 | 中(依赖prompt) | **高**(保留作者风格) |
|
||||
| 覆盖率 | **高**(全覆盖) | 高(98%+) |
|
||||
| 成本 | 高 | **低**(节省40%) |
|
||||
| 速度 | 慢(所有文件) | **快**(减少LLM调用) |
|
||||
|
||||
## 8. 潜在问题与解决方案
|
||||
|
||||
### 8.1 问题:Docstring过时
|
||||
|
||||
**现象**:代码已修改,但docstring未更新,导致信息不准确。
|
||||
|
||||
**解决方案**:
|
||||
```python
|
||||
class DocstringFreshnessChecker:
    """Check whether a docstring is still consistent with its code."""

    def check_freshness(
        self,
        symbol: Symbol,
        code: str,
        doc_metadata: DocstringMetadata
    ) -> bool:
        """Return True when the docstring appears to match the code.

        Currently only compares the documented parameter names against the
        actual signature; an LLM-based semantic check is still a TODO.
        """

        # Check 1: do the documented parameter names match the signature?
        if doc_metadata.parameters:
            # NOTE(review): assumes _extract_actual_parameters (not defined in
            # this document) returns a set of names -- confirm when implemented.
            actual_params = self._extract_actual_parameters(code)
            documented_params = set(doc_metadata.parameters.keys())

            if actual_params != documented_params:
                # NOTE(review): assumes a module-level `logger` is in scope.
                logger.warning(
                    f"Parameter mismatch in {symbol.name}: "
                    f"code has {actual_params}, doc has {documented_params}"
                )
                return False

        # Check 2: use an LLM to verify semantic consistency.
        # TODO: build the verification prompt

        return True
|
||||
```
|
||||
|
||||
### 8.2 问题:不同docstring风格混用
|
||||
|
||||
**现象**:同一项目中使用多种docstring风格(Google, NumPy, 自定义)。
|
||||
|
||||
**解决方案**:
|
||||
```python
|
||||
class MultiStyleDocstringParser:
    """Parser that auto-detects among several docstring styles."""

    def parse(self, docstring: str) -> DocstringMetadata:
        """Try each style-specific parser in turn and return the first good result.

        A result rated better than LOW wins; a parser raising an exception is
        treated as "wrong style, try the next one".
        """

        # Try each parser in order of expected prevalence.
        for parser in [
            GoogleStyleParser(),
            NumpyStyleParser(),
            ReStructuredTextParser(),
            SimpleParser(),  # Fallback
        ]:
            try:
                metadata = parser.parse(docstring)
                if metadata.quality != DocstringQuality.LOW:
                    return metadata
            except Exception:
                continue

        # Every parser failed or rated LOW: return the simple parse result.
        return SimpleParser().parse(docstring)
|
||||
```
|
||||
|
||||
### 8.3 问题:多语言docstring提取差异
|
||||
|
||||
**现象**:不同语言的docstring格式和位置不同。
|
||||
|
||||
**解决方案**:
|
||||
```python
|
||||
class LanguageSpecificExtractor:
    """Dispatch docstring extraction to a language-specific extractor."""

    def extract(self, language: str, code: str, symbol: Symbol) -> Optional[str]:
        """Extract the doc comment for *symbol* using the extractor for *language*."""

        extractors = {
            'python': PythonDocstringExtractor(),
            'javascript': JSDocExtractor(),
            'typescript': TSDocExtractor(),
            'java': JavadocExtractor(),
        }

        # Unknown languages fall back to a generic extractor.
        extractor = extractors.get(language, GenericExtractor())
        return extractor.extract(code, symbol)
|
||||
|
||||
class JSDocExtractor:
    """Extract JSDoc block comments for JavaScript/TypeScript symbols."""

    def extract(self, code: str, symbol: Symbol) -> Optional[str]:
        """Return the /** ... */ comment directly above *symbol*, or None."""

        lines = code.splitlines()
        start_line = symbol.range[0] - 1  # symbol.range is 1-based

        # Scan upwards (at most 20 lines) for the closing */ of a JSDoc block.
        for i in range(start_line - 1, max(0, start_line - 20), -1):
            if '*/' in lines[i]:
                # Found the end marker; extract the block above it.
                # NOTE(review): _extract_jsdoc_block is not defined in this
                # document -- to be implemented alongside this class.
                return self._extract_jsdoc_block(lines, i)

        return None
|
||||
```
|
||||
|
||||
## 9. 配置示例
|
||||
|
||||
### 9.1 配置文件
|
||||
|
||||
```yaml
|
||||
# .codexlens/hybrid_enhancement.yaml
|
||||
|
||||
hybrid_enhancement:
|
||||
enabled: true
|
||||
|
||||
# 质量阈值
|
||||
quality_thresholds:
|
||||
use_docstring: high # high/medium/low
|
||||
refine_docstring: medium
|
||||
|
||||
# LLM选项
|
||||
llm:
|
||||
tool: gemini
|
||||
fallback: qwen
|
||||
timeout_ms: 300000
|
||||
batch_size: 5
|
||||
|
||||
# 成本优化
|
||||
cost_optimization:
|
||||
generate_keywords_for_docstring: true
|
||||
skip_test_files: true
|
||||
skip_private_methods: false
|
||||
|
||||
# 语言支持
|
||||
languages:
|
||||
python:
|
||||
styles: [google, numpy, sphinx]
|
||||
javascript:
|
||||
styles: [jsdoc]
|
||||
java:
|
||||
styles: [javadoc]
|
||||
|
||||
# 监控
|
||||
logging:
|
||||
log_strategy_decisions: false
|
||||
log_cost_savings: true
|
||||
```
|
||||
|
||||
### 9.2 CLI使用
|
||||
|
||||
```bash
|
||||
# 使用混合策略增强
|
||||
codex-lens enhance . --hybrid --tool gemini
|
||||
|
||||
# 查看成本统计
|
||||
codex-lens enhance . --hybrid --show-stats
|
||||
|
||||
# 仅对高质量docstring生成keywords
|
||||
codex-lens enhance . --hybrid --keywords-only
|
||||
|
||||
# 禁用混合模式,回退到纯LLM
|
||||
codex-lens enhance . --no-hybrid --tool gemini
|
||||
```
|
||||
|
||||
## 10. 成功指标
|
||||
|
||||
1. **成本节省**:相比纯LLM模式,降低API调用成本40%+
|
||||
2. **准确性提升**:使用docstring的符号,元数据准确率>95%
|
||||
3. **覆盖率**:98%+的符号有语义元数据(docstring或LLM生成)
|
||||
4. **速度提升**:整体处理速度提升30%+(减少LLM调用)
|
||||
5. **用户满意度**:保留docstring信息,开发者认可度高
|
||||
|
||||
## 11. 参考资料
|
||||
|
||||
- [PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
|
||||
- [Google Python Style Guide - Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
|
||||
- [NumPy Docstring Standard](https://numpydoc.readthedocs.io/en/latest/format.html)
|
||||
- [JSDoc Documentation](https://jsdoc.app/)
|
||||
- [Javadoc Tool](https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html)
|
||||
Reference in New Issue
Block a user