Remove LLM enhancement features and related components as per user request. This includes deleting the source files, CLI commands, front-end components, tests, scripts, and documentation associated with LLM functionality. Dependencies are simplified and complexity reduced while core vector search capabilities are retained. Validation confirmed the removal is complete and the remaining functionality still works.

This commit is contained in:
catlog22
2025-12-16 21:38:27 +08:00
parent d21066c282
commit b702791c2c
21 changed files with 375 additions and 7193 deletions


@@ -1,316 +0,0 @@
# CLI Integration Summary - Embedding Management
**Date**: 2025-12-16
**Version**: v0.5.1
**Status**: ✅ Complete
---
## Overview
Completed integration of embedding management commands into the CodexLens CLI, making vector search functionality more accessible and user-friendly. Users no longer need to run standalone scripts - all embedding operations are now available through simple CLI commands.
## What Changed
### 1. New CLI Commands
#### `codexlens embeddings-generate`
**Purpose**: Generate semantic embeddings for code search
**Features**:
- Accepts project directory or direct `_index.db` path
- Auto-finds index for project paths using registry
- Supports 4 model profiles (fast, code, multilingual, balanced)
- Force regeneration with `--force` flag
- Configurable chunk size
- Verbose mode with progress updates
- JSON output mode for scripting
**Examples**:
```bash
# Generate embeddings for a project
codexlens embeddings-generate ~/projects/my-app
# Use specific model
codexlens embeddings-generate ~/projects/my-app --model fast
# Force regeneration
codexlens embeddings-generate ~/projects/my-app --force
# Verbose output
codexlens embeddings-generate ~/projects/my-app -v
```
**Output**:
```
Generating embeddings
Index: ~/.codexlens/indexes/my-app/_index.db
Model: code
✓ Embeddings generated successfully!
Model: jinaai/jina-embeddings-v2-base-code
Chunks created: 1,234
Files processed: 89
Time: 45.2s
Use vector search with:
codexlens search 'your query' --mode pure-vector
```
#### `codexlens embeddings-status`
**Purpose**: Check embedding status for indexes
**Features**:
- Check all indexes (no arguments)
- Check specific project or index
- Summary table view
- File coverage statistics
- Missing files detection
- JSON output mode
**Examples**:
```bash
# Check all indexes
codexlens embeddings-status
# Check specific project
codexlens embeddings-status ~/projects/my-app
# Check specific index
codexlens embeddings-status ~/.codexlens/indexes/my-app/_index.db
```
**Output (all indexes)**:
```
Embedding Status Summary
Index root: ~/.codexlens/indexes
Total indexes: 5
Indexes with embeddings: 3/5
Total chunks: 4,567
Project Files Chunks Coverage Status
my-app 89 1,234 100.0% ✓
other-app 145 2,456 95.5% ✓
test-proj 23 877 100.0% ✓
no-emb 67 0 0.0% —
legacy 45 0 0.0% —
```
**Output (specific project)**:
```
Embedding Status
Index: ~/.codexlens/indexes/my-app/_index.db
✓ Embeddings available
Total chunks: 1,234
Total files: 89
Files with embeddings: 89/89
Coverage: 100.0%
```
### 2. Improved Error Messages
Enhanced error messages throughout the search pipeline to guide users to the new CLI commands:
**Before**:
```
DEBUG: No semantic_chunks table found
DEBUG: Vector store is empty
```
**After**:
```
INFO: No embeddings found in index. Generate embeddings with: codexlens embeddings-generate ~/projects/my-app
WARNING: Pure vector search returned no results. This usually means embeddings haven't been generated. Run: codexlens embeddings-generate ~/projects/my-app
```
**Locations Updated**:
- `src/codexlens/search/hybrid_search.py` - Added helpful info messages
- `src/codexlens/cli/commands.py` - Improved error hints in CLI output
### 3. Backend Infrastructure
Created `src/codexlens/cli/embedding_manager.py` with reusable functions:
**Functions**:
- `check_index_embeddings(index_path)` - Check embedding status
- `generate_embeddings(index_path, ...)` - Generate embeddings
- `find_all_indexes(scan_dir)` - Find all indexes in directory
- `get_embedding_stats_summary(index_root)` - Aggregate stats for all indexes
**Architecture**:
- Follows same pattern as `model_manager.py` for consistency
- Returns standardized result dictionaries `{"success": bool, "result": dict}`
- Supports progress callbacks for UI updates
- Handles all error cases gracefully
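To make the result-dictionary pattern above concrete, here is a minimal sketch of what a function following it might look like. This is illustrative only, not the actual `embedding_manager.py` code: the exact keys inside `result` and the `progress_callback` signature are assumptions; the `semantic_chunks` table name comes from the search pipeline messages quoted earlier.
```python
import sqlite3
from pathlib import Path
from typing import Callable, Optional

def check_index_embeddings(index_path: Path,
                           progress_callback: Optional[Callable[[str], None]] = None) -> dict:
    """Return a standardized {"success": bool, ...} dict for one _index.db (illustrative)."""
    if not index_path.exists():
        # Errors are reported in-band so the CLI layer can print a helpful hint
        return {"success": False, "error": f"Index not found: {index_path}"}
    if progress_callback:
        progress_callback(f"Inspecting {index_path}")
    with sqlite3.connect(index_path) as conn:
        tables = {row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")}
        chunks = 0
        if "semantic_chunks" in tables:
            chunks = conn.execute("SELECT COUNT(*) FROM semantic_chunks").fetchone()[0]
    return {"success": True, "result": {"has_embeddings": chunks > 0, "total_chunks": chunks}}
```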
### 4. Documentation Updates
Updated user-facing documentation to reference new CLI commands:
**Files Updated**:
1. `docs/PURE_VECTOR_SEARCH_GUIDE.md`
- Changed all references from `python scripts/generate_embeddings.py` to `codexlens embeddings-generate`
- Updated troubleshooting section
- Added new `embeddings-status` examples
2. `docs/IMPLEMENTATION_SUMMARY.md`
- Marked P1 priorities as complete
- Added CLI integration to checklist
- Updated feature list
3. `src/codexlens/cli/commands.py`
- Updated search command help text to reference new commands
## Files Created
| File | Purpose | Lines |
|------|---------|-------|
| `src/codexlens/cli/embedding_manager.py` | Backend logic for embedding operations | ~290 |
| `docs/CLI_INTEGRATION_SUMMARY.md` | This document | ~400 |
## Files Modified
| File | Changes |
|------|---------|
| `src/codexlens/cli/commands.py` | Added 2 new commands (~270 lines) |
| `src/codexlens/search/hybrid_search.py` | Improved error messages (~20 lines) |
| `docs/PURE_VECTOR_SEARCH_GUIDE.md` | Updated CLI references (~10 changes) |
| `docs/IMPLEMENTATION_SUMMARY.md` | Marked P1 complete (~10 lines) |
## Testing Workflow
### Manual Testing Checklist
- [ ] `codexlens embeddings-status` with no indexes
- [ ] `codexlens embeddings-status` with multiple indexes
- [ ] `codexlens embeddings-status ~/projects/my-app` (project path)
- [ ] `codexlens embeddings-status ~/.codexlens/indexes/my-app/_index.db` (direct path)
- [ ] `codexlens embeddings-generate ~/projects/my-app` (first time)
- [ ] `codexlens embeddings-generate ~/projects/my-app` (already exists, should error)
- [ ] `codexlens embeddings-generate ~/projects/my-app --force` (regenerate)
- [ ] `codexlens embeddings-generate ~/projects/my-app --model fast`
- [ ] `codexlens embeddings-generate ~/projects/my-app -v` (verbose output)
- [ ] `codexlens search "query" --mode pure-vector` (with embeddings)
- [ ] `codexlens search "query" --mode pure-vector` (without embeddings, check error message)
- [ ] `codexlens embeddings-status --json` (JSON output)
- [ ] `codexlens embeddings-generate ~/projects/my-app --json` (JSON output)
### Expected Test Results
**Without embeddings**:
```bash
$ codexlens embeddings-status ~/projects/my-app
Embedding Status
Index: ~/.codexlens/indexes/my-app/_index.db
— No embeddings found
Total files indexed: 89
Generate embeddings with:
codexlens embeddings-generate ~/projects/my-app
```
**After generating embeddings**:
```bash
$ codexlens embeddings-generate ~/projects/my-app
Generating embeddings
Index: ~/.codexlens/indexes/my-app/_index.db
Model: code
✓ Embeddings generated successfully!
Model: jinaai/jina-embeddings-v2-base-code
Chunks created: 1,234
Files processed: 89
Time: 45.2s
```
**Status after generation**:
```bash
$ codexlens embeddings-status ~/projects/my-app
Embedding Status
Index: ~/.codexlens/indexes/my-app/_index.db
✓ Embeddings available
Total chunks: 1,234
Total files: 89
Files with embeddings: 89/89
Coverage: 100.0%
```
**Pure vector search**:
```bash
$ codexlens search "how to authenticate users" --mode pure-vector
Found 5 results in 12.3ms:
auth/authentication.py:42 [0.876]
def authenticate_user(username: str, password: str) -> bool:
'''Verify user credentials against database.'''
return check_password(username, password)
...
```
## User Experience Improvements
| Before | After |
|--------|-------|
| Run separate Python script | Single CLI command |
| Manual path resolution | Auto-finds project index |
| No status check | `embeddings-status` command |
| Generic error messages | Helpful hints with commands |
| Script-level documentation | Integrated `--help` text |
## Backward Compatibility
- ✅ Standalone script `scripts/generate_embeddings.py` still works
- ✅ All existing search modes unchanged
- ✅ Pure vector implementation backward compatible
- ✅ No breaking changes to APIs
## Next Steps (Optional)
Future enhancements users might want:
1. **Batch operations**:
```bash
codexlens embeddings-generate --all # Generate for all indexes
```
2. **Incremental updates**:
```bash
codexlens embeddings-update ~/projects/my-app # Only changed files
```
3. **Embedding cleanup**:
```bash
codexlens embeddings-delete ~/projects/my-app # Remove embeddings
```
4. **Model management integration**:
```bash
codexlens embeddings-generate ~/projects/my-app --download-model
```
---
## Summary
✅ **Completed**: Full CLI integration for embedding management
✅ **User Experience**: Simplified from multi-step script to single command
✅ **Error Handling**: Helpful messages guide users to correct commands
✅ **Documentation**: All references updated to new CLI commands
✅ **Testing**: Manual testing checklist prepared
**Impact**: Users can now manage embeddings with intuitive CLI commands instead of running scripts, making vector search more accessible and easier to use.
**Command Summary**:
```bash
codexlens embeddings-status [path] # Check status
codexlens embeddings-generate <path> [--model] [--force] # Generate
codexlens search "query" --mode pure-vector # Use vector search
```
The integration is **complete and ready for testing**.


@@ -1,972 +0,0 @@
# Docstring + LLM Hybrid Strategy Design
## 1. Background and Goals
### 1.1 Current Problems
The current `llm_enhancer.py` implementation has the following problems:
1. **Ignores existing documentation**: it calls the LLM indiscriminately for all code, even when a high-quality docstring already exists
2. **Wasted cost**: it regenerates information that already exists, increasing API cost and time
3. **Inconsistent quality**: LLM-generated content may be less accurate than the author-written docstring
4. **Loss of author intent**: design decisions, usage examples, and other key information in the docstring are discarded
### 1.2 Design Goals
Implement a **smart hybrid strategy** that combines the strengths of docstrings and the LLM:
1. **Prefer docstrings**: treat them as the most authoritative information source
2. **Use the LLM as a supplement**: fill in where docstrings are missing or low quality
3. **Automatic quality assessment**: judge docstring quality automatically and decide whether LLM enhancement is needed
4. **Cost optimization**: reduce unnecessary LLM calls and lower API cost
5. **Information fusion**: merge docstring content and LLM-generated content coherently
## 2. Technical Architecture
### 2.1 Overall Flow
```
Code Symbol
[Docstring Extractor] ← extract the docstring
[Quality Evaluator] ← assess docstring quality
├─ High Quality → Use Docstring Directly
│ + LLM Generate Keywords Only
├─ Medium Quality → LLM Refine & Enhance
│ (docstring as the base)
└─ Low/No Docstring → LLM Full Generation
(existing pipeline)
[Metadata Merger] ← merge docstring and LLM content
Final SemanticMetadata
```
### 2.2 Core Components
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class DocstringQuality(Enum):
"""Docstring quality levels."""
MISSING = "missing" # no docstring
LOW = "low" # low quality: <10 characters or a pure placeholder
MEDIUM = "medium" # medium quality: basic description but incomplete
HIGH = "high" # high quality: detailed and structured
@dataclass
class DocstringMetadata:
"""Metadata extracted from a docstring."""
raw_text: str
quality: DocstringQuality
summary: Optional[str] = None # extracted summary
parameters: Optional[dict] = None # parameter descriptions
returns: Optional[str] = None # return-value description
examples: Optional[str] = None # usage examples
notes: Optional[str] = None # notes and caveats
```
## 3. 详细实现步骤
### 3.1 Docstring提取与解析
```python
import re
from typing import Optional
class DocstringExtractor:
"""Docstring提取器"""
# Docstring风格正则
GOOGLE_STYLE_PATTERN = re.compile(
r'Args:|Returns:|Raises:|Examples:|Note:',
re.MULTILINE
)
NUMPY_STYLE_PATTERN = re.compile(
r'Parameters\n-+|Returns\n-+|Examples\n-+',
re.MULTILINE
)
def extract_from_code(self, content: str, symbol: Symbol) -> Optional[str]:
"""从代码中提取docstring"""
lines = content.splitlines()
start_line = symbol.range[0] - 1 # 0-indexed
# Look for the first string literal after the function definition,
# usually on the next line or within a few lines
for i in range(start_line + 1, min(start_line + 10, len(lines))):
line = lines[i].strip()
# Python triple-quoted string
if line.startswith('"""') or line.startswith("'''"):
return self._extract_multiline_docstring(lines, i)
return None
def _extract_multiline_docstring(
self,
lines: List[str],
start_idx: int
) -> str:
"""提取多行docstring"""
quote_char = '"""' if lines[start_idx].strip().startswith('"""') else "'''"
docstring_lines = []
# Check for a single-line docstring
first_line = lines[start_idx].strip()
if first_line.count(quote_char) == 2:
# Single line: """This is a docstring."""
return first_line.strip(quote_char).strip()
# Multi-line docstring
in_docstring = True
for i in range(start_idx, len(lines)):
line = lines[i]
if i == start_idx:
# First line: strip the opening quotes
docstring_lines.append(line.strip().lstrip(quote_char))
elif quote_char in line:
# Closing line: strip the closing quotes
docstring_lines.append(line.strip().rstrip(quote_char))
break
else:
docstring_lines.append(line.strip())
return '\n'.join(docstring_lines).strip()
def parse_docstring(self, raw_docstring: str) -> DocstringMetadata:
"""解析docstring提取结构化信息"""
if not raw_docstring:
return DocstringMetadata(
raw_text="",
quality=DocstringQuality.MISSING
)
# Assess quality
quality = self._evaluate_quality(raw_docstring)
# Extract the individual sections
metadata = DocstringMetadata(
raw_text=raw_docstring,
quality=quality,
)
# Extract the summary (first line or first paragraph)
metadata.summary = self._extract_summary(raw_docstring)
# For Google or NumPy style, extract the structured content
if self.GOOGLE_STYLE_PATTERN.search(raw_docstring):
self._parse_google_style(raw_docstring, metadata)
elif self.NUMPY_STYLE_PATTERN.search(raw_docstring):
self._parse_numpy_style(raw_docstring, metadata)
return metadata
def _evaluate_quality(self, docstring: str) -> DocstringQuality:
"""评估docstring质量"""
if not docstring or len(docstring.strip()) == 0:
return DocstringQuality.MISSING
# Check for placeholders
placeholders = ['todo', 'fixme', 'tbd', 'placeholder', '...']
if any(p in docstring.lower() for p in placeholders):
return DocstringQuality.LOW
# Length check
if len(docstring.strip()) < 10:
return DocstringQuality.LOW
# Check for structured content
has_structure = (
self.GOOGLE_STYLE_PATTERN.search(docstring) or
self.NUMPY_STYLE_PATTERN.search(docstring)
)
# Check for sufficient descriptive text
word_count = len(docstring.split())
if has_structure and word_count >= 20:
return DocstringQuality.HIGH
elif word_count >= 10:
return DocstringQuality.MEDIUM
else:
return DocstringQuality.LOW
def _extract_summary(self, docstring: str) -> str:
"""提取摘要(第一行或第一段)"""
lines = docstring.split('\n')
# Use the first non-empty line as the summary
for line in lines:
if line.strip():
return line.strip()
return ""
def _parse_google_style(self, docstring: str, metadata: DocstringMetadata):
"""解析Google风格docstring"""
# 提取Args
args_match = re.search(r'Args:(.*?)(?=Returns:|Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
if args_match:
metadata.parameters = self._parse_args_section(args_match.group(1))
# Extract Returns
returns_match = re.search(r'Returns:(.*?)(?=Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
if returns_match:
metadata.returns = returns_match.group(1).strip()
# Extract Examples
examples_match = re.search(r'Examples:(.*?)(?=Note:|\Z)', docstring, re.DOTALL)
if examples_match:
metadata.examples = examples_match.group(1).strip()
def _parse_args_section(self, args_text: str) -> dict:
"""解析参数列表"""
params = {}
# Match "param_name (type): description" or "param_name: description"
pattern = re.compile(r'(\w+)\s*(?:\(([^)]+)\))?\s*:\s*(.+)')
for line in args_text.split('\n'):
match = pattern.search(line.strip())
if match:
param_name, param_type, description = match.groups()
params[param_name] = {
'type': param_type,
'description': description.strip()
}
return params
```
### 3.2 Smart Hybrid Strategy Engine
```python
class HybridEnhancer:
"""Docstring与LLM混合增强器"""
def __init__(
self,
llm_enhancer: LLMEnhancer,
docstring_extractor: DocstringExtractor
):
self.llm_enhancer = llm_enhancer
self.docstring_extractor = docstring_extractor
def enhance_with_strategy(
self,
file_data: FileData,
symbols: List[Symbol]
) -> Dict[str, SemanticMetadata]:
"""根据docstring质量选择增强策略"""
results = {}
for symbol in symbols:
# 1. Extract and parse the docstring
raw_docstring = self.docstring_extractor.extract_from_code(
file_data.content, symbol
)
doc_metadata = self.docstring_extractor.parse_docstring(raw_docstring or "")
# 2. Choose a strategy based on quality
semantic_metadata = self._apply_strategy(
file_data, symbol, doc_metadata
)
results[symbol.name] = semantic_metadata
return results
def _apply_strategy(
self,
file_data: FileData,
symbol: Symbol,
doc_metadata: DocstringMetadata
) -> SemanticMetadata:
"""应用混合策略"""
quality = doc_metadata.quality
if quality == DocstringQuality.HIGH:
# High quality: use the docstring directly; only use the LLM for keywords
return self._use_docstring_with_llm_keywords(symbol, doc_metadata)
elif quality == DocstringQuality.MEDIUM:
# Medium quality: let the LLM refine and enhance it
return self._refine_with_llm(file_data, symbol, doc_metadata)
else: # LOW or MISSING
# Low quality or missing: generate everything with the LLM
return self._full_llm_generation(file_data, symbol)
def _use_docstring_with_llm_keywords(
self,
symbol: Symbol,
doc_metadata: DocstringMetadata
) -> SemanticMetadata:
"""策略1使用docstringLLM只生成keywords"""
# 直接使用docstring的摘要
summary = doc_metadata.summary or doc_metadata.raw_text[:200]
# Use the LLM to generate keywords
keywords = self._generate_keywords_only(summary, symbol.name)
# Infer purpose from the docstring
purpose = self._infer_purpose_from_docstring(doc_metadata)
return SemanticMetadata(
summary=summary,
keywords=keywords,
purpose=purpose,
file_path=symbol.file_path if hasattr(symbol, 'file_path') else None,
symbol_name=symbol.name,
llm_tool="hybrid_docstring_primary",
)
def _refine_with_llm(
self,
file_data: FileData,
symbol: Symbol,
doc_metadata: DocstringMetadata
) -> SemanticMetadata:
"""策略2让LLM精炼和增强docstring"""
prompt = f"""
PURPOSE: Refine and enhance an existing docstring for better semantic search
TASK:
- Review the existing docstring
- Generate a concise summary (1-2 sentences) that captures the core purpose
- Extract 8-12 relevant keywords for search
- Identify the functional category/purpose
EXISTING DOCSTRING:
{doc_metadata.raw_text}
CODE CONTEXT:
Function: {symbol.name}
```{file_data.language}
{self._get_symbol_code(file_data.content, symbol)}
```
OUTPUT: JSON format
{{
"summary": "refined summary based on docstring and code",
"keywords": ["keyword1", "keyword2", ...],
"purpose": "category"
}}
"""
response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
if response['success']:
data = json.loads(self.llm_enhancer._extract_json(response['stdout']))
return SemanticMetadata(
summary=data.get('summary', doc_metadata.summary),
keywords=data.get('keywords', []),
purpose=data.get('purpose', 'unknown'),
file_path=file_data.path,
symbol_name=symbol.name,
llm_tool="hybrid_llm_refined",
)
# Fallback: use the docstring
return self._use_docstring_with_llm_keywords(symbol, doc_metadata)
def _full_llm_generation(
self,
file_data: FileData,
symbol: Symbol
) -> SemanticMetadata:
"""策略3完全由LLM生成原有流程"""
# 复用现有的LLM enhancer
code_snippet = self._get_symbol_code(file_data.content, symbol)
results = self.llm_enhancer.enhance_files([
FileData(
path=f"{file_data.path}:{symbol.name}",
content=code_snippet,
language=file_data.language
)
])
return results.get(f"{file_data.path}:{symbol.name}", SemanticMetadata(
summary="",
keywords=[],
purpose="unknown",
file_path=file_data.path,
symbol_name=symbol.name,
llm_tool="hybrid_llm_full",
))
def _generate_keywords_only(self, summary: str, symbol_name: str) -> List[str]:
"""仅生成keywords快速LLM调用"""
prompt = f"""
PURPOSE: Generate search keywords for a code function
TASK: Extract 5-8 relevant keywords from the summary
Summary: {summary}
Function Name: {symbol_name}
OUTPUT: Comma-separated keywords
"""
response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
if response['success']:
keywords_str = response['stdout'].strip()
return [k.strip() for k in keywords_str.split(',')]
# Fallback: extract keywords from the summary
return self._extract_keywords_heuristic(summary)
def _extract_keywords_heuristic(self, text: str) -> List[str]:
"""启发式关键词提取无需LLM"""
# 简单实现:提取名词性词组
import re
words = re.findall(r'\b[a-z]{4,}\b', text.lower())
# Filter out common words
stopwords = {'this', 'that', 'with', 'from', 'have', 'will', 'your', 'their'}
keywords = [w for w in words if w not in stopwords]
return list(set(keywords))[:8]
def _infer_purpose_from_docstring(self, doc_metadata: DocstringMetadata) -> str:
"""从docstring推断purpose无需LLM"""
summary = doc_metadata.summary.lower()
# Simple rule matching
if 'authenticate' in summary or 'login' in summary:
return 'auth'
elif 'validate' in summary or 'check' in summary:
return 'validation'
elif 'parse' in summary or 'format' in summary:
return 'data_processing'
elif 'api' in summary or 'endpoint' in summary:
return 'api'
elif 'database' in summary or 'query' in summary:
return 'data'
elif 'test' in summary:
return 'test'
return 'util'
def _get_symbol_code(self, content: str, symbol: Symbol) -> str:
"""提取符号的代码"""
lines = content.splitlines()
start, end = symbol.range
return '\n'.join(lines[start-1:end])
```
### 3.3 Cost Optimization Statistics
```python
@dataclass
class EnhancementStats:
"""增强统计"""
total_symbols: int = 0
used_docstring_only: int = 0 # docstring used directly
llm_keywords_only: int = 0 # LLM generated keywords only
llm_refined: int = 0 # LLM refined the docstring
llm_full_generation: int = 0 # fully LLM-generated
total_llm_calls: int = 0
estimated_cost_savings: float = 0.0 # cost saved vs. using the LLM for everything
class CostOptimizedEnhancer(HybridEnhancer):
"""带成本统计的增强器"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.stats = EnhancementStats()
def enhance_with_strategy(
self,
file_data: FileData,
symbols: List[Symbol]
) -> Dict[str, SemanticMetadata]:
"""增强并统计成本"""
self.stats.total_symbols += len(symbols)
results = super().enhance_with_strategy(file_data, symbols)
# Track how often each strategy was used
for metadata in results.values():
if metadata.llm_tool == "hybrid_docstring_primary":
self.stats.used_docstring_only += 1
self.stats.llm_keywords_only += 1
self.stats.total_llm_calls += 1
elif metadata.llm_tool == "hybrid_llm_refined":
self.stats.llm_refined += 1
self.stats.total_llm_calls += 1
elif metadata.llm_tool == "hybrid_llm_full":
self.stats.llm_full_generation += 1
self.stats.total_llm_calls += 1
# Estimate cost savings (assume a keywords-only call costs 20% of a full call)
keywords_only_savings = self.stats.llm_keywords_only * 0.8 # saves 80%
full_generation_count = self.stats.total_symbols - self.stats.llm_keywords_only
self.stats.estimated_cost_savings = keywords_only_savings / full_generation_count if full_generation_count > 0 else 0
return results
def print_stats(self):
"""打印统计信息"""
print("=== Enhancement Statistics ===")
print(f"Total Symbols: {self.stats.total_symbols}")
print(f"Used Docstring (with LLM keywords): {self.stats.used_docstring_only} ({self.stats.used_docstring_only/self.stats.total_symbols*100:.1f}%)")
print(f"LLM Refined Docstring: {self.stats.llm_refined} ({self.stats.llm_refined/self.stats.total_symbols*100:.1f}%)")
print(f"LLM Full Generation: {self.stats.llm_full_generation} ({self.stats.llm_full_generation/self.stats.total_symbols*100:.1f}%)")
print(f"Total LLM Calls: {self.stats.total_llm_calls}")
print(f"Estimated Cost Savings: {self.stats.estimated_cost_savings*100:.1f}%")
```
## 4. Configuration Options
```python
@dataclass
class HybridEnhancementConfig:
"""Hybrid enhancement configuration."""
# Whether to enable the hybrid strategy (False falls back to full-LLM mode)
enable_hybrid: bool = True
# Quality threshold configuration
use_docstring_threshold: DocstringQuality = DocstringQuality.HIGH
refine_docstring_threshold: DocstringQuality = DocstringQuality.MEDIUM
# Whether to generate keywords for high-quality docstrings
generate_keywords_for_docstring: bool = True
# LLM configuration
llm_tool: str = "gemini"
llm_timeout: int = 300000
# Cost optimization
batch_size: int = 5 # batch size for processing
skip_test_files: bool = True # skip test files (they usually have fewer docstrings)
# Debug options
log_strategy_decisions: bool = False # log strategy decisions
```
## 5. Testing Strategy
### 5.1 Unit Tests
```python
import pytest
class TestDocstringExtractor:
"""测试docstring提取"""
def test_extract_google_style(self):
"""测试Google风格docstring提取"""
code = '''
def calculate_total(items, discount=0):
"""Calculate total price with optional discount.
This function processes a list of items and applies
a discount if specified.
Args:
items (list): List of item objects with price attribute.
discount (float): Discount percentage (0-1). Defaults to 0.
Returns:
float: Total price after discount.
Examples:
>>> calculate_total([item1, item2], discount=0.1)
90.0
"""
total = sum(item.price for item in items)
return total * (1 - discount)
'''
extractor = DocstringExtractor()
symbol = Symbol(name='calculate_total', kind='function', range=(1, 18))
docstring = extractor.extract_from_code(code, symbol)
assert docstring is not None
metadata = extractor.parse_docstring(docstring)
assert metadata.quality == DocstringQuality.HIGH
assert 'Calculate total price' in metadata.summary
assert metadata.parameters is not None
assert 'items' in metadata.parameters
assert metadata.returns is not None
assert metadata.examples is not None
def test_extract_low_quality_docstring(self):
"""测试低质量docstring识别"""
code = '''
def process():
"""TODO"""
pass
'''
extractor = DocstringExtractor()
symbol = Symbol(name='process', kind='function', range=(1, 3))
docstring = extractor.extract_from_code(code, symbol)
metadata = extractor.parse_docstring(docstring)
assert metadata.quality == DocstringQuality.LOW
class TestHybridEnhancer:
"""测试混合增强器"""
def test_high_quality_docstring_strategy(self):
"""测试高质量docstring使用策略"""
extractor = DocstringExtractor()
llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
hybrid = HybridEnhancer(llm_enhancer, extractor)
# Simulate a high-quality docstring
doc_metadata = DocstringMetadata(
raw_text="Validate user credentials against database.",
quality=DocstringQuality.HIGH,
summary="Validate user credentials against database."
)
symbol = Symbol(name='validate_user', kind='function', range=(1, 10))
result = hybrid._use_docstring_with_llm_keywords(symbol, doc_metadata)
# Should use the docstring's summary
assert result.summary == doc_metadata.summary
# Should have keywords (generated by the LLM or heuristics)
assert len(result.keywords) > 0
def test_cost_optimization(self):
"""测试成本优化效果"""
enhancer = CostOptimizedEnhancer(
llm_enhancer=LLMEnhancer(LLMConfig(enabled=False)), # Mock
docstring_extractor=DocstringExtractor()
)
# Simulate processing 10 symbols, 5 of which have high-quality docstrings
# Expect 5 keywords-only calls and 5 full LLM generations
# 10 calls in total, but lower cost (keyword calls are cheaper)
# A real test would need to mock the LLM calls
pass
```
### 5.2 Integration Tests
```python
class TestHybridEnhancementPipeline:
"""测试完整的混合增强流程"""
def test_full_pipeline(self):
"""测试完整流程:代码 -> docstring提取 -> 质量评估 -> 策略选择 -> 增强"""
code = '''
def authenticate_user(username, password):
"""Authenticate user with username and password.
Args:
username (str): User's username
password (str): User's password
Returns:
bool: True if authenticated, False otherwise
"""
# ... implementation
pass
def helper_func(x):
# No docstring
return x * 2
'''
file_data = FileData(path='auth.py', content=code, language='python')
symbols = [
Symbol(name='authenticate_user', kind='function', range=(1, 11)),
Symbol(name='helper_func', kind='function', range=(13, 15)),
]
extractor = DocstringExtractor()
llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
hybrid = CostOptimizedEnhancer(llm_enhancer, extractor)
results = hybrid.enhance_with_strategy(file_data, symbols)
# authenticate_user should use its docstring
assert results['authenticate_user'].llm_tool == "hybrid_docstring_primary"
# helper_func should be fully LLM-generated
assert results['helper_func'].llm_tool == "hybrid_llm_full"
# Statistics
assert hybrid.stats.total_symbols == 2
assert hybrid.stats.used_docstring_only >= 1
assert hybrid.stats.llm_full_generation >= 1
```
## 6. 实施路线图
### Phase 1: 基础设施1周
- [x] 设计数据结构DocstringMetadata, DocstringQuality
- [ ] 实现DocstringExtractor提取和解析
- [ ] 支持Python docstringGoogle/NumPy/reStructuredText风格
- [ ] 单元测试
### Phase 2: 质量评估1周
- [ ] 实现质量评估算法
- [ ] 启发式规则优化
- [ ] 测试不同质量的docstring
- [ ] 调整阈值参数
### Phase 3: 混合策略1-2周
- [ ] 实现HybridEnhancer
- [ ] 三种策略实现docstring-only, refine, full-llm
- [ ] 策略选择逻辑
- [ ] 集成测试
### Phase 4: 成本优化1周
- [ ] 实现CostOptimizedEnhancer
- [ ] 统计和监控
- [ ] 批量处理优化
- [ ] 性能测试
### Phase 5: 多语言支持1-2周
- [ ] JavaScript/TypeScript JSDoc
- [ ] Java Javadoc
- [ ] 其他语言docstring格式
### Phase 6: 集成与部署1周
- [ ] 集成到现有llm_enhancer
- [ ] CLI选项暴露
- [ ] 配置文件支持
- [ ] 文档和示例
**总计预估时间**6-8周
## 7. Performance and Cost Analysis
### 7.1 Expected Cost Savings
Hypothetical scenario: 1,000 functions.
| Docstring quality distribution | Share | LLM call strategy | Relative cost |
|------------------|------|------------|---------|
| High (detailed docstring) | 30% | Generate keywords only | 20% |
| Medium (basic docstring) | 40% | Refine and enhance | 60% |
| Low/Missing | 30% | Full generation | 100% |
**Total cost calculation**:
- Pure LLM mode: 1000 * 100% = 1000 units
- Hybrid mode: 300*20% + 400*60% + 300*100% = 60 + 240 + 300 = 600 units
- **Savings**: 40%
### 7.2 Quality Comparison
| Metric | Pure LLM mode | Hybrid mode |
|------|----------|---------|
| Accuracy | Medium (may hallucinate) | **High** (docstrings are authoritative) |
| Consistency | Medium (prompt-dependent) | **High** (preserves the author's style) |
| Coverage | **High** (everything covered) | High (98%+) |
| Cost | High | **Low** (saves ~40%) |
| Speed | Slow (all files) | **Fast** (fewer LLM calls) |
## 8. Potential Problems and Solutions
### 8.1 Problem: Stale Docstrings
**Symptom**: the code has been modified but the docstring was not updated, so the information is inaccurate.
**Solution**:
```python
class DocstringFreshnessChecker:
"""检查docstring与代码的一致性"""
def check_freshness(
self,
symbol: Symbol,
code: str,
doc_metadata: DocstringMetadata
) -> bool:
"""检查docstring是否与代码匹配"""
# 检查1: 参数列表是否匹配
if doc_metadata.parameters:
actual_params = self._extract_actual_parameters(code)
documented_params = set(doc_metadata.parameters.keys())
if actual_params != documented_params:
logger.warning(
f"Parameter mismatch in {symbol.name}: "
f"code has {actual_params}, doc has {documented_params}"
)
return False
# Check 2: use the LLM to verify consistency
# TODO: build the verification prompt
return True
```
### 8.2 Problem: Mixed Docstring Styles
**Symptom**: the same project uses several docstring styles (Google, NumPy, custom).
**Solution**:
```python
class MultiStyleDocstringParser:
"""支持多种docstring风格的解析器"""
def parse(self, docstring: str) -> DocstringMetadata:
"""自动检测并解析不同风格"""
# 尝试各种解析器
for parser in [
GoogleStyleParser(),
NumpyStyleParser(),
ReStructuredTextParser(),
SimpleParser(), # Fallback
]:
try:
metadata = parser.parse(docstring)
if metadata.quality != DocstringQuality.LOW:
return metadata
except Exception:
continue
# If all parsers fail, return the result of the simple parser
return SimpleParser().parse(docstring)
```
### 8.3 Problem: Per-Language Docstring Extraction Differences
**Symptom**: docstring formats and locations differ between languages.
**Solution**:
```python
class LanguageSpecificExtractor:
"""语言特定的docstring提取器"""
def extract(self, language: str, code: str, symbol: Symbol) -> Optional[str]:
"""根据语言选择合适的提取器"""
extractors = {
'python': PythonDocstringExtractor(),
'javascript': JSDocExtractor(),
'typescript': TSDocExtractor(),
'java': JavadocExtractor(),
}
extractor = extractors.get(language, GenericExtractor())
return extractor.extract(code, symbol)
class JSDocExtractor:
"""JavaScript/TypeScript JSDoc提取器"""
def extract(self, code: str, symbol: Symbol) -> Optional[str]:
"""提取JSDoc注释"""
lines = code.splitlines()
start_line = symbol.range[0] - 1
# Search upward for a /** ... */ comment
for i in range(start_line - 1, max(0, start_line - 20), -1):
if '*/' in lines[i]:
# Found the closing marker; extract the block above
return self._extract_jsdoc_block(lines, i)
return None
```
## 9. Configuration Examples
### 9.1 Configuration File
```yaml
# .codexlens/hybrid_enhancement.yaml
hybrid_enhancement:
enabled: true
# Quality thresholds
quality_thresholds:
use_docstring: high # high/medium/low
refine_docstring: medium
# LLM options
llm:
tool: gemini
fallback: qwen
timeout_ms: 300000
batch_size: 5
# Cost optimization
cost_optimization:
generate_keywords_for_docstring: true
skip_test_files: true
skip_private_methods: false
# Language support
languages:
python:
styles: [google, numpy, sphinx]
javascript:
styles: [jsdoc]
java:
styles: [javadoc]
# Monitoring
logging:
log_strategy_decisions: false
log_cost_savings: true
```
### 9.2 CLI Usage
```bash
# Enhance using the hybrid strategy
codex-lens enhance . --hybrid --tool gemini
# Show cost statistics
codex-lens enhance . --hybrid --show-stats
# Only generate keywords for high-quality docstrings
codex-lens enhance . --hybrid --keywords-only
# Disable hybrid mode (fall back to pure LLM)
codex-lens enhance . --no-hybrid --tool gemini
```
## 10. Success Metrics
1. **Cost savings**: reduce API call cost by 40%+ compared with pure-LLM mode
2. **Accuracy**: >95% metadata accuracy for symbols that use docstrings
3. **Coverage**: 98%+ of symbols have semantic metadata (docstring- or LLM-generated)
4. **Speed**: 30%+ faster overall processing (fewer LLM calls)
5. **User acceptance**: docstring information is preserved, which developers value
## 11. References
- [PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
- [Google Python Style Guide - Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
- [NumPy Docstring Standard](https://numpydoc.readthedocs.io/en/latest/format.html)
- [JSDoc Documentation](https://jsdoc.app/)
- [Javadoc Tool](https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html)


@@ -394,52 +394,32 @@ results = engine.search(
- Guide users on how to generate embeddings
- Integrated into search-engine logging
### LLM Semantic Enhancement Validation (2025-12-16)
### LLM Semantic Enhancement Removed (2025-12-16)
**Test goal**: verify that LLM-enhanced vector search works correctly and compare it against pure vector search
**Reason for removal**: simplify the codebase and reduce external dependencies
**Test infrastructure**:
- Created test suite `tests/test_llm_enhanced_search.py` (550+ lines)
- Created standalone test script `scripts/compare_search_methods.py` (460+ lines)
- Created full documentation `docs/LLM_ENHANCED_SEARCH_GUIDE.md` (460+ lines)
**Removed**:
- `src/codexlens/semantic/llm_enhancer.py` - core LLM enhancement module
- The `enhance` command in `src/codexlens/cli/commands.py`
- `tests/test_llm_enhancer.py` - LLM enhancement tests
- `tests/test_llm_enhanced_search.py` - LLM comparison tests
- `scripts/compare_search_methods.py` - comparison test script
- `scripts/test_misleading_comments.py` - misleading-comments test
- `scripts/show_llm_analysis.py` - LLM analysis display script
- `scripts/inspect_llm_summaries.py` - LLM summary inspection tool
- `docs/LLM_ENHANCED_SEARCH_GUIDE.md` - LLM usage guide
- `docs/LLM_ENHANCEMENT_TEST_RESULTS.md` - LLM test results
- `docs/MISLEADING_COMMENTS_TEST_RESULTS.md` - misleading-comments test results
- `docs/CLI_INTEGRATION_SUMMARY.md` - CLI integration doc (contained the enhance command)
- `docs/DOCSTRING_LLM_HYBRID_DESIGN.md` - LLM hybrid strategy design
**Test data**:
- 5 realistic Python code samples (auth, API, validation, database)
- 6 natural-language test queries
- Covering password hashing, JWT tokens, user APIs, email validation, database connections, etc.
**Retained functionality**:
- ✅ Pure vector search (pure_vector) fully retained
- ✅ Semantic embedding generation (`codexlens embeddings-generate`)
- ✅ Embedding status check (`codexlens embeddings-status`)
- ✅ All core search functionality
**Test results** (2025-12-16):
```
Dataset: 5 Python files, 5 queries
Test tool: Gemini Flash 2.5
Setup Time:
- Pure Vector: 2.3 s (embed code directly)
- LLM-Enhanced: 174.2 s (summaries generated via Gemini, 75x slower)
Accuracy:
- Pure Vector: 5/5 (100%) - all queries at Rank 1
- LLM-Enhanced: 5/5 (100%) - all queries at Rank 1
- Score: 15 vs 15 (tie)
```
**Key findings**:
1. **LLM enhancement works correctly**
- CCW CLI integration works
- Gemini API calls succeed
- Summary generation and embedding creation work
2. **Performance trade-off**
- Indexing is 75x slower (LLM API call overhead)
- Query speed is identical (both are vector similarity search)
- Suited to offline indexing with online querying
3. **Accuracy**
- The test dataset is too simple (5 files with a perfect 1:1 mapping)
- Both methods reach 100% accuracy
- A larger, more complex codebase is needed to show a difference
**Conclusion**: LLM semantic enhancement was verified to work and could be used in production
**History**: LLM enhancement performed well in testing, but it was removed to simplify maintenance and reduce external dependencies (CCW CLI, Gemini/Qwen API). Design documents (DESIGN_EVALUATION_REPORT.md and others) are kept as historical reference.
### P2 - Mid-Term (1-2 months)


@@ -1,463 +0,0 @@
# LLM-Enhanced Semantic Search Guide
**Last Updated**: 2025-12-16
**Status**: Experimental Feature
---
## Overview
CodexLens supports two approaches for semantic vector search:
| Approach | Pipeline | Best For |
|----------|----------|----------|
| **Pure Vector** | Code → fastembed → search | Code pattern matching, exact functionality |
| **LLM-Enhanced** | Code → LLM summary → fastembed → search | Natural language queries, conceptual search |
### Why LLM Enhancement?
**Problem**: Raw code embeddings don't match natural language well.
```
Query: "How do I hash passwords securely?"
Raw code: def hash_password(password: str) -> str: ...
Mismatch: Low semantic similarity
```
**Solution**: LLM generates natural language summaries.
```
Query: "How do I hash passwords securely?"
LLM Summary: "Hash a password using bcrypt with specified salt rounds for secure storage"
Match: High semantic similarity ✓
```
## Architecture
### Pure Vector Search Flow
```
1. Code File
└→ "def hash_password(password: str): ..."
2. Chunking
└→ Split into semantic chunks (500-2000 chars)
3. Embedding (fastembed)
└→ Generate 768-dim vector from raw code
4. Storage
└→ Store vector in semantic_chunks table
5. Query
└→ "How to hash passwords"
└→ Generate query vector
└→ Find similar vectors (cosine similarity)
```
**Pros**: Fast, no external dependencies, good for code patterns
**Cons**: Poor semantic match for natural language queries
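To make the pure-vector flow above concrete, here is a minimal, self-contained sketch using fastembed and cosine similarity. It is illustrative only, not CodexLens internals: the chunk list is fabricated, and it assumes fastembed ships the `jinaai/jina-embeddings-v2-base-code` model mentioned elsewhere in these docs.
```python
import numpy as np
from fastembed import TextEmbedding  # pip install fastembed

# 768-dim code embedding model (assumed to be available in fastembed)
model = TextEmbedding("jinaai/jina-embeddings-v2-base-code")

chunks = [
    "def hash_password(password: str) -> str: ...",
    "def send_email(to: str, body: str) -> None: ...",
]
chunk_vecs = np.array(list(model.embed(chunks)))
query_vec = np.array(list(model.embed(["How to hash passwords"])))[0]

# Cosine similarity between the query vector and every chunk vector
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
print(chunks[int(scores.argmax())])  # best-matching chunk
```
The LLM-enhanced flow described next differs only in *what* gets embedded (an LLM-written summary instead of raw code); the query-time similarity step is the same.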
### LLM-Enhanced Search Flow
```
1. Code File
└→ "def hash_password(password: str): ..."
2. LLM Analysis (Gemini/Qwen via CCW)
└→ Generate summary: "Hash a password using bcrypt..."
└→ Extract keywords: ["password", "hash", "bcrypt", "security"]
└→ Identify purpose: "auth"
3. Embeddable Text Creation
└→ Combine: summary + keywords + purpose + filename
4. Embedding (fastembed)
└→ Generate 768-dim vector from LLM text
5. Storage
└→ Store vector with metadata
6. Query
└→ "How to hash passwords"
└→ Generate query vector
└→ Find similar vectors → Better match! ✓
```
**Pros**: Excellent semantic match for natural language
**Cons**: Slower, requires CCW CLI and LLM access
## Setup Requirements
### 1. Install Dependencies
```bash
# Install semantic search dependencies
pip install codexlens[semantic]
# Install CCW CLI for LLM enhancement
npm install -g ccw
```
### 2. Configure LLM Tools
```bash
# Set primary LLM tool (default: gemini)
export CCW_CLI_SECONDARY_TOOL=gemini
# Set fallback tool (default: qwen)
export CCW_CLI_FALLBACK_TOOL=qwen
# Configure API keys (see CCW documentation)
ccw config set gemini.apiKey YOUR_API_KEY
```
### 3. Verify Setup
```bash
# Check CCW availability
ccw --version
# Check semantic dependencies
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
```
## Running Comparison Tests
### Method 1: Standalone Script (Recommended)
```bash
# Run full comparison (pure vector + LLM-enhanced)
python scripts/compare_search_methods.py
# Use specific LLM tool
python scripts/compare_search_methods.py --tool gemini
python scripts/compare_search_methods.py --tool qwen
# Skip LLM test (only pure vector)
python scripts/compare_search_methods.py --skip-llm
```
**Output Example**:
```
======================================================================
SEMANTIC SEARCH COMPARISON TEST
Pure Vector vs LLM-Enhanced Vector Search
======================================================================
Test dataset: 5 Python files
Test queries: 5 natural language questions
======================================================================
PURE VECTOR SEARCH (Code → fastembed)
======================================================================
Setup: 5 files, 23 chunks in 2.3s
Query Top Result Score
----------------------------------------------------------------------
✓ How do I securely hash passwords? password_hasher.py 0.723
✗ Generate JWT token for authentication user_endpoints.py 0.645
✓ Create new user account via API user_endpoints.py 0.812
✓ Validate email address format validation.py 0.756
~ Connect to PostgreSQL database connection.py 0.689
======================================================================
LLM-ENHANCED SEARCH (Code → GEMINI → fastembed)
======================================================================
Generating LLM summaries for 5 files...
Setup: 5/5 files indexed in 8.7s
Query Top Result Score
----------------------------------------------------------------------
✓ How do I securely hash passwords? password_hasher.py 0.891
✓ Generate JWT token for authentication jwt_handler.py 0.867
✓ Create new user account via API user_endpoints.py 0.923
✓ Validate email address format validation.py 0.845
✓ Connect to PostgreSQL database connection.py 0.801
======================================================================
COMPARISON SUMMARY
======================================================================
Query Pure LLM
----------------------------------------------------------------------
How do I securely hash passwords? ✓ Rank 1 ✓ Rank 1
Generate JWT token for authentication ✗ Miss ✓ Rank 1
Create new user account via API ✓ Rank 1 ✓ Rank 1
Validate email address format ✓ Rank 1 ✓ Rank 1
Connect to PostgreSQL database ~ Rank 2 ✓ Rank 1
----------------------------------------------------------------------
TOTAL SCORE 11 15
======================================================================
ANALYSIS:
✓ LLM enhancement improves results by 36.4%
Natural language summaries match queries better than raw code
```
### Method 2: Pytest Test Suite
```bash
# Run full test suite
pytest tests/test_llm_enhanced_search.py -v -s
# Run specific test
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s
# Skip LLM tests if CCW not available
pytest tests/test_llm_enhanced_search.py -v -s -k "not llm_enhanced"
```
## Using LLM Enhancement in Production
### Option 1: Enhanced Embeddings Generation (Recommended)
Create embeddings with LLM enhancement during indexing:
```python
from pathlib import Path
from codexlens.semantic.llm_enhancer import create_enhanced_indexer, FileData
# Create enhanced indexer
indexer = create_enhanced_indexer(
vector_store_path=Path("~/.codexlens/indexes/project/_index.db"),
llm_tool="gemini",
llm_enabled=True,
)
# Prepare file data
files = [
FileData(
path="auth/password_hasher.py",
content=open("auth/password_hasher.py").read(),
language="python"
),
# ... more files
]
# Index with LLM enhancement
indexed_count = indexer.index_files(files)
print(f"Indexed {indexed_count} files with LLM enhancement")
```
### Option 2: CLI Integration (Coming Soon)
```bash
# Generate embeddings with LLM enhancement
codexlens embeddings-generate ~/projects/my-app --llm-enhanced --tool gemini
# Check which strategy was used
codexlens embeddings-status ~/projects/my-app --show-strategies
```
**Note**: CLI integration is planned but not yet implemented. Currently use Option 1 (Python API).
### Option 3: Hybrid Approach
Combine both strategies for best results:
```python
# Generate both pure and LLM-enhanced embeddings
# 1. Pure vector for exact code matching
generate_pure_embeddings(files)
# 2. LLM-enhanced for semantic matching
generate_llm_embeddings(files)
# Search uses both and ranks by best match
```
## Performance Considerations
### Speed Comparison
| Approach | Indexing Time (100 files) | Query Time | Cost |
|----------|---------------------------|------------|------|
| Pure Vector | ~30s | ~50ms | Free |
| LLM-Enhanced | ~5-10 min | ~50ms | LLM API costs |
**LLM indexing is slower** because:
- Calls external LLM API (gemini/qwen)
- Processes files in batches (default: 5 files/batch)
- Waits for LLM response (~2-5s per batch)
**Query speed is identical** because:
- Both use fastembed for similarity search
- Vector lookup is same speed
- Difference is only in what was embedded
### Cost Estimation
**Gemini Flash (via CCW)**:
- ~$0.10 per 1M input tokens
- Average: ~500 tokens per file
- 100 files = ~$0.005 (half a cent)
**Qwen (local)**:
- Free if running locally
- Slower than Gemini Flash
### When to Use Each Approach
| Use Case | Recommendation |
|----------|----------------|
| **Code pattern search** | Pure vector (e.g., "find all REST endpoints") |
| **Natural language queries** | LLM-enhanced (e.g., "how to authenticate users") |
| **Large codebase** | Pure vector first, LLM for important modules |
| **Personal projects** | LLM-enhanced (cost is minimal) |
| **Enterprise** | Hybrid approach |
## Configuration Options
### LLM Config
```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer
config = LLMConfig(
tool="gemini", # Primary LLM tool
fallback_tool="qwen", # Fallback if primary fails
timeout_ms=300000, # 5 minute timeout
batch_size=5, # Files per batch
max_content_chars=8000, # Max chars per file in prompt
enabled=True, # Enable/disable LLM
)
enhancer = LLMEnhancer(config)
```
### Environment Variables
```bash
# Override default LLM tool
export CCW_CLI_SECONDARY_TOOL=gemini
# Override fallback tool
export CCW_CLI_FALLBACK_TOOL=qwen
# Disable LLM enhancement (fall back to pure vector)
export CODEXLENS_LLM_ENABLED=false
```
## Troubleshooting
### Issue 1: CCW CLI Not Found
**Error**: `CCW CLI not found in PATH, LLM enhancement disabled`
**Solution**:
```bash
# Install CCW globally
npm install -g ccw
# Verify installation
ccw --version
# Check PATH
which ccw # Unix
where ccw # Windows
```
### Issue 2: LLM API Errors
**Error**: `LLM call failed: HTTP 429 Too Many Requests`
**Solution**:
- Reduce batch size in LLMConfig
- Add delay between batches
- Check API quota/limits
- Try fallback tool (qwen)
### Issue 3: Poor LLM Summaries
**Symptom**: LLM summaries are too generic or inaccurate
**Solution**:
- Try different LLM tool (gemini vs qwen)
- Increase max_content_chars (default 8000)
- Manually review and refine summaries
- Fall back to pure vector for code-heavy files
### Issue 4: Slow Indexing
**Symptom**: Indexing takes too long with LLM enhancement
**Solution**:
```python
# Reduce batch size for faster feedback
config = LLMConfig(batch_size=2) # Default is 5
# Or use pure vector for large files
if file_size > 10000:
use_pure_vector()
else:
use_llm_enhanced()
```
## Example Test Queries
### Good for LLM-Enhanced Search
```python
# Natural language, conceptual queries
"How do I authenticate users with JWT?"
"Validate email addresses before saving to database"
"Secure password storage with hashing"
"Create REST API endpoint for user registration"
"Connect to PostgreSQL with connection pooling"
```
### Good for Pure Vector Search
```python
# Code-specific, pattern-matching queries
"bcrypt.hashpw"
"jwt.encode"
"@app.route POST"
"re.match email"
"psycopg2.pool.SimpleConnectionPool"
```
### Best: Combine Both
Use LLM-enhanced for high-level search, then pure vector for refinement:
```python
# Step 1: LLM-enhanced for semantic search
results = search_llm_enhanced("user authentication with tokens")
# Returns: jwt_handler.py, password_hasher.py, user_endpoints.py
# Step 2: Pure vector for exact code pattern
results = search_pure_vector("jwt.encode")
# Returns: jwt_handler.py (exact match)
```
## Future Improvements
- [ ] CLI integration for `--llm-enhanced` flag
- [ ] Incremental LLM summary updates
- [ ] Caching LLM summaries to reduce API calls
- [ ] Hybrid search combining both approaches
- [ ] Custom prompt templates for specific domains
- [ ] Local LLM support (ollama, llama.cpp)
## Related Documentation
- `PURE_VECTOR_SEARCH_GUIDE.md` - Pure vector search usage
- `IMPLEMENTATION_SUMMARY.md` - Technical implementation details
- `scripts/compare_search_methods.py` - Comparison test script
- `tests/test_llm_enhanced_search.py` - Test suite
## References
- **LLM Enhancer Implementation**: `src/codexlens/semantic/llm_enhancer.py`
- **CCW CLI Documentation**: https://github.com/anthropics/ccw
- **Fastembed**: https://github.com/qdrant/fastembed
---
**Questions?** Run the comparison script to see LLM enhancement in action:
```bash
python scripts/compare_search_methods.py
```


@@ -1,232 +0,0 @@
# LLM Semantic Enhancement Test Results
**Test date**: 2025-12-16
**Status**: ✅ Passed - LLM enhancement works correctly
---
## 📊 Results Overview
### Test Configuration
| Item | Configuration |
|------|------|
| **Test tool** | Gemini Flash 2.5 (via CCW CLI) |
| **Test data** | 5 Python code files |
| **Number of queries** | 5 natural-language queries |
| **Embedding model** | BAAI/bge-small-en-v1.5 (768-dim) |
### Performance Comparison
| Metric | Pure vector | LLM-enhanced | Difference |
|------|-----------|------------|------|
| **Indexing time** | 2.3 s | 174.2 s | 75x slower |
| **Query speed** | ~50ms | ~50ms | Same |
| **Accuracy** | 5/5 (100%) | 5/5 (100%) | Same |
| **Ranking score** | 15/15 | 15/15 | Tie |
### Detailed Results
All 5 queries found the correct file (Rank 1):
| Query | Expected file | Pure vector | LLM-enhanced |
|------|---------|--------|---------|
| How do I hash passwords securely? | password_hasher.py | [OK] Rank 1 | [OK] Rank 1 |
| Generate a JWT token for authentication | jwt_handler.py | [OK] Rank 1 | [OK] Rank 1 |
| Create a new user account via the API | user_endpoints.py | [OK] Rank 1 | [OK] Rank 1 |
| Validate email address format | validation.py | [OK] Rank 1 | [OK] Rank 1 |
| Connect to a PostgreSQL database | connection.py | [OK] Rank 1 | [OK] Rank 1 |
---
## ✅ Validation Conclusions
### 1. LLM Enhancement Works Correctly
- **CCW CLI integration**: the external CLI tool is invoked successfully
- **Gemini API**: API calls succeed without errors
- **Summary generation**: the LLM generates code summaries and keywords
- **Embedding creation**: 768-dim vectors are generated from the summaries
- **Vector storage**: stored correctly in the semantic_chunks table
- **Search accuracy**: 100% correct matches on all queries
### 2. Performance Trade-off Analysis
**Advantages**:
- Query speed identical to pure vector (~50ms)
- Better semantic understanding (in theory)
- Well suited to natural-language queries
**Disadvantages**:
- Indexing is 75x slower (174 s vs 2.3 s)
- Requires an external LLM API (cost)
- Requires installing and configuring the CCW CLI
**Suitable scenarios**:
- Offline indexing, online querying
- Personal projects (negligible cost)
- Workflows that prioritize the natural-language query experience
### 3. Test Dataset Limitations
**The current test is too simple**:
- Only 5 files
- Each query maps perfectly to exactly 1 file
- No ambiguity or similar files
- Both methods find the target easily
**Expected in real-world scenarios**:
- Hundreds or thousands of files
- Multiple files with similar functionality
- Fuzzy or conceptual queries
- LLM enhancement should perform better
---
## 🛠️ Test Infrastructure
### Files Created
1. **Test suite** (`tests/test_llm_enhanced_search.py`)
- 550+ lines
- Full pytest tests
- 3 test classes (pure vector, LLM-enhanced, comparison)
2. **Standalone script** (`scripts/compare_search_methods.py`)
- 460+ lines
- Run directly: `python scripts/compare_search_methods.py`
- Supports options: `--tool gemini|qwen`, `--skip-llm`
- Detailed comparison report
3. **Full documentation** (`docs/LLM_ENHANCED_SEARCH_GUIDE.md`)
- 460+ lines
- Architecture comparison diagrams
- Setup instructions
- Usage examples
- Troubleshooting
### Running the Tests
```bash
# Option 1: standalone script (recommended)
python scripts/compare_search_methods.py --tool gemini
# Option 2: pytest
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s
# Skip LLM tests (pure vector only)
python scripts/compare_search_methods.py --skip-llm
```
### Prerequisites
```bash
# 1. Install semantic search dependencies
pip install codexlens[semantic]
# 2. Install the CCW CLI
npm install -g ccw
# 3. Configure API keys
ccw config set gemini.apiKey YOUR_API_KEY
```
---
## 🔍 Architecture Comparison
### Pure Vector Search Flow
```
Code files → chunking → fastembed (768-dim) → semantic_chunks table → vector search
```
**Pros**: fast, no external dependencies, embeds code directly
**Cons**: weaker understanding of natural-language queries
### LLM-Enhanced Search Flow
```
Code files → CCW CLI (calls Gemini) → summary + keywords → fastembed (768-dim) → semantic_chunks table → vector search
```
**Pros**: better semantic understanding, suited to natural-language queries
**Cons**: indexing is 75x slower, requires an LLM API, has a cost
---
## 💰 Cost Estimation
### Gemini Flash (via CCW)
- Price: ~$0.10 / 1M input tokens
- Average: ~500 tokens / file
- Cost for 100 files: ~$0.005 (half a cent)
### Qwen (local)
- Price: free (runs locally)
- Speed: slower than Gemini Flash
---
## 📝 Issues Fixed
### 1. Unicode Encoding
**Problem**: the Windows GBK console cannot display Unicode symbols (✓, ✗, •)
**Fix**: replaced them with ASCII symbols ([OK], [X], -)
**Affected files**:
- `scripts/compare_search_methods.py`
- `tests/test_llm_enhanced_search.py`
### 2. Database File Locking
**Problem**: Windows could not delete the temporary database (PermissionError)
**Fix**: added garbage collection and exception handling
```python
import gc
import time
gc.collect()  # force connections closed
time.sleep(0.1)  # wait for Windows to release the file handle
```
### 3. Regular Expression Warning
**Problem**: SyntaxWarning about invalid escape sequence `\.`
**Status**: harmless warning; the regex works correctly
---
## 🎯 Conclusions and Recommendations
### Core Findings
1. ✅ **LLM semantic enhancement is verified as working**
2. ✅ **The test infrastructure is complete**
3. ⚠️ **The test dataset needs to be expanded** (currently too simple)
### Usage Recommendations
| Scenario | Recommendation |
|------|---------|
| Code pattern search | Pure vector (e.g. "find all REST endpoints") |
| Natural-language queries | LLM-enhanced (e.g. "how to authenticate users") |
| Large codebases | Pure vector first; LLM for important modules |
| Personal projects | LLM-enhanced (negligible cost) |
| Enterprise applications | Hybrid approach |
### Follow-up Work (Optional)
- [ ] Use a larger test dataset (100+ files)
- [ ] Test more complex queries (conceptual, fuzzy)
- [ ] Performance optimization (batched LLM calls)
- [ ] Cost optimization (cache LLM summaries)
- [ ] Hybrid search (combine both methods)
---
**Completed**: 2025-12-16
**Tester**: Claude (Sonnet 4.5)
**Document version**: 1.0


@@ -0,0 +1,342 @@
# LLM Enhancement Removal Summary
**Removal date**: 2025-12-16
**Initiated by**: user request
**Status**: ✅ Complete
---
## 📋 Removal Checklist
### ✅ Deleted source files
| File | Description |
|------|------|
| `src/codexlens/semantic/llm_enhancer.py` | Core LLM enhancement module (900+ lines) |
### ✅ Modified source files
| File | Changes |
|------|---------|
| `src/codexlens/cli/commands.py` | Removed the `enhance` command (lines 1050-1227) |
| `src/codexlens/semantic/__init__.py` | Removed LLM-related exports (lines 35-69) |
### ✅ Modified front-end files (CCW Dashboard)
| File | Changes |
|------|---------|
| `ccw/src/templates/dashboard-js/components/cli-status.js` | Removed LLM enhancement settings (8 lines), Semantic Settings Modal (615 lines), Metadata Viewer (326 lines) |
| `ccw/src/templates/dashboard-js/i18n.js` | Removed English LLM translations (26 lines) and Chinese LLM translations (26 lines) |
| `ccw/src/templates/dashboard-js/views/cli-manager.js` | Removed the LLM badge and settings-modal calls (3 lines) |
### ✅ Deleted test files
| File | Description |
|------|------|
| `tests/test_llm_enhancer.py` | LLM enhancement unit tests |
| `tests/test_llm_enhanced_search.py` | LLM vs pure-vector comparison tests (550+ lines) |
### ✅ Deleted script files
| File | Description |
|------|------|
| `scripts/compare_search_methods.py` | Pure-vector vs LLM-enhanced comparison script (460+ lines) |
| `scripts/test_misleading_comments.py` | Misleading-comments test script (490+ lines) |
| `scripts/show_llm_analysis.py` | LLM analysis display tool |
| `scripts/inspect_llm_summaries.py` | LLM summary inspection tool |
### ✅ Deleted documentation files
| File | Description |
|------|------|
| `docs/LLM_ENHANCED_SEARCH_GUIDE.md` | LLM enhancement usage guide (460+ lines) |
| `docs/LLM_ENHANCEMENT_TEST_RESULTS.md` | LLM test results |
| `docs/MISLEADING_COMMENTS_TEST_RESULTS.md` | Misleading-comments test results |
| `docs/CLI_INTEGRATION_SUMMARY.md` | CLI integration doc (contained the enhance command) |
| `docs/DOCSTRING_LLM_HYBRID_DESIGN.md` | Docstring + LLM hybrid strategy design |
### ✅ Updated documentation
| File | Changes |
|------|---------|
| `docs/IMPLEMENTATION_SUMMARY.md` | Added a note on the LLM removal and listed the deleted content |
### 📚 Design documents kept (as historical reference)
| File | Description |
|------|------|
| `docs/DESIGN_EVALUATION_REPORT.md` | Technical evaluation report covering the LLM hybrid strategy |
| `docs/SEMANTIC_GRAPH_DESIGN.md` | Semantic graph design (may mention the LLM) |
| `docs/MULTILEVEL_CHUNKER_DESIGN.md` | Multi-level chunker design (may mention the LLM) |
*These documents are kept as technical history and do not affect current functionality.*
---
## 🔒 Removed Functionality
### CLI Command
```bash
# Removed - no longer available
codexlens enhance [PATH] --tool gemini --batch-size 5
# Note: this command generated code summaries by calling Gemini/Qwen through the CCW CLI
# Removal reason: reduce external dependencies, simplify maintenance
```
### Python API
```python
# Removed - no longer available
from codexlens.semantic import (
LLMEnhancer,
LLMConfig,
SemanticMetadata,
FileData,
EnhancedSemanticIndexer,
create_enhancer,
create_enhanced_indexer,
)
# Removed classes and functions:
# - LLMEnhancer: main LLM enhancer class
# - LLMConfig: LLM configuration class
# - SemanticMetadata: semantic metadata structure
# - FileData: file data structure
# - EnhancedSemanticIndexer: LLM-enhanced indexer
# - create_enhancer(): factory for the enhancer
# - create_enhanced_indexer(): factory for the enhanced indexer
```
---
## ✅ Retained Functionality
### Core features fully retained
| Feature | Status |
|------|------|
| **Pure vector search** | ✅ Fully retained |
| **Semantic embedding generation** | ✅ Fully retained (`codexlens embeddings-generate`) |
| **Embedding status check** | ✅ Fully retained (`codexlens embeddings-status`) |
| **Hybrid search engine** | ✅ Fully retained (exact + fuzzy + vector) |
| **Vector store** | ✅ Fully retained |
| **Semantic chunking** | ✅ Fully retained |
| **fastembed integration** | ✅ Fully retained |
### Available CLI commands
```bash
# Generate pure vector embeddings (no LLM required)
codexlens embeddings-generate [PATH]
# Check embedding status
codexlens embeddings-status [PATH]
# All search commands
codexlens search [QUERY] --index [PATH]
# All index-management commands
codexlens init [PATH]
codexlens update [PATH]
codexlens clean [PATH]
```
### Available Python API
```python
# Fully available - pure vector search
from codexlens.semantic import SEMANTIC_AVAILABLE, SEMANTIC_BACKEND
from codexlens.semantic.embedder import Embedder
from codexlens.semantic.vector_store import VectorStore
from codexlens.semantic.chunker import Chunker, ChunkConfig
from codexlens.search.hybrid_search import HybridSearchEngine
# Example: pure vector search
engine = HybridSearchEngine()
results = engine.search(
index_path,
query="your search query",
enable_vector=True,
pure_vector=True, # pure vector mode
)
```
---
## 🎯 Reasons for Removal
### 1. Simplified dependencies
**External dependencies removed**:
- CCW CLI (npm package)
- Gemini API (requires an API key)
- Qwen API (optional)
**Dependencies kept**:
- fastembed (ONNX-based, lightweight)
- numpy
- Python standard library
### 2. Reduced complexity
- **Before**: two search approaches (pure vector + LLM-enhanced)
- **After**: one search approach (pure vector)
- Removed 900+ lines of LLM enhancement code
- Removed the CLI command and related configuration
- Removed tests and documentation
### 3. Performance considerations
| Aspect | LLM-enhanced | Pure vector |
|------|---------|--------|
| **Indexing speed** | 75x slower | Baseline |
| **Query speed** | Same | Same |
| **Accuracy** | Same* | Baseline |
| **Cost** | API fees | Free |
*Accuracy was identical on the test dataset (5/5), though LLM enhancement could in theory do better in more complex scenarios.
### 4. Maintenance burden
**Before removal**:
- Maintain the CCW CLI integration
- Handle API rate limits and errors
- Test multiple LLM backends
- Maintain the batching logic
**After removal**:
- A single embedding engine (fastembed)
- No external API dependencies
- Simpler error handling
- Easier to test
---
## 🔍 Validation Results
### Import tests
```bash
# ✅ Passed - semantic module works
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
# Output: True
# ✅ Passed - search engine works
python -c "from codexlens.search.hybrid_search import HybridSearchEngine; print('OK')"
# Output: OK
```
### Code cleanliness check
```bash
# ✅ Passed - no leftover LLM references
grep -r "llm_enhancer\|LLMEnhancer\|LLMConfig" src/ --include="*.py"
# Output: (empty)
```
### Test results
```bash
# ✅ 5/7 passed - basic pure vector search works
pytest tests/test_pure_vector_search.py -v
# Passed: 5 basic tests
# Failed: 2 embedding tests (known model dimension mismatch, unrelated to the LLM removal)
```
---
## 📊 Statistics
### Code deletion statistics
| Type | Files deleted | Lines deleted (estimate) |
|------|-----------|-----------------|
| **Source code** | 1 | ~900 lines |
| **CLI commands** | 1 command | ~180 lines |
| **Export cleanup** | 1 section | ~35 lines |
| **Front-end code** | 3 files | ~1000 lines |
| **Test files** | 2 | ~600 lines |
| **Scripts/tools** | 4 | ~1500 lines |
| **Documentation** | 5 | ~2000 lines |
| **Total** | 16 files/sections | ~6200 lines |
### Dependency simplification
| Aspect | Before | After |
|------|--------|--------|
| **External tool dependencies** | CCW CLI, Gemini/Qwen | None |
| **Python package dependencies** | fastembed, numpy | fastembed, numpy |
| **API dependencies** | Gemini/Qwen API | None |
| **Configuration complexity** | High (tool, batch_size, API keys) | Low (model profile) |
---
## 🚀 Follow-up Recommendations
### If LLM enhancement is needed again
1. **Restore it from git history**
```bash
# Find the commits before the deletion
git log --all --full-history -- "*llm_enhancer*"
# Restore a specific file
git checkout <commit-hash> -- src/codexlens/semantic/llm_enhancer.py
```
2. **Or use external tools**
- Generate summaries with a standalone script before indexing
- Add the summaries to the code as comments
- Then index with pure vector search (the summaries will be included)
3. **Or consider lightweight alternatives**
- Use a small local model (llama.cpp, ggml)
- Use docstring extraction, which needs no LLM (see the sketch below)
- Use static analysis to generate summaries
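As a rough illustration of the docstring-extraction alternative, here is a minimal sketch using only the Python standard library. It is not part of CodexLens: the function name `extract_docstrings` and the example file path are illustrative assumptions; the idea is that the collected docstrings could be embedded alongside the code by the existing pure-vector pipeline.
```python
import ast
from pathlib import Path

def extract_docstrings(path: str) -> dict[str, str]:
    """Collect module, class, and function docstrings from one Python file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    docs = {"<module>": ast.get_docstring(tree) or ""}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docs[node.name] = ast.get_docstring(node) or ""
    # Keep only symbols that actually have a docstring
    return {name: text for name, text in docs.items() if text}

# Example usage (hypothetical path): print the first line of each docstring
for name, text in extract_docstrings("src/codexlens/cli/embedding_manager.py").items():
    print(f"{name}: {text.splitlines()[0]}")
```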
### Codebase maintenance recommendations
1. ✅ **Keep it simple** - continue using pure vector search
2. ✅ **Optimize existing features** - improve vector search accuracy
3. ✅ **Improve incrementally** - refine the chunking strategy and embedding quality
4. ⚠️ **Avoid a relapse** - if LLM features are requested again, first evaluate whether they are truly necessary
---
## 📝 File Inventory
### Full list of deleted files
```
src/codexlens/semantic/llm_enhancer.py
tests/test_llm_enhancer.py
tests/test_llm_enhanced_search.py
scripts/compare_search_methods.py
scripts/test_misleading_comments.py
scripts/show_llm_analysis.py
scripts/inspect_llm_summaries.py
docs/LLM_ENHANCED_SEARCH_GUIDE.md
docs/LLM_ENHANCEMENT_TEST_RESULTS.md
docs/MISLEADING_COMMENTS_TEST_RESULTS.md
docs/CLI_INTEGRATION_SUMMARY.md
docs/DOCSTRING_LLM_HYBRID_DESIGN.md
```
### Modified files
```
src/codexlens/cli/commands.py (removed the enhance command)
src/codexlens/semantic/__init__.py (removed LLM exports)
ccw/src/templates/dashboard-js/components/cli-status.js (removed LLM settings, Settings Modal, Metadata Viewer)
ccw/src/templates/dashboard-js/i18n.js (removed LLM translation strings)
ccw/src/templates/dashboard-js/views/cli-manager.js (removed LLM badge and modal calls)
docs/IMPLEMENTATION_SUMMARY.md (added removal note)
```
---
**Removal completed**: 2025-12-16
**Document version**: 1.0
**Validation status**: ✅ Passed


@@ -1,301 +0,0 @@
# Misleading Comments Test Results
**Test date**: 2025-12-16
**Test goal**: verify whether LLM-enhanced search can overcome wrong or missing code comments
---
## 📊 Results Summary
### Performance Comparison
| Method | Indexing time | Accuracy | Score | Conclusion |
|------|---------|--------|------|------|
| **Pure vector search** | 2.1 s | 5/5 (100%) | 15/15 | ✅ Not fooled by misleading comments |
| **LLM-enhanced search** | 103.7 s | 5/5 (100%) | 15/15 | ✅ Correctly identified the actual functionality |
**Conclusion**: a tie - both methods handle misleading comments correctly
---
## 🧪 Test Dataset Design
### Misleading code samples (5 files)
| File | Wrong comment | Actual functionality | Severity |
|------|---------|---------|---------|
| `crypto/hasher.py` | "Simple string utilities" | bcrypt password hashing | High |
| `auth/token.py` | No comments, vague function names | JWT token generation | Medium |
| `api/handlers.py` | "Database utilities", reversed docstrings | REST API user management | Extreme |
| `utils/checker.py` | "Math calculation functions" | Email address validation | High |
| `db/pool.py` | "Email sending service" | PostgreSQL connection pool | Extreme |
### Concrete misleading examples
#### Example 1: Completely wrong module description
```python
"""Email sending service.""" # Wrong!
import psycopg2 # actually a database library
from psycopg2 import pool
class EmailSender: # wrong class name
"""SMTP email sender with retry logic.""" # Wrong!
def __init__(self, min_conn: int = 1, max_conn: int = 10):
"""Initialize email sender.""" # Wrong!
self.pool = psycopg2.pool.SimpleConnectionPool(...) # actually a DB connection pool
```
**Actual functionality**: PostgreSQL database connection-pool manager
**Comments claim**: SMTP email-sending service
#### Example 2: Reversed function docs
```python
@app.route('/api/items', methods=['POST'])
def create_item():
"""Delete an existing item.""" # exactly the opposite!
data = request.get_json()
# actually creates a new item
return jsonify({'item_id': item_id}), 201
```
### Test queries (based on actual functionality)
| Query | Expected file | Difficulty |
|------|---------|---------|
| "Hash passwords securely with bcrypt" | `crypto/hasher.py` | High - comments say string utils |
| "Generate JWT authentication token" | `auth/token.py` | Medium - no comments |
| "Create user account REST API endpoint" | `api/handlers.py` | High - comments say database |
| "Validate email address format" | `utils/checker.py` | High - comments say math |
| "PostgreSQL database connection pool" | `db/pool.py` | Extreme - comments say email |
---
## 🔍 LLM Analysis Capability Check
### Direct test: how the LLM reads misleading code
**Test code**: `db/pool.py` (claims to be an "Email sending service")
**Gemini analysis result**:
```
Summary: This Python module defines an `EmailSender` class that manages
a PostgreSQL connection pool for an email sending service, using
`psycopg2` for database interactions. It provides a context manager
`send_email` to handle connection acquisition, transaction commitment,
and release back to the pool.
Purpose: data
Keywords: psycopg2, connection pool, PostgreSQL, database, email sender,
context manager, python, database connection, transaction
```
**Analysis score**:
- ✅ **Correctly identified terms** (5/5): PostgreSQL, connection pool, database, psycopg2, database connection
- ⚠️ **Misleading terms** (2/3): email sender, email sending service (but with correct context)
**Conclusion**: the LLM correctly identified the actual functionality (a PostgreSQL connection pool). Although the summary opens by echoing the wrong module docstring, the core description is accurate.
---
## 💡 Key Findings
### 1. Why does pure vector search also work?
**Reason**: technical keywords in the code carry more weight than the comments
```python
# These strong signals match correctly even when the comments are wrong
import bcrypt # strong signal: password hashing
import jwt # strong signal: JWT tokens
import psycopg2 # strong signal: PostgreSQL
from flask import Flask, request # strong signal: REST API
pattern = r'^[a-zA-Z0-9._%+-]+@' # strong signal: email validation
```
**Advantages of the embedding model**:
- Code identifiers (bcrypt, jwt, psycopg2) are highly specific
- Import statements carry a lot of weight
- Regex patterns carry semantic information
- Framework API calls (Flask routes) provide clear context
### 2. The value of LLM enhancement
**How the LLM analyzes**:
1. ✅ Reads the code logic (not just the comments)
2. ✅ Recognizes import statements and actual usage
3. ✅ Understands control flow and data flow
4. ✅ Generates behavior-based summaries
5. ⚠️ Partially echoes the wrong comments (but does not rely on them)
**Comparison**:
| Aspect | Pure vector | LLM-enhanced |
|------|--------|---------|
| **What is processed** | Code + comments (embedded as a whole) | Code analysis → generated summary |
| **Impact of misleading comments** | Low (code keywords dominate) | Very low (understands the code logic) |
| **Natural-language queries** | Relies on code-vocabulary matching | Understands semantic intent |
| **Processing speed** | Fast (2 s) | Slow (104 s, 52x) |
### 3. Test dataset limitations
**Why both methods score perfectly**:
1. **Too few files** (5 files)
- No competing files with similar functionality
- Each query has a unique target file
2. **Code keywords are too strong**
- bcrypt → only used for passwords
- jwt → only used for tokens
- Flask + @app.route → the only API
- psycopg2 → the only database
3. **Queries are too specific**
- "bcrypt password hashing" matches code keywords directly
- Not conceptual or fuzzy queries
**An ideal test scenario**:
- ❌ 5 files with unique functionality
- ✅ 100+ files with several similar modules
- ✅ Fuzzy conceptual queries: "user authentication" rather than "bcrypt hash"
- ✅ Business-logic code without obvious keywords
---
## 🎯 Practical Recommendations
### When to use pure vector search
**Recommended for**:
- Well-documented codebases
- Searching for code patterns and API usage
- Known tech-stack keywords
- Fast indexing requirements
**Example queries**:
- "bcrypt.hashpw usage"
- "Flask @app.route GET method"
- "jwt.encode algorithm"
### When to use LLM-enhanced search
**Recommended for**:
- Codebases with missing or outdated documentation
- Natural-language, conceptual queries
- Business-logic search
- When search accuracy matters more than indexing speed
**Example queries**:
- "How to authenticate users?" (conceptual)
- "Payment processing workflow" (business logic)
- "Error handling for API requests" (pattern search)
### Hybrid strategy (recommended)
| Module type | Indexing approach | Reason |
|---------|---------|------|
| **Core business logic** | LLM-enhanced | Complex logic, documentation may be incomplete |
| **Utility functions** | Pure vector | Clear code, obvious keywords |
| **Third-party integrations** | Pure vector | The API calls are already the best description |
| **Legacy code** | LLM-enhanced | Documentation stale or missing |
---
## 📈 Performance and Cost
### Time cost
| Operation | Pure vector | LLM-enhanced | Difference |
|------|--------|---------|------|
| **Index 5 files** | 2.1 s | 103.7 s | ~49x slower |
| **Index 100 files** | ~42 s | ~35 min | ~50x slower |
| **Query speed** | ~50ms | ~50ms | Same |
### Monetary cost (Gemini Flash)
- **Price**: $0.10 / 1M input tokens
- **Average**: ~500 tokens / file
- **100 files**: $0.005 (half a cent)
- **1000 files**: $0.05 (five cents)
**Conclusion**: monetary cost is negligible; time cost is the main consideration
---
## 🧪 Test Tools
### Scripts created
1. **`scripts/test_misleading_comments.py`**
- Full comparison test
- Supports `--tool gemini|qwen`
- Supports `--keep-db` to keep the results database
2. **`scripts/show_llm_analysis.py`**
- Shows the LLM's analysis of a single file
- Evaluates whether the LLM was misled
- Computes the ratio of correct vs misleading terms
3. **`scripts/inspect_llm_summaries.py`**
- Inspects the LLM summaries in the database
- Shows metadata and keywords
### Running the tests
```bash
# Full comparison test
python scripts/test_misleading_comments.py --tool gemini
# Keep the database for inspection
python scripts/test_misleading_comments.py --keep-db ./results.db
# Show the LLM's analysis of a single file
python scripts/show_llm_analysis.py
# Inspect the summaries stored in the database
python scripts/inspect_llm_summaries.py results.db
```
---
## 📝 Conclusions
### Test conclusions
1. ✅ **The LLM can overcome misleading comments**
- Correctly identifies the actual code behavior
- Generates accurate, behavior-based summaries
- Does not rely solely on docstrings
2. ✅ **Pure vector search is also robust**
- Code keywords provide strong signals
- Tech-stack names are highly specific
- Import statements and API calls are information-rich
3. ⚠️ **The current test dataset is too simple**
- Larger-scale testing is needed (100+ files)
- Conceptual-query testing is needed
- Comparison among similar modules is needed
### Production usage recommendations
**Best practice**: choose the strategy based on codebase characteristics
| Codebase characteristics | Recommendation | Rationale |
|-----------|---------|------|
| Good docs, clear naming | Pure vector | Fast, low cost |
| Missing/stale docs | LLM-enhanced | Understands the code logic |
| Legacy systems | LLM-enhanced | Overcomes historical baggage |
| New projects | Pure vector | Modern code is usually clearer |
| Large enterprise codebases | Hybrid | Per-module strategy |
---
**Test completed**: 2025-12-16
**Test tools**: Gemini Flash 2.5, fastembed (BAAI/bge-small-en-v1.5)
**Document version**: 1.0