feat: add semantic graph design for static code analysis

- Introduced a comprehensive design document for a Code Semantic Graph aimed at enhancing static analysis capabilities.
- Defined the architecture, core components, and implementation steps for analyzing function calls, data flow, and dependencies.
- Included detailed specifications for nodes and edges in the graph, along with database schema for storage.
- Outlined phases for implementation, technical challenges, success metrics, and application scenarios.
2026-02-05 01:50:27 +08:00 · 2025-12-15 09:47:18 +08:00
parent d91477ad80
commit 3ffb907a6f
17 changed files with 4557 additions and 261 deletions
--- a/.claude/rules/context-requirements.md
+++ b/.claude/rules/context-requirements.md
@@ -43,4 +43,5 @@ Before implementation, always:
 - `exact`: Known exact pattern
 - `fuzzy`: Typo-tolerant search
 - `semantic`: Concept-based search
- `graph`: Dependency analysis
+- `graph`: Dependency analysis
+
--- a/.claude/rules/file-modification.md
+++ b/.claude/rules/file-modification.md
@@ -45,3 +45,49 @@
 **Use semantic search** for exploratory tasks
 **Use indexed search** for large, stable codebases
 **Use Exa** for external/public knowledge
+
+## ⚡ Core Search Tools
+
+**rg (ripgrep)**: Fast content search with regex support
+**find**: File/directory location by name patterns
+**grep**: Built-in pattern matching (fallback when rg unavailable)
+**get_modules_by_depth**: Program architecture analysis (MANDATORY before planning)
+
+
+## 🔧 Quick Command Reference
+
+```bash
+# Semantic File Discovery (codebase-retrieval via CCW)
+ccw cli exec "
+PURPOSE: Discover files relevant to task/feature
+TASK: • List all files related to [task/feature description]
+MODE: analysis
+CONTEXT: @**/*
+EXPECTED: Relevant file paths with relevance explanation
+RULES: Focus on direct relevance to task requirements | analysis=READ-ONLY
+" --tool gemini --cd [directory]
+
+# Program Architecture (MANDATORY before planning)
+ccw tool exec get_modules_by_depth '{}'
+
+# Content Search (rg preferred)
+rg "pattern" --type js -n        # Search JS files with line numbers
+rg -i "case-insensitive"         # Ignore case
+rg -C 3 "context"                # Show 3 lines before/after
+
+# File Search
+find . -name "*.ts" -type f      # Find TypeScript files
+find . -path "*/node_modules" -prune -o -name "*.js" -print
+
+# Workflow Examples
+rg "IMPL-\d+" .workflow/ --type json                    # Find task IDs
+find .workflow/ -name "*.json" -path "*/.task/*"        # Locate task files
+rg "status.*pending" .workflow/.task/                   # Find pending tasks
+```
+
+## ⚡ Performance Tips
+
+- **rg > grep** for content search
+- **Use --type filters** to limit file types
+- **Exclude dirs**: `--glob '!node_modules'`
+- **Use -F** for literal strings (no regex)
--- a/.claude/workflows/context-search-strategy.md
+++ b/.claude/workflows/context-search-strategy.md
@@ -13,7 +13,7 @@
 **rg (ripgrep)**: Fast content search with regex support
 **find**: File/directory location by name patterns
 **grep**: Built-in pattern matching (fallback when rg unavailable)
-**get_modules_by_depth.sh**: Program architecture analysis (MANDATORY before planning)
+**get_modules_by_depth**: Program architecture analysis (MANDATORY before planning)



--- a/.mcp.json
+++ b/.mcp.json
@@ -1,22 +1,11 @@
 {
  "mcpServers": {
-    "test-mcp-server": {
-      "command": "npx",
-      "args": [
-        "-y",
-        "@modelcontextprotocol/server-filesystem",
-        "D:/Claude_dms3"
-      ]
-    },
    "ccw-tools": {
      "command": "npx",
      "args": [
        "-y",
        "ccw-mcp"
-      ],
-      "env": {
-        "CCW_ENABLED_TOOLS": "write_file,edit_file,codex_lens,smart_search"
-      }
+      ]
    }
  }
-}
+}
--- a/IMPLEMENTATION_SUMMARY.md
+++ b/IMPLEMENTATION_SUMMARY.md
@@ -1,190 +0,0 @@
-# Implementation Summary: Rules CLI Generation Feature
-
-## Status: ✅ Complete
-
-## Files Modified
-
-### D:\Claude_dms3\ccw\src\core\routes\rules-routes.ts
-
-**Changes:**
-1. Added import for `executeCliTool` from cli-executor
-2. Implemented `generateRuleViaCLI()` function
-3. Modified POST `/api/rules/create` endpoint to support `mode: 'cli-generate'`
-
-## Implementation Details
-
-### 1. New Function: `generateRuleViaCLI()`
-
-**Location:** lines 224-340
-
-**Purpose:** Generate rule content using Gemini CLI based on different generation strategies
-
-**Parameters:**
- `generationType`: 'description' | 'template' | 'extract'
- `description`: Natural language description of the rule
- `templateType`: Template category for structured generation
- `extractScope`: File pattern for code analysis (e.g., 'src/**/*.ts')
- `extractFocus`: Focus areas for extraction (e.g., 'error handling, naming')
- `fileName`: Target filename (must end with .md)
- `location`: 'project' or 'user'
- `subdirectory`: Optional subdirectory path
- `projectPath`: Project root directory
-
-**Process Flow:**
-1. Parse parameters and determine generation type
-2. Build appropriate CLI prompt template based on type
-3. Execute Gemini CLI with:
-   - Tool: 'gemini'
-   - Mode: 'write' for description/template, 'analysis' for extract
-   - Timeout: 10 minutes (600000ms)
-   - Working directory: projectPath
-4. Validate CLI execution result
-5. Extract generated content from stdout
-6. Call `createRule()` to save the file
-7. Return result with execution ID
-
-### 2. Prompt Templates
-
-#### Description Mode (write)
-```
-PURPOSE: Generate Claude Code memory rule from description to guide Claude's behavior
-TASK: • Analyze rule requirements • Generate markdown content with clear instructions
-MODE: write
-EXPECTED: Complete rule content in markdown format
-RULES: $(cat ~/.claude/workflows/cli-templates/prompts/universal/00-universal-rigorous-style.txt)
-```
-
-#### Template Mode (write)
-```
-PURPOSE: Generate Claude Code rule from template type
-TASK: • Create rule based on {templateType} • Generate structured markdown content
-MODE: write
-EXPECTED: Complete rule content in markdown format following template structure
-RULES: $(cat ~/.claude/workflows/cli-templates/prompts/universal/00-universal-rigorous-style.txt)
-```
-
-#### Extract Mode (analysis)
-```
-PURPOSE: Extract coding rules from existing codebase to document patterns and conventions
-TASK: • Analyze code patterns • Extract common conventions • Identify best practices
-MODE: analysis
-CONTEXT: @{extractScope || '**/*'}
-EXPECTED: Rule content based on codebase analysis with examples
-RULES: $(cat ~/.claude/workflows/cli-templates/prompts/analysis/02-analyze-code-patterns.txt)
-```
-
-### 3. API Endpoint Modification
-
-**Endpoint:** POST `/api/rules/create`
-
-**Enhanced Request Body:**
-```json
-{
-  "mode": "cli-generate",           // NEW: triggers CLI generation
-  "generationType": "description",  // NEW: 'description' | 'template' | 'extract'
-  "description": "...",             // NEW: for description mode
-  "templateType": "...",            // NEW: for template mode
-  "extractScope": "src/**/*.ts",    // NEW: for extract mode
-  "extractFocus": "...",            // NEW: for extract mode
-  "fileName": "rule-name.md",       // REQUIRED
-  "location": "project",            // REQUIRED: 'project' | 'user'
-  "subdirectory": "",               // OPTIONAL
-  "projectPath": "..."              // OPTIONAL: defaults to initialPath
-}
-```
-
-**Backward Compatibility:** Existing manual creation still works:
-```json
-{
-  "fileName": "rule-name.md",
-  "content": "# Rule Content\n...",
-  "location": "project",
-  "paths": [],
-  "subdirectory": ""
-}
-```
-
-**Response Format:**
-```json
-{
-  "success": true,
-  "fileName": "rule-name.md",
-  "location": "project",
-  "path": "/absolute/path/to/rule-name.md",
-  "subdirectory": null,
-  "generatedContent": "# Generated Content\n...",
-  "executionId": "1734168000000-gemini"
-}
-```
-
-## Error Handling
-
-### Validation Errors
- Missing `fileName`: "File name is required"
- Missing `location`: "Location is required (project or user)"
- Missing `generationType` in CLI mode: "generationType is required for CLI generation mode"
- Missing `description` for description mode: "description is required for description-based generation"
- Missing `templateType` for template mode: "templateType is required for template-based generation"
- Unknown `generationType`: "Unknown generation type: {type}"
-
-### CLI Execution Errors
- CLI tool failure: Returns `{ error: "CLI execution failed: ...", stderr: "..." }`
- Empty content: Returns `{ error: "CLI execution returned empty content", stdout: "...", stderr: "..." }`
- Timeout: CLI executor will timeout after 10 minutes
- File exists: "Rule '{fileName}' already exists in {location} location"
-
-## Testing
-
-### Test Document
-Created: `D:\Claude_dms3\test-rules-cli-generation.md`
-
-Contains:
- API usage examples for all 3 generation types
- Request/response format examples
- Error handling scenarios
- Integration details
-
-### Compilation Test
-✅ TypeScript compilation successful (`npm run build`)
-
-## Integration Points
-
-### Dependencies
- **cli-executor.ts**: Provides `executeCliTool()` for Gemini execution
- **createRule()**: Existing function for file creation
- **handlePostRequest()**: Existing request handler from RouteContext
-
-### CLI Tool
- **Tool**: Gemini (via `executeCliTool()`)
- **Timeout**: 10 minutes (600000ms)
- **Mode**: 'write' for generation, 'analysis' for extraction
- **Working Directory**: Project path for context access
-
-## Next Steps (Not Implemented)
-
-1. **UI Integration**: Add frontend interface in Rules Manager dashboard
-2. **Streaming Output**: Display CLI execution progress in real-time
-3. **Preview**: Show generated content before saving
-4. **Refinement**: Allow iterative refinement of generated rules
-5. **Templates Library**: Add predefined template types
-6. **History**: Track generation history and allow regeneration
-
-## Verification Checklist
-
- [x] Import cli-executor functions
- [x] Implement `generateRuleViaCLI()` with 3 generation types
- [x] Build appropriate prompts for each type
- [x] Use correct MODE (analysis vs write)
- [x] Set timeout to at least 10 minutes
- [x] Integrate with `createRule()` for file creation
- [x] Modify POST endpoint to support `mode: 'cli-generate'`
- [x] Validate required parameters
- [x] Return unified result format
- [x] Handle errors appropriately
- [x] Maintain backward compatibility
- [x] Verify TypeScript compilation
- [x] Create test documentation
-
-## Files Created
- `D:\Claude_dms3\test-rules-cli-generation.md`: Test documentation
- `D:\Claude_dms3\IMPLEMENTATION_SUMMARY.md`: This file
--- a/ccw/src/core/routes/mcp-routes.ts
+++ b/ccw/src/core/routes/mcp-routes.ts
@@ -77,7 +77,7 @@ function getMcpServersFromFile(filePath) {
 */
 function addMcpServerToMcpJson(projectPath, serverName, serverConfig) {
  try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizePathForFileSystem(projectPath);
    const mcpJsonPath = join(normalizedPath, '.mcp.json');
    
    // Read existing .mcp.json or create new structure
@@ -115,7 +115,7 @@ function addMcpServerToMcpJson(projectPath, serverName, serverConfig) {
 */
 function removeMcpServerFromMcpJson(projectPath, serverName) {
  try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizePathForFileSystem(projectPath);
    const mcpJsonPath = join(normalizedPath, '.mcp.json');
    
    if (!existsSync(mcpJsonPath)) {
@@ -238,22 +238,43 @@ function getMcpConfig() {
 }

 /**
- * Normalize project path for .claude.json (Windows backslash format)
+ * Normalize path to filesystem format (for accessing .mcp.json files)
+ * Always uses forward slashes for cross-platform compatibility
 * @param {string} path
 * @returns {string}
 */
-function normalizeProjectPathForConfig(path) {
-  // Convert forward slashes to backslashes for Windows .claude.json format
-  let normalized = path.replace(/\//g, '\\');
-
-  // Handle /d/path format -> D:\path
-  if (normalized.match(/^\\[a-zA-Z]\\/)) {
+function normalizePathForFileSystem(path) {
+  let normalized = path.replace(/\\/g, '/');
+  
+  // Handle /d/path format -> D:/path
+  if (normalized.match(/^\/[a-zA-Z]\//)) {
    normalized = normalized.charAt(1).toUpperCase() + ':' + normalized.slice(2);
  }
-
+  
  return normalized;
 }

+/**
+ * Normalize project path to match existing format in .claude.json
+ * Checks both forward slash and backslash formats to find existing entry
+ * @param {string} path
+ * @param {Object} claudeConfig - Optional existing config to check format
+ * @returns {string}
+ */
+function normalizeProjectPathForConfig(path, claudeConfig = null) {
+  // IMPORTANT: Always normalize to forward slashes to prevent duplicate entries
+  // (e.g., prevents both "D:/Claude_dms3" and "D:\\Claude_dms3")
+  let normalizedForward = path.replace(/\\/g, '/');
+
+  // Handle /d/path format -> D:/path
+  if (normalizedForward.match(/^\/[a-zA-Z]\//)) {
+    normalizedForward = normalizedForward.charAt(1).toUpperCase() + ':' + normalizedForward.slice(2);
+  }
+
+  // ALWAYS return forward slash format to prevent duplicates
+  return normalizedForward;
+}
+
 /**
 * Toggle MCP server enabled/disabled
 * @param {string} projectPath
@@ -270,7 +291,7 @@ function toggleMcpServerEnabled(projectPath, serverName, enable) {
    const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
    const config = JSON.parse(content);

-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizeProjectPathForConfig(projectPath, config);

    if (!config.projects || !config.projects[normalizedPath]) {
      return { error: `Project not found: ${normalizedPath}` };
@@ -332,7 +353,7 @@ function addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyC
    const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
    const config = JSON.parse(content);

-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizeProjectPathForConfig(projectPath, config);

    // Create project entry if it doesn't exist
    if (!config.projects) {
@@ -387,8 +408,8 @@ function addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyC
 */
 function removeMcpServerFromProject(projectPath, serverName) {
  try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
-    const mcpJsonPath = join(normalizedPath, '.mcp.json');
+    const normalizedPathForFile = normalizePathForFileSystem(projectPath);
+    const mcpJsonPath = join(normalizedPathForFile, '.mcp.json');
    
    let removedFromMcpJson = false;
    let removedFromClaudeJson = false;
@@ -409,6 +430,9 @@ function removeMcpServerFromProject(projectPath, serverName) {
      const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
      const config = JSON.parse(content);

+      // Get normalized path that matches existing config format
+      const normalizedPath = normalizeProjectPathForConfig(projectPath, config);
+
      if (config.projects && config.projects[normalizedPath]) {
        const projectConfig = config.projects[normalizedPath];

@@ -597,11 +621,13 @@ export async function handleMcpRoutes(ctx: RouteContext): Promise<boolean> {
  // API: Copy MCP server to project
  if (pathname === '/api/mcp-copy-server' && req.method === 'POST') {
    handlePostRequest(req, res, async (body) => {
-      const { projectPath, serverName, serverConfig } = body;
+      const { projectPath, serverName, serverConfig, configType } = body;
      if (!projectPath || !serverName || !serverConfig) {
        return { error: 'projectPath, serverName, and serverConfig are required', status: 400 };
      }
-      return addMcpServerToProject(projectPath, serverName, serverConfig);
+      // configType: 'mcp' = use .mcp.json (default), 'claude' = use .claude.json
+      const useLegacyConfig = configType === 'claude';
+      return addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyConfig);
    });
    return true;
  }
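
Review note: both normalizers in `mcp-routes.ts` apply the same rule. Backslashes become forward slashes, and a Git-Bash-style `/d/path` prefix becomes `D:/path`. A minimal standalone sketch of that rule follows; the helper name `toForwardSlashPath` is hypothetical and not part of this commit.

```typescript
// Hypothetical helper (not in the commit) that mirrors the normalization rule from the diff.
function toForwardSlashPath(path: string): string {
  // Convert every backslash to a forward slash.
  let normalized = path.replace(/\\/g, '/');

  // Map a POSIX-style drive prefix such as /d/... to D:/...
  if (/^\/[a-zA-Z]\//.test(normalized)) {
    normalized = normalized.charAt(1).toUpperCase() + ':' + normalized.slice(2);
  }

  return normalized;
}

console.log(toForwardSlashPath('D:\\Claude_dms3')); // D:/Claude_dms3
console.log(toForwardSlashPath('/d/Claude_dms3'));  // D:/Claude_dms3
```

Both spellings collapse to one key, which is what prevents duplicate project entries. The drive-prefix rule also matches a genuine POSIX path whose first segment is a single letter (for example `/a/b`), so the sketch assumes Windows-style inputs.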
--- a/ccw/src/core/routes/mcp-templates-db.ts
+++ b/ccw/src/core/routes/mcp-templates-db.ts
--- a/ccw/src/core/routes/memory-routes.ts
+++ b/ccw/src/core/routes/memory-routes.ts
@@ -733,7 +733,7 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
    }

    try {
-      const configPath = join(projectPath, '.claude', 'rules', 'active_memory.md');
+      const configPath = join(projectPath, '.claude', 'CLAUDE.md');
      const configJsonPath = join(projectPath, '.claude', 'active_memory_config.json');
      const enabled = existsSync(configPath);
      let lastSync: string | null = null;
@@ -784,16 +784,12 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
          return;
        }

-        const rulesDir = join(projectPath, '.claude', 'rules');
        const claudeDir = join(projectPath, '.claude');
-        const configPath = join(rulesDir, 'active_memory.md');
+        const configPath = join(claudeDir, 'CLAUDE.md');
        const configJsonPath = join(claudeDir, 'active_memory_config.json');

        if (enabled) {
          // Enable: Create directories and initial file
-          if (!existsSync(rulesDir)) {
-            mkdirSync(rulesDir, { recursive: true });
-          }
          if (!existsSync(claudeDir)) {
            mkdirSync(claudeDir, { recursive: true });
          }
@@ -803,8 +799,8 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
            writeFileSync(configJsonPath, JSON.stringify(config, null, 2), 'utf-8');
          }

-          // Create initial active_memory.md with header
-          const initialContent = `# Active Memory
+          // Create initial CLAUDE.md with header
+          const initialContent = `# CLAUDE.md - Project Memory

 > Auto-generated understanding of frequently accessed files.
 > Last updated: ${new Date().toISOString()}
@@ -867,7 +863,7 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
    return true;
  }

-  // API: Active Memory - Sync (analyze hot files using CLI and update active_memory.md)
+  // API: Active Memory - Sync (analyze hot files using CLI and update CLAUDE.md)
  if (pathname === '/api/memory/active/sync' && req.method === 'POST') {
    let body = '';
    req.on('data', (chunk: Buffer) => { body += chunk.toString(); });
@@ -882,8 +878,8 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
          return;
        }

-        const rulesDir = join(projectPath, '.claude', 'rules');
-        const configPath = join(rulesDir, 'active_memory.md');
+        const claudeDir = join(projectPath, '.claude');
+        const configPath = join(claudeDir, 'CLAUDE.md');

        // Get hot files from memory store - with fallback
        let hotFiles: any[] = [];
@@ -903,8 +899,8 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
          return isAbsolute(filePath) ? filePath : join(projectPath, filePath);
        }).filter((p: string) => existsSync(p));

-        // Build the active memory content header
-        let content = `# Active Memory
+        // Build the CLAUDE.md content header
+        let content = `# CLAUDE.md - Project Memory

 > Auto-generated understanding of frequently accessed files using ${tool.toUpperCase()}.
 > Last updated: ${new Date().toISOString()}
@@ -942,14 +938,29 @@ RULES: Be concise. Focus on practical understanding. Include function signatures
          });

          if (result.success && result.execution?.output) {
-            // Extract stdout from output object
-            cliOutput = typeof result.execution.output === 'string'
-              ? result.execution.output
-              : result.execution.output.stdout || '';
+            // Extract stdout from output object with proper serialization
+            const output = result.execution.output;
+            if (typeof output === 'string') {
+              cliOutput = output;
+            } else if (output && typeof output === 'object') {
+              // Handle object output - extract stdout or serialize the object
+              if (output.stdout && typeof output.stdout === 'string') {
+                cliOutput = output.stdout;
+              } else if (output.stderr && typeof output.stderr === 'string') {
+                cliOutput = output.stderr;
+              } else {
+                // Last resort: serialize the entire object as JSON
+                cliOutput = JSON.stringify(output, null, 2);
+              }
+            } else {
+              cliOutput = '';
+            }
          }

-          // Add CLI output to content
-          content += cliOutput + '\n\n---\n\n';
+          // Add CLI output to content (only if not empty)
+          if (cliOutput && cliOutput.trim()) {
+            content += cliOutput + '\n\n---\n\n';
+          }

        } catch (cliErr) {
          // Fallback to basic analysis if CLI fails
@@ -1007,8 +1018,8 @@ RULES: Be concise. Focus on practical understanding. Include function signatures
        }

        // Ensure directory exists
-        if (!existsSync(rulesDir)) {
-          mkdirSync(rulesDir, { recursive: true });
+        if (!existsSync(claudeDir)) {
+          mkdirSync(claudeDir, { recursive: true });
        }

        // Write the file
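
Review note: the sync handler now resolves CLI output in a fixed order (plain string, then `stdout`, then `stderr`, then a JSON dump) and appends a section only when the result is non-blank. The sketch below isolates that logic as two pure functions. The names `extractCliOutput` and `appendSection` and the `CliOutput` type are assumptions for illustration; the commit keeps this logic inline.

```typescript
// Shape assumed from the diff: a string, or an object with optional stdout/stderr fields.
type CliOutput = string | { stdout?: unknown; stderr?: unknown } | null | undefined;

// Hypothetical helper: pick the first usable text, falling back to a JSON dump of the object.
function extractCliOutput(output: CliOutput): string {
  if (typeof output === 'string') return output;
  if (output && typeof output === 'object') {
    if (typeof output.stdout === 'string' && output.stdout) return output.stdout;
    if (typeof output.stderr === 'string' && output.stderr) return output.stderr;
    // Last resort: serialize the whole object.
    return JSON.stringify(output, null, 2);
  }
  return '';
}

// Hypothetical helper: append a section only when the CLI produced non-blank text.
function appendSection(content: string, cliOutput: string): string {
  return cliOutput.trim() ? content + cliOutput + '\n\n---\n\n' : content;
}
```

Keeping the extraction pure would make the fallback order testable without spawning a CLI process.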
--- a/ccw/src/templates/dashboard-js/components/mcp-manager.js
+++ b/ccw/src/templates/dashboard-js/components/mcp-manager.js
@@ -87,15 +87,23 @@ async function toggleMcpServer(serverName, enable) {
  }
 }

-async function copyMcpServerToProject(serverName, serverConfig) {
+async function copyMcpServerToProject(serverName, serverConfig, configType = null) {
  try {
+    // If configType not specified, ask user to choose
+    if (!configType) {
+      const choice = await showConfigTypeDialog();
+      if (!choice) return null; // User cancelled
+      configType = choice;
+    }
+
    const response = await fetch('/api/mcp-copy-server', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        projectPath: projectPath,
        serverName: serverName,
-        serverConfig: serverConfig
+        serverConfig: serverConfig,
+        configType: configType  // 'claude' for .claude.json, 'mcp' for .mcp.json
      })
    });

@@ -105,7 +113,8 @@ async function copyMcpServerToProject(serverName, serverConfig) {
    if (result.success) {
      await loadMcpConfig();
      renderMcpManager();
-      showRefreshToast(`MCP server "${serverName}" added to project`, 'success');
+      const location = configType === 'mcp' ? '.mcp.json' : '.claude.json';
+      showRefreshToast(`MCP server "${serverName}" added to project (${location})`, 'success');
    }
    return result;
  } catch (err) {
@@ -115,6 +124,53 @@ async function copyMcpServerToProject(serverName, serverConfig) {
  }
 }

+// Show dialog to let user choose config type
+function showConfigTypeDialog() {
+  return new Promise((resolve) => {
+    const dialog = document.createElement('div');
+    dialog.className = 'fixed inset-0 bg-black/50 flex items-center justify-center z-50';
+    dialog.innerHTML = `
+      <div class="bg-card border border-border rounded-lg shadow-lg p-6 max-w-md w-full mx-4">
+        <h3 class="text-lg font-semibold mb-4">${t('mcp.chooseInstallLocation')}</h3>
+        <div class="space-y-3 mb-6">
+          <button class="config-type-option w-full text-left px-4 py-3 border border-border rounded-lg hover:bg-accent hover:border-primary transition-all" data-type="claude">
+            <div class="font-medium">${t('mcp.installToClaudeJson')}</div>
+            <div class="text-sm text-muted-foreground mt-1">${t('mcp.claudeJsonDesc')}</div>
+          </button>
+          <button class="config-type-option w-full text-left px-4 py-3 border border-border rounded-lg hover:bg-accent hover:border-primary transition-all" data-type="mcp">
+            <div class="font-medium">${t('mcp.installToMcpJson')}</div>
+            <div class="text-sm text-muted-foreground mt-1">${t('mcp.mcpJsonDesc')}</div>
+          </button>
+        </div>
+        <button class="cancel-btn w-full px-4 py-2 border border-border rounded-lg hover:bg-accent transition-colors">${t('common.cancel')}</button>
+      </div>
+    `;
+    document.body.appendChild(dialog);
+
+    const options = dialog.querySelectorAll('.config-type-option');
+    options.forEach(btn => {
+      btn.addEventListener('click', () => {
+        resolve(btn.dataset.type);
+        document.body.removeChild(dialog);
+      });
+    });
+
+    const cancelBtn = dialog.querySelector('.cancel-btn');
+    cancelBtn.addEventListener('click', () => {
+      resolve(null);
+      document.body.removeChild(dialog);
+    });
+
+    // Close on backdrop click
+    dialog.addEventListener('click', (e) => {
+      if (e.target === dialog) {
+        resolve(null);
+        document.body.removeChild(dialog);
+      }
+    });
+  });
+}
+
 async function removeMcpServerFromProject(serverName) {
  try {
    const response = await fetch('/api/mcp-remove-server', {
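
Review note: the dashboard sends `configType` and the route maps `'claude'` to the legacy `.claude.json` path, treating anything else as `.mcp.json`. The toast instead checks `configType === 'mcp'`, so the two sides would disagree if the value were ever missing. A sketch of a single shared mapping, using a hypothetical helper `resolveConfigTarget` that is not part of this commit:

```typescript
type ConfigType = 'mcp' | 'claude';

// Hypothetical helper: derive the server-side flag and the display file name in one place.
function resolveConfigTarget(
  configType?: ConfigType | null
): { useLegacyConfig: boolean; fileName: string } {
  // Only an explicit 'claude' selects the legacy .claude.json location.
  const useLegacyConfig = configType === 'claude';
  return {
    useLegacyConfig,
    fileName: useLegacyConfig ? '.claude.json' : '.mcp.json',
  };
}

console.log(resolveConfigTarget('claude').fileName); // .claude.json
console.log(resolveConfigTarget().fileName);         // .mcp.json
```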
--- a/ccw/src/templates/dashboard-js/i18n.js
+++ b/ccw/src/templates/dashboard-js/i18n.js
@@ -431,7 +431,31 @@ const i18n = {
    'mcp.jsonFormatsHint': 'Supports {"servers": {...}}, {"mcpServers": {...}}, and direct server config formats.',
    'mcp.previewServers': 'Preview (servers to be added):',
    'mcp.create': 'Create',
-    
+    'mcp.chooseInstallLocation': 'Choose Installation Location',
+    'mcp.installToClaudeJson': 'Install to .claude.json',
+    'mcp.installToMcpJson': 'Install to .mcp.json (Recommended)',
+    'mcp.claudeJsonDesc': 'Save in root .claude.json projects section (shared config)',
+    'mcp.mcpJsonDesc': 'Save in project .mcp.json file (recommended for version control)',
+
+    // MCP Templates
+    'mcp.templates': 'MCP Templates',
+    'mcp.savedTemplates': 'saved templates',
+    'mcp.saveAsTemplate': 'Save as Template',
+    'mcp.enterTemplateName': 'Enter template name',
+    'mcp.enterTemplateDesc': 'Enter template description (optional)',
+    'mcp.enterServerName': 'Enter server name',
+    'mcp.templateSaved': 'Template "{name}" saved successfully',
+    'mcp.templateSaveFailed': 'Failed to save template: {error}',
+    'mcp.templateNotFound': 'Template "{name}" not found',
+    'mcp.templateInstalled': 'Server "{name}" installed successfully',
+    'mcp.templateInstallFailed': 'Failed to install template: {error}',
+    'mcp.deleteTemplate': 'Delete Template',
+    'mcp.deleteTemplateConfirm': 'Delete template "{name}"?',
+    'mcp.templateDeleted': 'Template "{name}" deleted successfully',
+    'mcp.templateDeleteFailed': 'Failed to delete template: {error}',
+    'mcp.toProject': 'To Project',
+    'mcp.toGlobal': 'To Global',
+
    // Hook Manager
    'hook.projectHooks': 'Project Hooks',
    'hook.projectFile': '.claude/settings.json',
@@ -1346,6 +1370,11 @@ const i18n = {
    'mcp.jsonFormatsHint': '支持 {"servers": {...}}、{"mcpServers": {...}} 和直接服务器配置格式。',
    'mcp.previewServers': '预览（将添加的服务器）:',
    'mcp.create': '创建',
+    'mcp.chooseInstallLocation': '选择安装位置',
+    'mcp.installToClaudeJson': '安装到 .claude.json',
+    'mcp.installToMcpJson': '安装到 .mcp.json（推荐）',
+    'mcp.claudeJsonDesc': '保存在根目录 .claude.json projects 字段下（共享配置）',
+    'mcp.mcpJsonDesc': '保存在项目 .mcp.json 文件中（推荐用于版本控制）',
    
    // Hook Manager
    'hook.projectHooks': '项目钩子',
--- a/ccw/src/templates/dashboard-js/views/mcp-manager.js
+++ b/ccw/src/templates/dashboard-js/views/mcp-manager.js
@@ -43,6 +43,9 @@ async function renderMcpManager() {
    await loadMcpConfig();
  }

+  // Load MCP templates
+  await loadMcpTemplates();
+
  const currentPath = projectPath.replace(/\//g, '\\');
  const projectData = mcpAllProjects[currentPath] || {};
  const projectServers = projectData.mcpServers || {};
@@ -269,6 +272,77 @@ async function renderMcpManager() {
        `}
      </div>

+      <!-- MCP Templates Section -->
+      ${mcpTemplates.length > 0 ? `
+      <div class="mcp-section mt-6">
+        <div class="flex items-center justify-between mb-4">
+          <h3 class="text-lg font-semibold text-foreground flex items-center gap-2">
+            <i data-lucide="layout-template" class="w-5 h-5"></i>
+            ${t('mcp.templates')}
+          </h3>
+          <span class="text-sm text-muted-foreground">${mcpTemplates.length} ${t('mcp.savedTemplates')}</span>
+        </div>
+
+        <div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
+          ${mcpTemplates.map(template => `
+            <div class="mcp-template-card bg-card border border-border rounded-lg p-4 hover:shadow-md transition-all">
+              <div class="flex items-start justify-between mb-3">
+                <div class="flex-1 min-w-0">
+                  <h4 class="font-semibold text-foreground truncate flex items-center gap-2">
+                    <i data-lucide="layout-template" class="w-4 h-4 shrink-0"></i>
+                    <span class="truncate">${escapeHtml(template.name)}</span>
+                  </h4>
+                  ${template.description ? `
+                    <p class="text-xs text-muted-foreground mt-1 line-clamp-2">${escapeHtml(template.description)}</p>
+                  ` : ''}
+                </div>
+              </div>
+
+              <div class="mcp-server-details text-sm space-y-1 mb-3">
+                <div class="flex items-center gap-2 text-muted-foreground">
+                  <span class="font-mono text-xs bg-muted px-1.5 py-0.5 rounded">cmd</span>
+                  <span class="truncate text-xs" title="${escapeHtml(template.serverConfig.command)}">${escapeHtml(template.serverConfig.command)}</span>
+                </div>
+                ${template.serverConfig.args && template.serverConfig.args.length > 0 ? `
+                  <div class="flex items-start gap-2 text-muted-foreground">
+                    <span class="font-mono text-xs bg-muted px-1.5 py-0.5 rounded shrink-0">args</span>
+                    <span class="text-xs font-mono truncate" title="${escapeHtml(template.serverConfig.args.join(' '))}">${escapeHtml(template.serverConfig.args.slice(0, 2).join(' '))}${template.serverConfig.args.length > 2 ? '...' : ''}</span>
+                  </div>
+                ` : ''}
+              </div>
+
+              <div class="mt-3 pt-3 border-t border-border flex items-center justify-between gap-2">
+                <div class="flex items-center gap-2">
+                  <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
+                          data-template-name="${escapeHtml(template.name)}"
+                          data-scope="project"
+                          data-action="install-template"
+                          title="${t('mcp.installToProject')}">
+                    <i data-lucide="download" class="w-3 h-3"></i>
+                    ${t('mcp.toProject')}
+                  </button>
+                  <button class="text-xs text-success hover:text-success/80 transition-colors flex items-center gap-1"
+                          data-template-name="${escapeHtml(template.name)}"
+                          data-scope="global"
+                          data-action="install-template"
+                          title="${t('mcp.installToGlobal')}">
+                    <i data-lucide="globe" class="w-3 h-3"></i>
+                    ${t('mcp.toGlobal')}
+                  </button>
+                </div>
+                <button class="text-xs text-destructive hover:text-destructive/80 transition-colors"
+                        data-template-name="${escapeHtml(template.name)}"
+                        data-action="delete-template"
+                        title="${t('mcp.deleteTemplate')}">
+                  <i data-lucide="trash-2" class="w-3 h-3"></i>
+                </button>
+              </div>
+            </div>
+          `).join('')}
+        </div>
+      </div>
+      ` : ''}
+
      <!-- All Projects MCP Overview Table -->
      <div class="mcp-section mt-6">
        <div class="flex items-center justify-between mb-4">
@@ -402,15 +476,25 @@ function renderProjectAvailableServerCard(entry) {
        ` : ''}
      </div>

-      <div class="mt-3 pt-3 border-t border-border flex items-center justify-between">
-        <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
-                data-server-name="${escapeHtml(name)}"
-                data-server-config="${escapeHtml(JSON.stringify(config))}"
-                data-scope="${source === 'global' ? 'global' : 'project'}"
-                data-action="copy-install-cmd">
-          <i data-lucide="copy" class="w-3 h-3"></i>
-          ${t('mcp.copyInstallCmd')}
-        </button>
+      <div class="mt-3 pt-3 border-t border-border flex items-center justify-between gap-2">
+        <div class="flex items-center gap-2">
+          <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
+                  data-server-name="${escapeHtml(name)}"
+                  data-server-config="${escapeHtml(JSON.stringify(config))}"
+                  data-scope="${source === 'global' ? 'global' : 'project'}"
+                  data-action="copy-install-cmd">
+            <i data-lucide="copy" class="w-3 h-3"></i>
+            ${t('mcp.copyInstallCmd')}
+          </button>
+          <button class="text-xs text-success hover:text-success/80 transition-colors flex items-center gap-1"
+                  data-server-name="${escapeHtml(name)}"
+                  data-server-config="${escapeHtml(JSON.stringify(config))}"
+                  data-action="save-as-template"
+                  title="${t('mcp.saveAsTemplate')}">
+            <i data-lucide="save" class="w-3 h-3"></i>
+            ${t('mcp.saveAsTemplate')}
+          </button>
+        </div>
        ${canRemove ? `
          <button class="text-xs text-destructive hover:text-destructive/80 transition-colors"
                  data-server-name="${escapeHtml(name)}"
@@ -617,4 +701,156 @@ function attachMcpEventListeners() {
      await copyMcpInstallCommand(serverName, serverConfig, scope);
    });
  });
+
+  // Save as template buttons
+  document.querySelectorAll('.mcp-server-card button[data-action="save-as-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const serverName = btn.dataset.serverName;
+      const serverConfig = JSON.parse(btn.dataset.serverConfig);
+      await saveMcpAsTemplate(serverName, serverConfig);
+    });
+  });
+
+  // Install from template buttons
+  document.querySelectorAll('.mcp-template-card button[data-action="install-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const templateName = btn.dataset.templateName;
+      const scope = btn.dataset.scope || 'project';
+      await installFromTemplate(templateName, scope);
+    });
+  });
+
+  // Delete template buttons
+  document.querySelectorAll('.mcp-template-card button[data-action="delete-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const templateName = btn.dataset.templateName;
+      if (confirm(t('mcp.deleteTemplateConfirm', { name: templateName }))) {
+        await deleteMcpTemplate(templateName);
+      }
+    });
+  });
+}
+
+// ========================================
+// MCP Template Management Functions
+// ========================================
+
+let mcpTemplates = [];
+
+/**
+ * Load all MCP templates from API
+ */
+async function loadMcpTemplates() {
+  try {
+    const response = await fetch('/api/mcp-templates');
+    const data = await response.json();
+
+    if (data.success) {
+      mcpTemplates = data.templates || [];
+      console.log('[MCP Templates] Loaded', mcpTemplates.length, 'templates');
+    } else {
+      console.error('[MCP Templates] Failed to load:', data.error);
+      mcpTemplates = [];
+    }
+
+    return mcpTemplates;
+  } catch (error) {
+    console.error('[MCP Templates] Error loading templates:', error);
+    mcpTemplates = [];
+    return [];
+  }
+}
+
+/**
+ * Save MCP server configuration as a template
+ */
+async function saveMcpAsTemplate(serverName, serverConfig) {
+  try {
+    // Prompt for template name and description
+    const templateName = prompt(t('mcp.enterTemplateName'), serverName);
+    if (!templateName) return;
+
+    const description = prompt(t('mcp.enterTemplateDesc'), `Template for ${serverName}`);
+
+    const payload = {
+      name: templateName,
+      description: description || '',
+      serverConfig: serverConfig,
+      category: 'user'
+    };
+
+    const response = await fetch('/api/mcp-templates', {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify(payload)
+    });
+
+    const data = await response.json();
+
+    if (data.success) {
+      showNotification(t('mcp.templateSaved', { name: templateName }), 'success');
+      await loadMcpTemplates();
+      await renderMcpManager(); // Refresh view
+    } else {
+      showNotification(t('mcp.templateSaveFailed', { error: data.error }), 'error');
+    }
+  } catch (error) {
+    console.error('[MCP] Save template error:', error);
+    showNotification(t('mcp.templateSaveFailed', { error: error.message }), 'error');
+  }
+}
+
+/**
+ * Install MCP server from template
+ */
+async function installFromTemplate(templateName, scope = 'project') {
+  try {
+    // Find template
+    const template = mcpTemplates.find(tpl => tpl.name === templateName); // `tpl` avoids shadowing the i18n helper t()
+    if (!template) {
+      showNotification(t('mcp.templateNotFound', { name: templateName }), 'error');
+      return;
+    }
+
+    // Prompt for server name (default to template name)
+    const serverName = prompt(t('mcp.enterServerName'), templateName);
+    if (!serverName) return;
+
+    // Install based on scope
+    if (scope === 'project') {
+      await installMcpToProject(serverName, template.serverConfig);
+    } else if (scope === 'global') {
+      await addGlobalMcpServer(serverName, template.serverConfig);
+    }
+
+    showNotification(t('mcp.templateInstalled', { name: serverName }), 'success');
+    await renderMcpManager();
+  } catch (error) {
+    console.error('[MCP] Install from template error:', error);
+    showNotification(t('mcp.templateInstallFailed', { error: error.message }), 'error');
+  }
+}
+
+/**
+ * Delete MCP template
+ */
+async function deleteMcpTemplate(templateName) {
+  try {
+    const response = await fetch(`/api/mcp-templates/${encodeURIComponent(templateName)}`, {
+      method: 'DELETE'
+    });
+
+    const data = await response.json();
+
+    if (data.success) {
+      showNotification(t('mcp.templateDeleted', { name: templateName }), 'success');
+      await loadMcpTemplates();
+      await renderMcpManager();
+    } else {
+      showNotification(t('mcp.templateDeleteFailed', { error: data.error }), 'error');
+    }
+  } catch (error) {
+    console.error('[MCP] Delete template error:', error);
+    showNotification(t('mcp.templateDeleteFailed', { error: error.message }), 'error');
+  }
 }
--- a/ccw/src/templates/dashboard-js/views/rules-manager.js
+++ b/ccw/src/templates/dashboard-js/views/rules-manager.js
@@ -588,9 +588,21 @@ function closeRuleCreateModal(event) {

 function selectRuleLocation(location) {
  ruleCreateState.location = location;
-  // Re-render modal
-  closeRuleCreateModal();
-  openRuleCreateModal();
+
+  // Update button styles without re-rendering modal
+  const buttons = document.querySelectorAll('.location-btn');
+  buttons.forEach(btn => {
+    const isProject = btn.querySelector('.font-medium')?.textContent?.includes(t('rules.projectRules'));
+    const isUser = btn.querySelector('.font-medium')?.textContent?.includes(t('rules.userRules'));
+
+    if ((isProject && location === 'project') || (isUser && location === 'user')) {
+      btn.classList.remove('border-border', 'hover:border-primary/50');
+      btn.classList.add('border-primary', 'bg-primary/10');
+    } else {
+      btn.classList.remove('border-primary', 'bg-primary/10');
+      btn.classList.add('border-border', 'hover:border-primary/50');
+    }
+  });
 }

 function toggleRuleConditional() {
--- a/ccw/src/templates/dashboard-js/views/skills-manager.js
+++ b/ccw/src/templates/dashboard-js/views/skills-manager.js
@@ -569,9 +569,21 @@ function closeSkillCreateModal(event) {

 function selectSkillLocation(location) {
  skillCreateState.location = location;
-  // Re-render modal
-  closeSkillCreateModal();
-  openSkillCreateModal();
+
+  // Update button styles without re-rendering modal
+  const buttons = document.querySelectorAll('.location-btn');
+  buttons.forEach(btn => {
+    const isProject = btn.querySelector('.font-medium')?.textContent?.includes(t('skills.projectSkills'));
+    const isUser = btn.querySelector('.font-medium')?.textContent?.includes(t('skills.userSkills'));
+
+    if ((isProject && location === 'project') || (isUser && location === 'user')) {
+      btn.classList.remove('border-border', 'hover:border-primary/50');
+      btn.classList.add('border-primary', 'bg-primary/10');
+    } else {
+      btn.classList.remove('border-primary', 'bg-primary/10');
+      btn.classList.add('border-border', 'hover:border-primary/50');
+    }
+  });
 }

 function switchSkillCreateMode(mode) {
--- a/codex-lens/docs/DESIGN_EVALUATION_REPORT.md
+++ b/codex-lens/docs/DESIGN_EVALUATION_REPORT.md
--- a/codex-lens/docs/DOCSTRING_LLM_HYBRID_DESIGN.md
+++ b/codex-lens/docs/DOCSTRING_LLM_HYBRID_DESIGN.md
@@ -0,0 +1,972 @@
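+# Docstring and LLM Hybrid Strategy Design
+
+## 1. Background and Goals
+
+### 1.1 Current Problems
+
+The existing `llm_enhancer.py` implementation has the following problems:
+
+1. **Ignores existing documentation**: it calls the LLM for all code indiscriminately, even when a high-quality docstring already exists
+2. **Wasted cost**: it regenerates information that already exists, increasing API cost and processing time
+3. **Inconsistent information quality**: LLM-generated content may be less accurate than a docstring written by the author
+4. **Missing author intent**: key information in the docstring, such as design decisions and usage examples, is lost
+
+### 1.2 Design Goals
+
+Implement a **smart hybrid strategy** that combines the strengths of docstrings and the LLM:
+
+1. **Prefer the docstring**: treat it as the most authoritative source of information
+2. **LLM as a supplement**: fill in where the docstring is missing or of insufficient quality
+3. **Automatic quality evaluation**: judge docstring quality automatically and decide whether LLM enhancement is needed
+4. **Cost optimization**: reduce unnecessary LLM calls and lower API cost
+5. **Information fusion**: merge docstring content and LLM-generated content into one coherent result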
+
+## 2. Technical Architecture
+
+### 2.1 Overall Flow
+
+```
+Code Symbol
+    ↓
+[Docstring Extractor] ← extract the docstring
+    ↓
+[Quality Evaluator]   ← evaluate docstring quality
+    ↓
+    ├─ High Quality → Use Docstring Directly
+    │                 + LLM Generate Keywords Only
+    │
+    ├─ Medium Quality → LLM Refine & Enhance
+    │                   (docstring as the base)
+    │
+    └─ Low/No Docstring → LLM Full Generation
+                          (existing flow)
+    ↓
+[Metadata Merger]     ← merge docstring and LLM content
+    ↓
+Final SemanticMetadata
+```
+
+### 2.2 Core Components
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+from typing import Optional
+
+class DocstringQuality(Enum):
+    """Docstring quality levels"""
+    MISSING = "missing"           # No docstring
+    LOW = "low"                   # Low quality: <10 characters or a pure placeholder
+    MEDIUM = "medium"             # Medium quality: basic description but incomplete
+    HIGH = "high"                 # High quality: detailed and structured
+
+@dataclass
+class DocstringMetadata:
+    """Metadata extracted from a docstring"""
+    raw_text: str
+    quality: DocstringQuality
+    summary: Optional[str] = None       # Extracted summary
+    parameters: Optional[dict] = None   # Parameter descriptions
+    returns: Optional[str] = None       # Return value description
+    examples: Optional[str] = None      # Usage examples
+    notes: Optional[str] = None         # Notes and caveats
+```
+
+## 3. Detailed Implementation Steps
+
+### 3.1 Docstring Extraction and Parsing
+
+```python
+import re
+from typing import List, Optional
+
+# Symbol, FileData, SemanticMetadata and LLMEnhancer come from the existing code base
+
+class DocstringExtractor:
+    """Docstring extractor"""
+
+    # Regexes for detecting the docstring style
+    GOOGLE_STYLE_PATTERN = re.compile(
+        r'Args:|Returns:|Raises:|Examples:|Note:',
+        re.MULTILINE
+    )
+
+    NUMPY_STYLE_PATTERN = re.compile(
+        r'Parameters\n-+|Returns\n-+|Examples\n-+',
+        re.MULTILINE
+    )
+
+    def extract_from_code(self, content: str, symbol: Symbol) -> Optional[str]:
+        """Extract the docstring of a symbol from source code"""
+
+        lines = content.splitlines()
+        start_line = symbol.range[0] - 1  # 0-indexed
+        # Stay inside the symbol so a neighbouring function's docstring is never picked up
+        end_line = min(start_line + 10, symbol.range[1], len(lines))
+
+        # Look for the first string literal after the definition,
+        # usually on the next line or within a few lines of it
+        for i in range(start_line + 1, end_line):
+            line = lines[i].strip()
+
+            # Python triple-quoted string
+            if line.startswith('"""') or line.startswith("'''"):
+                return self._extract_multiline_docstring(lines, i)
+
+        return None
+
+    def _extract_multiline_docstring(
+        self,
+        lines: List[str],
+        start_idx: int
+    ) -> str:
+        """Extract the docstring that starts at lines[start_idx]"""
+
+        first_line = lines[start_idx].strip()
+        quote = '"""' if first_line.startswith('"""') else "'''"
+
+        # Single-line docstring: """This is a docstring."""
+        if first_line.count(quote) >= 2:
+            return first_line[len(quote):first_line.index(quote, len(quote))].strip()
+
+        # Multi-line docstring: drop the opening quotes from the first line.
+        # Slicing is used instead of lstrip/rstrip, which strip character sets
+        # and could also remove quote characters that belong to the text
+        docstring_lines = [first_line[len(quote):]]
+        for line in lines[start_idx + 1:]:
+            if quote in line:
+                # Closing line: keep only the text before the closing quotes
+                docstring_lines.append(line.split(quote, 1)[0].strip())
+                break
+            docstring_lines.append(line.strip())
+
+        return '\n'.join(docstring_lines).strip()
+
+    def parse_docstring(self, raw_docstring: str) -> DocstringMetadata:
+        """Parse a docstring and extract structured information"""
+
+        if not raw_docstring:
+            return DocstringMetadata(
+                raw_text="",
+                quality=DocstringQuality.MISSING
+            )
+
+        # Evaluate quality
+        quality = self._evaluate_quality(raw_docstring)
+
+        # Extract the individual parts
+        metadata = DocstringMetadata(
+            raw_text=raw_docstring,
+            quality=quality,
+        )
+
+        # Extract the summary (first line or first paragraph)
+        metadata.summary = self._extract_summary(raw_docstring)
+
+        # For Google or NumPy style, extract the structured sections
+        if self.GOOGLE_STYLE_PATTERN.search(raw_docstring):
+            self._parse_google_style(raw_docstring, metadata)
+        elif self.NUMPY_STYLE_PATTERN.search(raw_docstring):
+            self._parse_numpy_style(raw_docstring, metadata)
+
+        return metadata
+
+    def _evaluate_quality(self, docstring: str) -> DocstringQuality:
+        """Evaluate docstring quality"""
+
+        text = docstring.strip() if docstring else ""
+        if not text:
+            return DocstringQuality.MISSING
+
+        # Placeholder check: a leading TODO/FIXME/TBD marker, the bare word
+        # "placeholder", or punctuation only (e.g. "..."). A plain substring test
+        # would also downgrade real docstrings that merely contain "..."
+        if re.match(r'(todo|fixme|tbd)\b|placeholder\W*$|\W*$', text.lower()):
+            return DocstringQuality.LOW
+
+        # Length check
+        if len(text) < 10:
+            return DocstringQuality.LOW
+
+        # Check for structured sections
+        has_structure = bool(
+            self.GOOGLE_STYLE_PATTERN.search(text) or
+            self.NUMPY_STYLE_PATTERN.search(text)
+        )
+
+        # Check for enough descriptive text
+        word_count = len(text.split())
+
+        if has_structure and word_count >= 20:
+            return DocstringQuality.HIGH
+        elif word_count >= 10:
+            return DocstringQuality.MEDIUM
+        else:
+            return DocstringQuality.LOW
+
+    def _extract_summary(self, docstring: str) -> str:
+        """Extract the summary (the first non-empty line)"""
+
+        lines = docstring.split('\n')
+        # The first non-empty line is the summary
+        for line in lines:
+            if line.strip():
+                return line.strip()
+
+        return ""
+
+    def _parse_google_style(self, docstring: str, metadata: DocstringMetadata):
+        """Parse a Google-style docstring"""
+
+        # Extract Args
+        args_match = re.search(r'Args:(.*?)(?=Returns:|Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
+        if args_match:
+            metadata.parameters = self._parse_args_section(args_match.group(1))
+
+        # Extract Returns
+        returns_match = re.search(r'Returns:(.*?)(?=Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
+        if returns_match:
+            metadata.returns = returns_match.group(1).strip()
+
+        # Extract Examples
+        examples_match = re.search(r'Examples:(.*?)(?=Note:|\Z)', docstring, re.DOTALL)
+        if examples_match:
+            metadata.examples = examples_match.group(1).strip()
+
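+    def _parse_args_section(self, args_text: str) -> dict:
+        """Parse the parameter list"""
+
+        params = {}
+        # Matches "param_name (type): description" or "param_name: description"
+        pattern = re.compile(r'(\w+)\s*(?:\(([^)]+)\))?\s*:\s*(.+)')
+
+        for line in args_text.split('\n'):
+            match = pattern.search(line.strip())
+            if match:
+                param_name, param_type, description = match.groups()
+                params[param_name] = {
+                    'type': param_type,
+                    'description': description.strip()
+                }
+
+        return params
+
+    def _parse_numpy_style(self, docstring: str, metadata: DocstringMetadata):
+        """Parse a NumPy-style docstring (minimal: only "name : type" headers are read)"""
+
+        # parse_docstring calls this method, so it must exist.
+        # re.split with a capture group yields [preamble, name, body, name, body, ...]
+        parts = re.split(r'^(Parameters|Returns|Examples)\n-+\n?', docstring, flags=re.MULTILINE)
+        sections = {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
+
+        if 'Parameters' in sections:
+            metadata.parameters = {
+                m.group(1): {'type': m.group(2).strip(), 'description': ''}
+                for m in re.finditer(r'^(\w+) : (.+)$', sections['Parameters'], re.MULTILINE)
+            }
+        metadata.returns = sections.get('Returns')
+        metadata.examples = sections.get('Examples')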
+```
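The quality rule in section 3.1 can be tried on sample docstrings without the rest of the pipeline. The following is a minimal sketch, not the project's implementation: `classify_docstring`, `SECTION_MARKERS` and `PLACEHOLDER` are hypothetical names, and the thresholds (10 characters, 10 and 20 words) are the ones stated in the design.

```python
import re

# Hypothetical stand-ins for the Google/NumPy section detection in DocstringExtractor.
SECTION_MARKERS = re.compile(
    r'Args:|Returns:|Raises:|Examples:|Note:|Parameters\n-+|Returns\n-+|Examples\n-+'
)
# Leading TODO/FIXME/TBD marker, the bare word "placeholder", or punctuation only.
PLACEHOLDER = re.compile(r'(todo|fixme|tbd)\b|placeholder\W*$|\W*$')


def classify_docstring(text):
    """Return 'missing', 'low', 'medium' or 'high' for a raw docstring."""
    stripped = (text or '').strip()
    if not stripped:
        return 'missing'
    # Placeholders and very short texts count as low quality.
    if PLACEHOLDER.match(stripped.lower()) or len(stripped) < 10:
        return 'low'
    words = len(stripped.split())
    # High quality needs both structure and enough descriptive text.
    if SECTION_MARKERS.search(stripped) and words >= 20:
        return 'high'
    return 'medium' if words >= 10 else 'low'
```

Running a handful of real docstrings from the target code base through such a function is a cheap way to check whether the thresholds separate the three strategies as intended before any LLM call is wired in.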
+
+### 3.2 Hybrid Strategy Engine
+
+```python
+import json
+from typing import Dict, List
+
+class HybridEnhancer:
+    """Hybrid docstring + LLM enhancer"""
+
+    def __init__(
+        self,
+        llm_enhancer: LLMEnhancer,
+        docstring_extractor: DocstringExtractor
+    ):
+        self.llm_enhancer = llm_enhancer
+        self.docstring_extractor = docstring_extractor
+
+    def enhance_with_strategy(
+        self,
+        file_data: FileData,
+        symbols: List[Symbol]
+    ) -> Dict[str, SemanticMetadata]:
+        """Choose an enhancement strategy based on docstring quality"""
+
+        results = {}
+
+        for symbol in symbols:
+            # 1. Extract and parse the docstring
+            raw_docstring = self.docstring_extractor.extract_from_code(
+                file_data.content, symbol
+            )
+            doc_metadata = self.docstring_extractor.parse_docstring(raw_docstring or "")
+
+            # 2. Choose a strategy based on quality
+            semantic_metadata = self._apply_strategy(
+                file_data, symbol, doc_metadata
+            )
+
+            results[symbol.name] = semantic_metadata
+
+        return results
+
+    def _apply_strategy(
+        self,
+        file_data: FileData,
+        symbol: Symbol,
+        doc_metadata: DocstringMetadata
+    ) -> SemanticMetadata:
+        """Apply the hybrid strategy"""
+
+        quality = doc_metadata.quality
+
+        if quality == DocstringQuality.HIGH:
+            # High quality: use the docstring directly; the LLM only generates keywords
+            return self._use_docstring_with_llm_keywords(symbol, doc_metadata)
+
+        elif quality == DocstringQuality.MEDIUM:
+            # Medium quality: let the LLM refine and enhance it
+            return self._refine_with_llm(file_data, symbol, doc_metadata)
+
+        else:  # LOW or MISSING
+            # Low quality or missing: generate everything with the LLM
+            return self._full_llm_generation(file_data, symbol)
+
+    def _use_docstring_with_llm_keywords(
+        self,
+        symbol: Symbol,
+        doc_metadata: DocstringMetadata
+    ) -> SemanticMetadata:
+        """Strategy 1: use the docstring; the LLM only generates keywords"""
+
+        # Use the docstring summary directly
+        summary = doc_metadata.summary or doc_metadata.raw_text[:200]
+
+        # Generate keywords with the LLM
+        keywords = self._generate_keywords_only(summary, symbol.name)
+
+        # Infer the purpose from the docstring
+        purpose = self._infer_purpose_from_docstring(doc_metadata)
+
+        return SemanticMetadata(
+            summary=summary,
+            keywords=keywords,
+            purpose=purpose,
+            file_path=getattr(symbol, 'file_path', None),
+            symbol_name=symbol.name,
+            llm_tool="hybrid_docstring_primary",
+        )
+
+    def _refine_with_llm(
+        self,
+        file_data: FileData,
+        symbol: Symbol,
+        doc_metadata: DocstringMetadata
+    ) -> SemanticMetadata:
+        """Strategy 2: let the LLM refine and enhance the docstring"""
+
+        # Build the code fence at runtime so the prompt holds no literal triple
+        # backtick, which would otherwise end this Markdown code block early
+        fence = "`" * 3
+        prompt = f"""
+PURPOSE: Refine and enhance an existing docstring for better semantic search
+TASK:
+- Review the existing docstring
+- Generate a concise summary (1-2 sentences) that captures the core purpose
+- Extract 8-12 relevant keywords for search
+- Identify the functional category/purpose
+
+EXISTING DOCSTRING:
+{doc_metadata.raw_text}
+
+CODE CONTEXT:
+Function: {symbol.name}
+{fence}{file_data.language}
+{self._get_symbol_code(file_data.content, symbol)}
+{fence}
+
+OUTPUT: JSON format
+{{
+    "summary": "refined summary based on docstring and code",
+    "keywords": ["keyword1", "keyword2", ...],
+    "purpose": "category"
+}}
+"""
+
+        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
+        if response['success']:
+            try:
+                data = json.loads(self.llm_enhancer._extract_json(response['stdout']))
+            except (json.JSONDecodeError, TypeError):
+                data = None  # Malformed LLM output: fall through to the docstring fallback
+            if isinstance(data, dict):
+                return SemanticMetadata(
+                    summary=data.get('summary', doc_metadata.summary),
+                    keywords=data.get('keywords', []),
+                    purpose=data.get('purpose', 'unknown'),
+                    file_path=file_data.path,
+                    symbol_name=symbol.name,
+                    llm_tool="hybrid_llm_refined",
+                )
+
+        # Fallback: use the docstring
+        return self._use_docstring_with_llm_keywords(symbol, doc_metadata)
+
+    def _full_llm_generation(
+        self,
+        file_data: FileData,
+        symbol: Symbol
+    ) -> SemanticMetadata:
+        """Strategy 3: generate everything with the LLM (the original flow)"""
+
+        # Reuse the existing LLM enhancer
+        code_snippet = self._get_symbol_code(file_data.content, symbol)
+
+        results = self.llm_enhancer.enhance_files([
+            FileData(
+                path=f"{file_data.path}:{symbol.name}",
+                content=code_snippet,
+                language=file_data.language
+            )
+        ])
+
+        return results.get(f"{file_data.path}:{symbol.name}", SemanticMetadata(
+            summary="",
+            keywords=[],
+            purpose="unknown",
+            file_path=file_data.path,
+            symbol_name=symbol.name,
+            llm_tool="hybrid_llm_full",
+        ))
+
+    def _generate_keywords_only(self, summary: str, symbol_name: str) -> List[str]:
+        """Generate keywords only (a quick LLM call)"""
+
+        prompt = f"""
+PURPOSE: Generate search keywords for a code function
+TASK: Extract 5-8 relevant keywords from the summary
+
+Summary: {summary}
+Function Name: {symbol_name}
+
+OUTPUT: Comma-separated keywords
+"""
+
+        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
+        if response['success']:
+            # Drop empty entries such as those left by a trailing comma
+            keywords = [k.strip() for k in response['stdout'].split(',') if k.strip()]
+            if keywords:
+                return keywords
+
+        # Fallback: extract keywords from the summary
+        return self._extract_keywords_heuristic(summary)
+
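+    def _extract_keywords_heuristic(self, text: str) -> List[str]:
+        """Heuristic keyword extraction (no LLM needed)"""
+
+        # Simple implementation: take lowercase words of four or more letters
+        import re
+        words = re.findall(r'\b[a-z]{4,}\b', text.lower())
+
+        # Filter out common words
+        stopwords = {'this', 'that', 'with', 'from', 'have', 'will', 'your', 'their'}
+        keywords = [w for w in words if w not in stopwords]
+
+        # dict.fromkeys deduplicates while keeping first-seen order (a set would not)
+        return list(dict.fromkeys(keywords))[:8]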
+
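+    def _infer_purpose_from_docstring(self, doc_metadata: DocstringMetadata) -> str:
+        """Infer the purpose from the docstring (no LLM needed)"""
+
+        # summary is Optional, so fall back to the raw text
+        summary = (doc_metadata.summary or doc_metadata.raw_text).lower()
+
+        # Simple rule matching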
+        if 'authenticate' in summary or 'login' in summary:
+            return 'auth'
+        elif 'validate' in summary or 'check' in summary:
+            return 'validation'
+        elif 'parse' in summary or 'format' in summary:
+            return 'data_processing'
+        elif 'api' in summary or 'endpoint' in summary:
+            return 'api'
+        elif 'database' in summary or 'query' in summary:
+            return 'data'
+        elif 'test' in summary:
+            return 'test'
+
+        return 'util'
+
+    def _get_symbol_code(self, content: str, symbol: Symbol) -> str:
+        """Extract the source code of a symbol"""
+
+        lines = content.splitlines()
+        start, end = symbol.range
+        return '\n'.join(lines[start-1:end])
+```
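The keywords-only path with its heuristic fallback can be exercised in isolation. The sketch below assumes the LLM is reachable through a plain callable that returns a comma-separated string or `None`; `keywords_for`, `heuristic_keywords` and the `llm` parameter are hypothetical names, not the `LLMEnhancer` API.

```python
import re
from typing import Callable, List, Optional

STOPWORDS = {'this', 'that', 'with', 'from', 'have', 'will', 'your', 'their'}


def heuristic_keywords(text: str, limit: int = 8) -> List[str]:
    """LLM-free fallback: words of 4+ letters, deduplicated in first-seen order."""
    words = re.findall(r'\b[a-z]{4,}\b', text.lower())
    unique = dict.fromkeys(w for w in words if w not in STOPWORDS)
    return list(unique)[:limit]


def keywords_for(
    summary: str,
    llm: Optional[Callable[[str], Optional[str]]] = None,
) -> List[str]:
    """Ask the (hypothetical) llm callable for keywords; fall back on failure."""
    reply = llm(summary) if llm is not None else None
    if reply:
        parsed = [k.strip() for k in reply.split(',') if k.strip()]
        if parsed:
            return parsed
    # No LLM, a failed call, or an empty reply: use the heuristic.
    return heuristic_keywords(summary)
```

Passing the LLM in as a callable keeps the fallback deterministic and makes the path testable without network access.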
+
+### 3.3 Cost Optimization Statistics
+
+```python
+@dataclass
+class EnhancementStats:
+    """Enhancement statistics"""
+    total_symbols: int = 0
+    used_docstring_only: int = 0      # Docstring only
+    llm_keywords_only: int = 0        # LLM generated keywords only
+    llm_refined: int = 0              # LLM refined the docstring
+    llm_full_generation: int = 0      # LLM generated everything
+    total_llm_calls: int = 0
+    estimated_cost_savings: float = 0.0  # Savings relative to using the LLM for everything
+
+class CostOptimizedEnhancer(HybridEnhancer):
+    """Enhancer with cost statistics"""
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.stats = EnhancementStats()
+
+    def enhance_with_strategy(
+        self,
+        file_data: FileData,
+        symbols: List[Symbol]
+    ) -> Dict[str, SemanticMetadata]:
+        """Enhance and record cost statistics"""
+
+        self.stats.total_symbols += len(symbols)
+        results = super().enhance_with_strategy(file_data, symbols)
+
+        # Count how often each strategy was used
+        for metadata in results.values():
+            if metadata.llm_tool == "hybrid_docstring_primary":
+                self.stats.used_docstring_only += 1
+                self.stats.llm_keywords_only += 1
+                self.stats.total_llm_calls += 1
+            elif metadata.llm_tool == "hybrid_llm_refined":
+                self.stats.llm_refined += 1
+                self.stats.total_llm_calls += 1
+            elif metadata.llm_tool == "hybrid_llm_full":
+                self.stats.llm_full_generation += 1
+                self.stats.total_llm_calls += 1
+
+        # Estimate savings relative to full LLM generation for every symbol.
+        # Assumed relative costs (see section 7.1): keywords-only 20%, refinement 60%,
+        # so each keywords-only call saves 0.8 units and each refinement saves 0.4
+        saved_units = self.stats.llm_keywords_only * 0.8 + self.stats.llm_refined * 0.4
+        self.stats.estimated_cost_savings = (
+            saved_units / self.stats.total_symbols if self.stats.total_symbols > 0 else 0.0
+        )
+
+        return results
+
+    def print_stats(self):
+        """Print the statistics"""
+
+        stats = self.stats
+
+        def pct(count: int) -> float:
+            # Guard against division by zero when no symbols were processed
+            return count / stats.total_symbols * 100 if stats.total_symbols else 0.0
+
+        print("=== Enhancement Statistics ===")
+        print(f"Total Symbols: {stats.total_symbols}")
+        print(f"Used Docstring (with LLM keywords): {stats.used_docstring_only} ({pct(stats.used_docstring_only):.1f}%)")
+        print(f"LLM Refined Docstring: {stats.llm_refined} ({pct(stats.llm_refined):.1f}%)")
+        print(f"LLM Full Generation: {stats.llm_full_generation} ({pct(stats.llm_full_generation):.1f}%)")
+        print(f"Total LLM Calls: {stats.total_llm_calls}")
+        print(f"Estimated Cost Savings: {stats.estimated_cost_savings*100:.1f}%")
+```
+
+## 4. Configuration Options
+
+```python
+@dataclass
+class HybridEnhancementConfig:
+    """Hybrid enhancement configuration"""
+
+    # Enable the hybrid strategy (False falls back to full-LLM mode)
+    enable_hybrid: bool = True
+
+    # Quality thresholds
+    use_docstring_threshold: DocstringQuality = DocstringQuality.HIGH
+    refine_docstring_threshold: DocstringQuality = DocstringQuality.MEDIUM
+
+    # Whether to generate keywords for high-quality docstrings
+    generate_keywords_for_docstring: bool = True
+
+    # LLM settings
+    llm_tool: str = "gemini"
+    llm_timeout: int = 300000
+
+    # Cost optimization
+    batch_size: int = 5              # Batch size
+    skip_test_files: bool = True     # Skip test files (they usually have few docstrings)
+
+    # Debug options
+    log_strategy_decisions: bool = False  # Log strategy decisions
+```
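The two threshold fields imply an ordering between quality levels, but a string-valued `Enum` has none. The sketch below shows one way the thresholds could drive strategy selection, assuming an explicit rank table; `_RANK` and `choose_strategy` are hypothetical and do not appear in the design.

```python
from enum import Enum


class DocstringQuality(Enum):
    MISSING = "missing"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical ranking: the enum values are strings and cannot be compared directly.
_RANK = {
    DocstringQuality.MISSING: 0,
    DocstringQuality.LOW: 1,
    DocstringQuality.MEDIUM: 2,
    DocstringQuality.HIGH: 3,
}


def choose_strategy(
    quality,
    use_threshold=DocstringQuality.HIGH,
    refine_threshold=DocstringQuality.MEDIUM,
    enable_hybrid=True,
):
    """Map a quality level to 'docstring', 'refine' or 'full_llm'."""
    if not enable_hybrid:
        return 'full_llm'  # hybrid disabled: always use the full-LLM flow
    if _RANK[quality] >= _RANK[use_threshold]:
        return 'docstring'
    if _RANK[quality] >= _RANK[refine_threshold]:
        return 'refine'
    return 'full_llm'
```

With the defaults this reproduces the HIGH / MEDIUM / LOW-or-MISSING split from section 3.2; lowering `use_threshold` to `MEDIUM` would trade some quality for fewer refinement calls.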
+
+## 5. Testing Strategy
+
+### 5.1 Unit Tests
+
+```python
+import pytest
+
+class TestDocstringExtractor:
+    """Tests for docstring extraction"""
+
+    def test_extract_google_style(self):
+        """Extract a Google-style docstring"""
+        code = '''
+def calculate_total(items, discount=0):
+    """Calculate total price with optional discount.
+
+    This function processes a list of items and applies
+    a discount if specified.
+
+    Args:
+        items (list): List of item objects with price attribute.
+        discount (float): Discount percentage (0-1). Defaults to 0.
+
+    Returns:
+        float: Total price after discount.
+
+    Examples:
+        >>> calculate_total([item1, item2], discount=0.1)
+        90.0
+    """
+    total = sum(item.price for item in items)
+    return total * (1 - discount)
+'''
+        extractor = DocstringExtractor()
+        symbol = Symbol(name='calculate_total', kind='function', range=(2, 20))  # line 1 of `code` is blank
+        docstring = extractor.extract_from_code(code, symbol)
+
+        assert docstring is not None
+        metadata = extractor.parse_docstring(docstring)
+
+        assert metadata.quality == DocstringQuality.HIGH
+        assert 'Calculate total price' in metadata.summary
+        assert metadata.parameters is not None
+        assert 'items' in metadata.parameters
+        assert metadata.returns is not None
+        assert metadata.examples is not None
+
+    def test_extract_low_quality_docstring(self):
+        """Recognize a low-quality docstring"""
+        code = '''
+def process():
+    """TODO"""
+    pass
+'''
+        extractor = DocstringExtractor()
+        symbol = Symbol(name='process', kind='function', range=(2, 4))  # line 1 of `code` is blank
+        docstring = extractor.extract_from_code(code, symbol)
+
+        metadata = extractor.parse_docstring(docstring)
+        assert metadata.quality == DocstringQuality.LOW
+
+class TestHybridEnhancer:
+    """Tests for the hybrid enhancer"""
+
+    def test_high_quality_docstring_strategy(self):
+        """Strategy used for a high-quality docstring"""
+
+        extractor = DocstringExtractor()
+        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
+        hybrid = HybridEnhancer(llm_enhancer, extractor)
+
+        # Simulate a high-quality docstring
+        doc_metadata = DocstringMetadata(
+            raw_text="Validate user credentials against database.",
+            quality=DocstringQuality.HIGH,
+            summary="Validate user credentials against database."
+        )
+
+        symbol = Symbol(name='validate_user', kind='function', range=(1, 10))
+
+        result = hybrid._use_docstring_with_llm_keywords(symbol, doc_metadata)
+
+        # The docstring summary should be used
+        assert result.summary == doc_metadata.summary
+        # There should be keywords (from the LLM or from the heuristic)
+        assert len(result.keywords) > 0
+
+    def test_cost_optimization(self):
+        """Effect of the cost optimization"""
+
+        enhancer = CostOptimizedEnhancer(
+            llm_enhancer=LLMEnhancer(LLMConfig(enabled=False)),  # Mock
+            docstring_extractor=DocstringExtractor()
+        )
+
+        # Simulate 10 symbols, 5 of which have a high-quality docstring
+        # Expected: 5 keywords-only calls and 5 full LLM generations
+        # 10 calls in total, but at lower cost (keywords calls are cheaper)
+
+        # A real test needs to mock the LLM calls
+        pass
+```
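The unit tests above note that a real test needs to mock the LLM calls. A minimal sketch of that idea with `unittest.mock` follows. It assumes only what the design already uses, namely that `_invoke_ccw_cli` returns a dict with `success` and `stdout` keys; `generate_keywords` is a hypothetical stand-in for the keywords-only path, not the real method.

```python
from unittest.mock import MagicMock


def generate_keywords(llm_enhancer, summary):
    """Stand-in for the keywords-only path: one CLI call, comma-separated reply."""
    response = llm_enhancer._invoke_ccw_cli(f"Keywords for: {summary}", tool='gemini')
    if response['success']:
        return [k.strip() for k in response['stdout'].split(',') if k.strip()]
    return []


def test_keywords_path_makes_one_llm_call():
    # MagicMock replaces the enhancer, so no real CLI or network call happens.
    fake = MagicMock()
    fake._invoke_ccw_cli.return_value = {
        'success': True,
        'stdout': 'auth, login, credentials\n',
    }

    keywords = generate_keywords(fake, 'Validate user credentials.')

    assert keywords == ['auth', 'login', 'credentials']
    assert fake._invoke_ccw_cli.call_count == 1


def test_failed_call_yields_no_keywords():
    fake = MagicMock()
    fake._invoke_ccw_cli.return_value = {'success': False, 'stdout': ''}
    assert generate_keywords(fake, 'Anything') == []
```

The same pattern would let `test_cost_optimization` assert on `call_count` per strategy instead of relying on a live LLM.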
+
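+### 5.2 Integration Tests
+
+```python
+class TestHybridEnhancementPipeline:
+    """Tests for the complete hybrid enhancement pipeline"""
+
+    def test_full_pipeline(self):
+        """Full pipeline: code -> docstring extraction -> quality evaluation -> strategy selection -> enhancement"""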
+
+        code = '''
+def authenticate_user(username, password):
+    """Authenticate user with username and password.
+
+    Args:
+        username (str): User's username
+        password (str): User's password
+
+    Returns:
+        bool: True if authenticated, False otherwise
+    """
+    # ... implementation
+    pass
+
+def helper_func(x):
+    # No docstring
+    return x * 2
+'''
+
+        file_data = FileData(path='auth.py', content=code, language='python')
+        symbols = [
+            # Line 1 of `code` is blank, so the first def starts on line 2
+            Symbol(name='authenticate_user', kind='function', range=(2, 13)),
+            Symbol(name='helper_func', kind='function', range=(15, 17)),
+        ]
+
+        extractor = DocstringExtractor()
+        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
+        hybrid = CostOptimizedEnhancer(llm_enhancer, extractor)
+
+        results = hybrid.enhance_with_strategy(file_data, symbols)
+
+        # authenticate_user should use the docstring
+        assert results['authenticate_user'].llm_tool == "hybrid_docstring_primary"
+
+        # helper_func should be generated entirely by the LLM
+        assert results['helper_func'].llm_tool == "hybrid_llm_full"
+
+        # Statistics
+        assert hybrid.stats.total_symbols == 2
+        assert hybrid.stats.used_docstring_only >= 1
+        assert hybrid.stats.llm_full_generation >= 1
+```
+
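+## 6. Implementation Roadmap
+
+### Phase 1: Infrastructure (1 week)
+- [x] Design the data structures (DocstringMetadata, DocstringQuality)
+- [ ] Implement DocstringExtractor (extraction and parsing)
+- [ ] Support Python docstrings (Google/NumPy/reStructuredText styles)
+- [ ] Unit tests
+
+### Phase 2: Quality Assessment (1 week)
+- [ ] Implement the quality assessment algorithm
+- [ ] Tune the heuristic rules
+- [ ] Test docstrings of varying quality
+- [ ] Adjust threshold parameters
+
+### Phase 3: Hybrid Strategy (1-2 weeks)
+- [ ] Implement HybridEnhancer
+- [ ] Implement the three strategies (docstring-only, refine, full-llm)
+- [ ] Strategy selection logic
+- [ ] Integration tests
+
+### Phase 4: Cost Optimization (1 week)
+- [ ] Implement CostOptimizedEnhancer
+- [ ] Statistics and monitoring
+- [ ] Batch processing optimization
+- [ ] Performance tests
+
+### Phase 5: Multi-Language Support (1-2 weeks)
+- [ ] JavaScript/TypeScript JSDoc
+- [ ] Java Javadoc
+- [ ] Docstring formats for other languages
+
+### Phase 6: Integration and Deployment (1 week)
+- [ ] Integrate into the existing llm_enhancer
+- [ ] Expose CLI options
+- [ ] Configuration file support
+- [ ] Documentation and examples
+
+**Total estimated time**: 6-8 weeks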
+
+## 7. Performance and Cost Analysis
+
+### 7.1 Expected Cost Savings
+
+Assumed scenario: analyzing 1000 functions.
+
+| Docstring quality | Share | LLM strategy | Relative cost |
+|------------------|------|------------|---------|
+| High (detailed docstring) | 30% | Generate keywords only | 20% |
+| Medium (basic docstring) | 40% | Refine and enhance | 60% |
+| Low/Missing | 30% | Full generation | 100% |
+
+**Total cost calculation**:
+- Pure LLM mode: 1000 * 100% = 1000 units
+- Hybrid mode: 300*20% + 400*60% + 300*100% = 60 + 240 + 300 = 600 units
+- **Savings**: 40%
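+
+The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: `hybrid_cost_units` is a hypothetical helper, and the shares and relative costs are the assumed values from the table, not measurements.
+
+```python
+# Hypothetical helper: total cost in "units", where 1 unit = one full LLM generation.
+def hybrid_cost_units(tiers):
+    """tiers maps a quality tier to (number of symbols, relative cost in percent)."""
+    return sum(count * percent for count, percent in tiers.values()) // 100
+
+# Assumed distribution for 1000 functions, taken from the table above
+tiers = {
+    "high": (300, 20),             # keywords only
+    "medium": (400, 60),           # refine and enhance
+    "low_or_missing": (300, 100),  # full generation
+}
+
+pure_llm_units = 1000
+hybrid_units = hybrid_cost_units(tiers)
+savings_percent = 100 - hybrid_units * 100 // pure_llm_units
+print(hybrid_units, savings_percent)
+```
+
+Changing the shares in `tiers` shows how sensitive the savings are to the real docstring quality distribution of a codebase.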
+
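+### 7.2 Quality Comparison
+
+| Metric | Pure LLM mode | Hybrid mode |
+|------|----------|---------|
+| Accuracy | Medium (may hallucinate) | **High** (the docstring is authoritative) |
+| Consistency | Medium (prompt-dependent) | **High** (preserves the author's style) |
+| Coverage | **High** (full coverage) | High (98%+) |
+| Cost | High | **Low** (about 40% savings) |
+| Speed | Slow (every file goes through the LLM) | **Fast** (fewer LLM calls) |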
+
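+## 8. Potential Problems and Solutions
+
+### 8.1 Problem: Stale Docstrings
+
+**Symptom**: The code has changed but the docstring was not updated, so the extracted information is inaccurate.
+
+**Solution**: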
+```python
+class DocstringFreshnessChecker:
+    """Checks whether a docstring is consistent with the code."""
+
+    def check_freshness(
+        self,
+        symbol: Symbol,
+        code: str,
+        doc_metadata: DocstringMetadata
+    ) -> bool:
+        """Return True if the docstring still matches the code."""
+
+        # Check 1: does the parameter list match?
+        if doc_metadata.parameters:
+            # Compare as sets so ordering and container type do not matter
+            actual_params = set(self._extract_actual_parameters(code))
+            documented_params = set(doc_metadata.parameters.keys())
+
+            if actual_params != documented_params:
+                logger.warning(
+                    f"Parameter mismatch in {symbol.name}: "
+                    f"code has {actual_params}, doc has {documented_params}"
+                )
+                return False
+
+        # Check 2: use the LLM to verify consistency
+        # TODO: build the verification prompt
+
+        return True
+```
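+
+The checker above calls `_extract_actual_parameters`, which the document does not define. One possible way to obtain the parameter names for Python code is the standard-library `ast` module; the sketch below is an assumption about how that helper could work, and `extract_parameter_names` is a hypothetical name.
+
+```python
+import ast
+
+def extract_parameter_names(code: str) -> set:
+    """Return the parameter names of the first function found in `code`."""
+    tree = ast.parse(code)
+    for node in ast.walk(tree):
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+            args = node.args
+            names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
+            if args.vararg:
+                names.append(args.vararg.arg)
+            if args.kwarg:
+                names.append(args.kwarg.arg)
+            # `self` and `cls` are conventionally left out of docstrings
+            return {n for n in names if n not in ("self", "cls")}
+    return set()
+
+code = "def authenticate_user(self, username, password, *, remember=False):\n    pass\n"
+documented = {"username", "password"}
+actual = extract_parameter_names(code)
+missing_from_doc = actual - documented
+print(missing_from_doc)
+```
+
+A non-empty difference in either direction is a cheap signal that the docstring may be stale, before any LLM-based check is attempted.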
+
+### 8.2 Problem: Mixed Docstring Styles
+
+**Symptom**: A single project uses several docstring styles (Google, NumPy, custom).
+
+**Solution**:
+```python
+class MultiStyleDocstringParser:
+    """Parser that supports multiple docstring styles."""
+
+    def parse(self, docstring: str) -> DocstringMetadata:
+        """Detect the style automatically and parse accordingly."""
+
+        # Try each parser in turn
+        for parser in [
+            GoogleStyleParser(),
+            NumpyStyleParser(),
+            ReStructuredTextParser(),
+            SimpleParser(),  # Fallback
+        ]:
+            try:
+                metadata = parser.parse(docstring)
+                if metadata.quality != DocstringQuality.LOW:
+                    return metadata
+            except Exception:
+                continue
+
+        # If no parser produced a usable result, return the simple parse
+        return SimpleParser().parse(docstring)
+```
+
+### 8.3 Problem: Docstring Extraction Differs Across Languages
+
+**Symptom**: Docstring format and placement differ from language to language.
+
+**Solution**:
+```python
+class LanguageSpecificExtractor:
+    """Language-specific docstring extractor."""
+
+    def extract(self, language: str, code: str, symbol: Symbol) -> Optional[str]:
+        """Pick the appropriate extractor for the language."""
+
+        extractors = {
+            'python': PythonDocstringExtractor(),
+            'javascript': JSDocExtractor(),
+            'typescript': TSDocExtractor(),
+            'java': JavadocExtractor(),
+        }
+
+        extractor = extractors.get(language, GenericExtractor())
+        return extractor.extract(code, symbol)
+
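+class JSDocExtractor:
+    """JSDoc extractor for JavaScript/TypeScript."""
+
+    def extract(self, code: str, symbol: Symbol) -> Optional[str]:
+        """Extract the JSDoc comment."""
+
+        lines = code.splitlines()
+        start_line = symbol.range[0] - 1  # 0-based index of the symbol's first line
+
+        # Search upward (at most 20 lines) for a /** ... */ comment.
+        # The stop value is exclusive, so use -1 to make sure index 0 is checked.
+        for i in range(start_line - 1, max(-1, start_line - 21), -1):
+            if '*/' in lines[i]:
+                # Found the closing marker; extract the block above it
+                return self._extract_jsdoc_block(lines, i)
+
+        return None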
+```
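+
+`_extract_jsdoc_block` is referenced above but not shown. A minimal sketch of what it might do, assuming the block opens with `/**` and that the leading `*` gutter should be stripped, is given below; `extract_jsdoc_block` is a hypothetical standalone version, not the project's implementation.
+
+```python
+from typing import List, Optional
+
+def extract_jsdoc_block(lines: List[str], end_index: int) -> Optional[str]:
+    """Walk upward from the line containing '*/' to the matching '/**'."""
+    for start in range(end_index, -1, -1):
+        if "/**" not in lines[start]:
+            continue
+        cleaned = []
+        for line in lines[start:end_index + 1]:
+            text = line.strip()
+            # Drop the comment delimiters and the leading '*' gutter
+            if text.startswith("/**"):
+                text = text[3:]
+            if text.endswith("*/"):
+                text = text[:-2]
+            text = text.strip()
+            if text.startswith("*"):
+                text = text[1:].strip()
+            if text:
+                cleaned.append(text)
+        return "\n".join(cleaned)
+    return None  # no opening marker found
+
+source = [
+    "/**",
+    " * Adds two numbers.",
+    " * @param {number} a",
+    " */",
+    "function add(a, b) { return a + b; }",
+]
+doc = extract_jsdoc_block(source, 3)
+print(doc)
+```
+
+A production extractor would probably also stop at the first non-comment line, so that a JSDoc block belonging to an earlier symbol is not picked up by mistake.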
+
+## 9. Configuration Examples
+
+### 9.1 Configuration File
+
+```yaml
+# .codexlens/hybrid_enhancement.yaml
+
+hybrid_enhancement:
+  enabled: true
+
+  # Quality thresholds
+  quality_thresholds:
+    use_docstring: high      # high/medium/low
+    refine_docstring: medium
+
+  # LLM options
+  llm:
+    tool: gemini
+    fallback: qwen
+    timeout_ms: 300000
+    batch_size: 5
+
+  # Cost optimization
+  cost_optimization:
+    generate_keywords_for_docstring: true
+    skip_test_files: true
+    skip_private_methods: false
+
+  # Language support
+  languages:
+    python:
+      styles: [google, numpy, sphinx]
+    javascript:
+      styles: [jsdoc]
+    java:
+      styles: [javadoc]
+
+  # Monitoring
+  logging:
+    log_strategy_decisions: false
+    log_cost_savings: true
+```
+
+### 9.2 CLI Usage
+
+```bash
+# Enhance using the hybrid strategy
+codex-lens enhance . --hybrid --tool gemini
+
+# Show cost statistics
+codex-lens enhance . --hybrid --show-stats
+
+# Generate keywords only, for symbols with high-quality docstrings
+codex-lens enhance . --hybrid --keywords-only
+
+# Disable hybrid mode and fall back to pure LLM
+codex-lens enhance . --no-hybrid --tool gemini
+```
+
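+## 10. Success Metrics
+
+1. **Cost savings**: API call cost reduced by 40%+ compared with pure LLM mode
+2. **Accuracy**: metadata accuracy above 95% for symbols that use their docstring
+3. **Coverage**: 98%+ of symbols have semantic metadata (from a docstring or LLM-generated)
+4. **Speed**: overall processing speed improved by 30%+ (fewer LLM calls)
+5. **User satisfaction**: docstring information is preserved, so developers trust the results
+
+## 11. References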
+
+- [PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
+- [Google Python Style Guide - Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
+- [NumPy Docstring Standard](https://numpydoc.readthedocs.io/en/latest/format.html)
+- [JSDoc Documentation](https://jsdoc.app/)
+- [Javadoc Tool](https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html)
--- a/codex-lens/docs/MULTILEVEL_CHUNKER_DESIGN.md
+++ b/codex-lens/docs/MULTILEVEL_CHUNKER_DESIGN.md
@@ -0,0 +1,973 @@
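+# Multi-Level Chunker Design
+
+## 1. Background and Goals
+
+### 1.1 Current Problems
+
+Both chunking strategies in the current `chunker.py` have clear shortcomings:
+
+**symbol-based strategy**:
+- ✅ Pro: preserves logical integrity; each chunk is a complete function or class
+- ❌ Con: uneven granularity; very large functions can run to hundreds of lines, which hurts LLM processing and search precision
+
+**sliding-window strategy**:
+- ✅ Pro: uniform chunk size and full coverage
+- ❌ Con: breaks logical structure; a complete loop or conditional block may be cut in half
+
+### 1.2 Design Goals
+
+Implement a multi-level chunker that satisfies all of the following:
+1. **Semantic integrity**: keep the logical boundaries of the code intact
+2. **Controllable granularity**: support flexible splitting from coarse (function level) to fine (logic-block level)
+3. **Hierarchy**: preserve parent-child relationships between chunks to support contextual retrieval
+4. **Efficient indexing**: optimize vectorization and retrieval performance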
+
+## 2. Technical Architecture
+
+### 2.1 Two-Layer Chunking Architecture
+
+```
+Source Code
+    ↓
+[Layer 1: Symbol-Level Chunking]  ← uses the tree-sitter AST
+    ↓
+MacroChunks (Functions/Classes)
+    ↓
+[Layer 2: Logic-Block Chunking]   ← deep AST traversal
+    ↓
+MicroChunks (Loops/Conditionals/Blocks)
+    ↓
+Vector Embedding + Indexing
+```
+
+### 2.2 Core Components
+
+```python
+# New data structures
+@dataclass
+class ChunkMetadata:
+    """Metadata for a chunk."""
+    chunk_id: str
+    parent_id: Optional[str]  # ID of the parent chunk
+    level: int                 # level: 1=macro, 2=micro
+    chunk_type: str           # function/class/loop/conditional/try_except
+    file_path: str
+    start_line: int
+    end_line: int
+    symbol_name: Optional[str]
+    context_summary: Optional[str]  # context inherited from the parent chunk
+
+@dataclass
+class HierarchicalChunk:
+    """A code block with hierarchy information."""
+    metadata: ChunkMetadata
+    content: str
+    embedding: Optional[List[float]] = None
+    children: List['HierarchicalChunk'] = field(default_factory=list)
+```
+
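+## 3. Detailed Implementation Steps
+
+### 3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)
+
+**Approach**: Reuse the existing `code_extractor.py` logic and extend its metadata extraction.
+
+```python
+class MacroChunker:
+    """Layer-1 chunker: extracts top-level symbols."""
+
+    def __init__(self):
+        self.parser = Parser()
+        # Load the language grammar
+
+    def chunk_by_symbols(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Extract top-level function and class definitions."""
+        tree = self.parser.parse(bytes(content, 'utf-8'))
+        root_node = tree.root_node
+
+        chunks = []
+        for node in root_node.children:
+            # The Python grammar wraps decorated symbols in a decorated_definition
+            # node; unwrap it so decorated functions/classes are not skipped
+            if node.type == 'decorated_definition':
+                node = node.child_by_field_name('definition') or node
+            if node.type in ['function_definition', 'class_definition',
+                           'method_definition']:
+                chunk = self._create_macro_chunk(node, content, file_path)
+                chunks.append(chunk)
+
+        return chunks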
+
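+    def _create_macro_chunk(
+        self,
+        node,
+        content: str,
+        file_path: str
+    ) -> HierarchicalChunk:
+        """Create a macro chunk from an AST node."""
+        start_line = node.start_point[0] + 1
+        end_line = node.end_point[0] + 1
+
+        # Extract the symbol name. tree-sitter offsets are byte offsets,
+        # so slice the UTF-8 bytes rather than the str
+        name_node = node.child_by_field_name('name')
+        source_bytes = content.encode('utf-8')
+        symbol_name = source_bytes[name_node.start_byte:name_node.end_byte].decode('utf-8')
+
+        # Extract the full code (including docstring and decorators)
+        chunk_content = self._extract_with_context(node, content)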
+
+        metadata = ChunkMetadata(
+            chunk_id=f"{file_path}:{start_line}",
+            parent_id=None,
+            level=1,
+            chunk_type=node.type,
+            file_path=file_path,
+            start_line=start_line,
+            end_line=end_line,
+            symbol_name=symbol_name,
+        )
+
+        return HierarchicalChunk(
+            metadata=metadata,
+            content=chunk_content,
+        )
+
+    def _extract_with_context(self, node, content: str) -> str:
+        """Extract the code, including decorators and docstring."""
+        # Walk backwards over preceding decorator siblings
+        start_byte = node.start_byte
+        prev_sibling = node.prev_sibling
+        while prev_sibling and prev_sibling.type == 'decorator':
+            start_byte = prev_sibling.start_byte
+            prev_sibling = prev_sibling.prev_sibling
+
+        # Byte offsets index into the UTF-8 encoding, not the str
+        return content.encode('utf-8')[start_byte:node.end_byte].decode('utf-8')
+```
+
+### 3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)
+
+**Approach**: Inside each macro chunk, split further along logical structure.
+
+```python
+class MicroChunker:
+    """Layer-2 chunker: extracts logic blocks."""
+
+    # Node types that are split out as logic blocks
+    LOGIC_BLOCK_TYPES = {
+        'for_statement',
+        'while_statement',
+        'if_statement',
+        'try_statement',
+        'with_statement',
+    }
+
+    def __init__(self):
+        self.parser = Parser()
+        # Load the language grammar (same setup as MacroChunker)
+
+    def chunk_logic_blocks(
+        self,
+        macro_chunk: HierarchicalChunk,
+        content: str,
+        max_lines: int = 50  # only macro chunks longer than this are split again
+    ) -> List[HierarchicalChunk]:
+        """Extract logic blocks inside a macro chunk."""
+
+        # Small functions do not need a second pass
+        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line + 1
+        if total_lines <= max_lines:
+            return []
+
+        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
+        root_node = tree.root_node
+
+        micro_chunks = []
+        self._traverse_logic_blocks(
+            root_node,
+            macro_chunk,
+            content,
+            micro_chunks
+        )
+
+        return micro_chunks
+
+    def _traverse_logic_blocks(
+        self,
+        node,
+        parent_chunk: HierarchicalChunk,
+        content: str,
+        result: List[HierarchicalChunk]
+    ):
+        """Recursively traverse the AST and extract logic blocks."""
+
+        if node.type in self.LOGIC_BLOCK_TYPES:
+            micro_chunk = self._create_micro_chunk(
+                node,
+                parent_chunk,
+                content
+            )
+            result.append(micro_chunk)
+            parent_chunk.children.append(micro_chunk)
+
+        # Keep traversing child nodes (nested blocks also attach to the macro chunk)
+        for child in node.children:
+            self._traverse_logic_blocks(child, parent_chunk, content, result)
+
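+    def _create_micro_chunk(
+        self,
+        node,
+        parent_chunk: HierarchicalChunk,
+        content: str
+    ) -> HierarchicalChunk:
+        """Create a micro chunk."""
+
+        # Line numbers relative to the file (node positions are relative to the parent chunk's text)
+        start_line = parent_chunk.metadata.start_line + node.start_point[0]
+        end_line = parent_chunk.metadata.start_line + node.end_point[0]
+
+        # The tree was parsed from parent_chunk.content, so byte offsets index into
+        # that text (as UTF-8 bytes), not into the full file content
+        parent_bytes = parent_chunk.content.encode('utf-8')
+        chunk_content = parent_bytes[node.start_byte:node.end_byte].decode('utf-8')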
+
+        metadata = ChunkMetadata(
+            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
+            parent_id=parent_chunk.metadata.chunk_id,
+            level=2,
+            chunk_type=node.type,
+            file_path=parent_chunk.metadata.file_path,
+            start_line=start_line,
+            end_line=end_line,
+            symbol_name=parent_chunk.metadata.symbol_name,  # inherit the parent symbol name
+            context_summary=None,  # filled in later by the LLM
+        )
+
+        return HierarchicalChunk(
+            metadata=metadata,
+            content=chunk_content,
+        )
+```
+
+### 3.3 Unified Interface: the Multi-Level Chunker
+
+```python
+class HierarchicalChunker:
+    """Unified interface for multi-level chunking."""
+
+    def __init__(self, config: ChunkConfig = None):
+        self.config = config or ChunkConfig()
+        self.macro_chunker = MacroChunker()
+        self.micro_chunker = MicroChunker()
+
+    def chunk_file(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Run multi-level chunking on a file."""
+
+        # Layer 1: symbol-level chunking
+        macro_chunks = self.macro_chunker.chunk_by_symbols(
+            content, file_path, language
+        )
+
+        # Layer 2: logic-block chunking
+        all_chunks = []
+        for macro_chunk in macro_chunks:
+            all_chunks.append(macro_chunk)
+
+            # Split large functions a second time
+            micro_chunks = self.micro_chunker.chunk_logic_blocks(
+                macro_chunk, content
+            )
+            all_chunks.extend(micro_chunks)
+
+        return all_chunks
+
+    def chunk_file_with_fallback(
+        self,
+        content: str,
+        file_path: str,
+        language: str
+    ) -> List[HierarchicalChunk]:
+        """Chunk with a fallback strategy."""
+
+        try:
+            return self.chunk_file(content, file_path, language)
+        except Exception as e:
+            logger.warning(f"Hierarchical chunking failed: {e}, falling back to sliding window")
+            # Fall back to the sliding-window strategy
+            return self._fallback_sliding_window(content, file_path, language)
+```
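+
+To make the two-layer idea concrete without a tree-sitter setup, the sketch below applies the same scheme to Python source using the standard-library `ast` module. This swaps in `ast` for tree-sitter purely for illustration; `two_level_chunks` and its tuple output are hypothetical and not part of the design above.
+
+```python
+import ast
+
+# Python AST counterparts of the logic-block node types listed earlier
+LOGIC_BLOCKS = (ast.For, ast.While, ast.If, ast.Try, ast.With)
+
+def two_level_chunks(source: str, min_lines: int = 0):
+    """Return (level, kind, symbol, start_line, end_line) tuples."""
+    chunks = []
+    for node in ast.parse(source).body:
+        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
+            continue
+        # Level 1: the whole top-level symbol
+        chunks.append((1, type(node).__name__, node.name, node.lineno, node.end_lineno))
+        if node.end_lineno - node.lineno + 1 <= min_lines:
+            continue  # small symbols are not split further
+        # Level 2: every logic block nested anywhere inside the symbol
+        for child in ast.walk(node):
+            if isinstance(child, LOGIC_BLOCKS):
+                chunks.append(
+                    (2, type(child).__name__, node.name, child.lineno, child.end_lineno)
+                )
+    return chunks
+
+sample = (
+    "def process_items(items):\n"
+    "    results = []\n"
+    "    for item in items:\n"
+    "        if item.active:\n"
+    "            results.append(item)\n"
+    "    return results\n"
+)
+chunks = two_level_chunks(sample)
+for chunk in chunks:
+    print(chunk)
+```
+
+With `min_lines=50` the same call returns only the level-1 chunk, which mirrors the `max_lines` threshold used by `MicroChunker`.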
+
+## 4. Data Storage Design
+
+### 4.1 Database Schema
+
+```sql
+-- chunks table: stores chunks of every level
+CREATE TABLE chunks (
+    chunk_id TEXT PRIMARY KEY,
+    parent_id TEXT,           -- parent chunk ID, NULL for top-level chunks
+    level INTEGER NOT NULL,   -- 1=macro, 2=micro
+    chunk_type TEXT NOT NULL, -- function/class/loop/if/try, etc.
+    file_path TEXT NOT NULL,
+    start_line INTEGER NOT NULL,
+    end_line INTEGER NOT NULL,
+    symbol_name TEXT,
+    content TEXT NOT NULL,
+    content_hash TEXT,        -- used to detect content changes
+
+    -- Semantic metadata (generated by the LLM)
+    summary TEXT,
+    keywords TEXT,            -- JSON array
+    purpose TEXT,
+
+    -- Vector embedding
+    embedding BLOB,           -- serialized vector
+
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+
+    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
+);
+
+-- Indexes
+CREATE INDEX idx_chunks_file_path ON chunks(file_path);
+CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
+CREATE INDEX idx_chunks_level ON chunks(level);
+CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);
+```
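+
+One detail worth noting about the schema: SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is issued on each connection, so the `ON DELETE CASCADE` clause has no effect by default. The sketch below demonstrates this with a reduced version of the table (only three columns, chosen for illustration) and assumes a SQLite build with foreign-key support, which is the usual case for Python's bundled `sqlite3`.
+
+```python
+import sqlite3
+
+conn = sqlite3.connect(":memory:")
+# Foreign-key enforcement is off by default and must be enabled per connection
+conn.execute("PRAGMA foreign_keys = ON")
+conn.execute("""
+    CREATE TABLE chunks (
+        chunk_id TEXT PRIMARY KEY,
+        parent_id TEXT,
+        level INTEGER NOT NULL,
+        FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
+    )
+""")
+conn.execute("INSERT INTO chunks VALUES ('a.py:1', NULL, 1)")
+conn.execute("INSERT INTO chunks VALUES ('a.py:1:L3', 'a.py:1', 2)")
+
+# Deleting the macro chunk also removes its micro chunk through the cascade
+conn.execute("DELETE FROM chunks WHERE chunk_id = 'a.py:1'")
+remaining = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
+print(remaining)
+```
+
+Without the pragma, the micro chunk row would be left behind as an orphan after the delete.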
+
+### 4.2 Vector Index
+
+A layered indexing strategy is used:
+
+```python
+class HierarchicalVectorStore:
+    """Hierarchical vector store."""
+
+    def __init__(self, db_path: Path):
+        self.db_path = db_path
+        self.conn = sqlite3.connect(db_path)
+
+    def add_chunk(self, chunk: HierarchicalChunk):
+        """Add a chunk together with its vector."""
+
+        cursor = self.conn.cursor()
+        cursor.execute("""
+            INSERT INTO chunks (
+                chunk_id, parent_id, level, chunk_type,
+                file_path, start_line, end_line, symbol_name,
+                content, embedding
+            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        """, (
+            chunk.metadata.chunk_id,
+            chunk.metadata.parent_id,
+            chunk.metadata.level,
+            chunk.metadata.chunk_type,
+            chunk.metadata.file_path,
+            chunk.metadata.start_line,
+            chunk.metadata.end_line,
+            chunk.metadata.symbol_name,
+            chunk.content,
+            self._serialize_embedding(chunk.embedding),
+        ))
+
+        self.conn.commit()
+
+    def search_hierarchical(
+        self,
+        query_embedding: List[float],
+        top_k: int = 10,
+        level_weights: Dict[int, float] = None
+    ) -> List[Tuple[HierarchicalChunk, float]]:
+        """Hierarchical retrieval."""
+
+        # Default weights: macro chunks are weighted higher
+        if level_weights is None:
+            level_weights = {1: 1.0, 2: 0.8}
+
+        # Fetch every chunk that has an embedding (brute-force scan)
+        cursor = self.conn.cursor()
+        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")
+
+        results = []
+        for row in cursor.fetchall():
+            chunk = self._row_to_chunk(row)
+            similarity = self._cosine_similarity(
+                query_embedding,
+                chunk.embedding
+            )
+
+            # Apply the weight for the chunk's level
+            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
+            results.append((chunk, weighted_score))
+
+        # Sort by score, highest first
+        results.sort(key=lambda x: x[1], reverse=True)
+        return results[:top_k]
+
+    def get_chunk_with_context(
+        self,
+        chunk_id: str
+    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
+        """Return (chunk, parent); the parent chunk provides context."""
+
+        cursor = self.conn.cursor()
+
+        # Fetch the chunk itself
+        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
+        chunk_row = cursor.fetchone()
+        chunk = self._row_to_chunk(chunk_row)
+
+        # Fetch the parent chunk
+        parent = None
+        if chunk.metadata.parent_id:
+            cursor.execute(
+                "SELECT * FROM chunks WHERE chunk_id = ?",
+                (chunk.metadata.parent_id,)
+            )
+            parent_row = cursor.fetchone()
+            if parent_row:
+                parent = self._row_to_chunk(parent_row)
+
+        return chunk, parent
+```
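+
+The store above relies on `_serialize_embedding` and `_cosine_similarity`, neither of which is shown. The following sketch illustrates one way such helpers could be written with the standard library only, assuming embeddings are stored as little-endian float32 BLOBs; the function names and the storage format are assumptions, not the project's actual choices.
+
+```python
+import math
+import struct
+from typing import List
+
+def serialize_embedding(vec: List[float]) -> bytes:
+    """Pack a vector as little-endian float32 values (4 bytes each)."""
+    return struct.pack(f"<{len(vec)}f", *vec)
+
+def deserialize_embedding(blob: bytes) -> List[float]:
+    """Inverse of serialize_embedding."""
+    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
+
+def cosine_similarity(a: List[float], b: List[float]) -> float:
+    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm if norm else 0.0
+
+vec = [1.0, 0.0, 0.5]
+blob = serialize_embedding(vec)
+restored = deserialize_embedding(blob)
+score = cosine_similarity([1.0, 0.0], [0.0, 1.0])
+print(len(blob), restored, score)
+```
+
+Storing float32 instead of Python's native float64 halves the BLOB size at the cost of a small precision loss, which is usually acceptable for similarity search.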
+
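+## 5. LLM Integration Strategy
+
+### 5.1 Layered Generation of Semantic Metadata
+
+```python
+class HierarchicalLLMEnhancer:
+    """Generates semantic metadata for hierarchical chunks."""
+
+    def enhance_hierarchical_chunks(
+        self,
+        chunks: List[HierarchicalChunk]
+    ) -> Dict[str, SemanticMetadata]:
+        """
+        Layered processing strategy:
+        1. First process all level=1 macro chunks and generate detailed summaries
+        2. Then process level=2 micro chunks, using the parent chunk summary as context
+        """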
+
+        results = {}
+
+        # Pass 1: process macro chunks
+        macro_chunks = [c for c in chunks if c.metadata.level == 1]
+        macro_metadata = self.llm_enhancer.enhance_files([
+            FileData(
+                path=c.metadata.chunk_id,
+                content=c.content,
+                language=self._detect_language(c.metadata.file_path)
+            )
+            for c in macro_chunks
+        ])
+        results.update(macro_metadata)
+
+        # Pass 2: process micro chunks with the parent summary as context
+        fence = "`" * 3  # built at runtime so this listing contains no nested code fence
+        micro_chunks = [c for c in chunks if c.metadata.level == 2]
+        for micro_chunk in micro_chunks:
+            parent_id = micro_chunk.metadata.parent_id
+            # Values are SemanticMetadata objects, so read the attribute rather than a dict key
+            parent_meta = macro_metadata.get(parent_id)
+            parent_summary = parent_meta.summary if parent_meta else ''
+
+            # Build a prompt that carries the parent context
+            enhanced_prompt = f"""
+Parent Function: {micro_chunk.metadata.symbol_name}
+Parent Summary: {parent_summary}
+
+Code Block ({micro_chunk.metadata.chunk_type}):
+{fence}
+{micro_chunk.content}
+{fence}
+
+Generate a concise summary (1 sentence) and keywords for this specific code block.
+"""
+
+            metadata = self._call_llm_with_context(enhanced_prompt)
+            results[micro_chunk.metadata.chunk_id] = metadata
+
+        return results
+```
+
+### 5.2 Prompt Optimization
+
+Different prompt templates are used for the different levels:
+
+**Macro Chunk Prompt (Level 1)**:
+````
+PURPOSE: Generate comprehensive semantic metadata for a complete function/class
+TASK:
+- Provide a detailed summary (2-3 sentences) covering what the code does and why
+- Extract 8-12 relevant keywords including technical terms and domain concepts
+- Identify the primary purpose/category
+MODE: analysis
+
+CODE:
+```{language}
+{content}
+```
+
+OUTPUT: JSON with summary, keywords, purpose
+````
+
+**Micro Chunk Prompt (Level 2)**:
+````
+PURPOSE: Summarize a specific logic block within a larger function
+CONTEXT:
+- Parent Function: {symbol_name}
+- Parent Purpose: {parent_summary}
+
+TASK:
+- Provide a brief summary (1 sentence) of this specific block's role in the parent function
+- Extract 3-5 keywords specific to this block's logic
+MODE: analysis
+
+CODE BLOCK ({chunk_type}):
+```{language}
+{content}
+```
+
+OUTPUT: JSON with summary, keywords
+````
+
+## 6. Retrieval Enhancement
+
+### 6.1 Context-Expanded Retrieval
+
+```python
+class ContextualSearchEngine:
+    """Search engine with context expansion."""
+
+    def search_with_context(
+        self,
+        query: str,
+        top_k: int = 10,
+        expand_context: bool = True
+    ) -> List[SearchResult]:
+        """
+        Search and automatically expand context.
+
+        When a micro chunk matches, its parent macro chunk is returned as context.
+        """
+
+        # Embed the query
+        query_embedding = self.embedder.embed_single(query)
+
+        # Hierarchical retrieval
+        raw_results = self.vector_store.search_hierarchical(
+            query_embedding,
+            top_k=top_k
+        )
+
+        # Expand context
+        enriched_results = []
+        for chunk, score in raw_results:
+            result = SearchResult(
+                path=chunk.metadata.file_path,
+                score=score,
+                content=chunk.content,
+                start_line=chunk.metadata.start_line,
+                end_line=chunk.metadata.end_line,
+                symbol_name=chunk.metadata.symbol_name,
+            )
+
+            # For a micro chunk, fetch the parent chunk as context
+            if expand_context and chunk.metadata.level == 2:
+                # get_chunk_with_context returns (chunk, parent)
+                _, parent_chunk = self.vector_store.get_chunk_with_context(
+                    chunk.metadata.chunk_id
+                )
+                if parent_chunk:
+                    result.metadata['parent_context'] = {
+                        'summary': parent_chunk.metadata.context_summary,
+                        'symbol_name': parent_chunk.metadata.symbol_name,
+                        'content': parent_chunk.content,
+                    }
+
+            enriched_results.append(result)
+
+        return enriched_results
+```
+
+## 7. Testing Strategy
+
+### 7.1 Unit Tests
+
+```python
+import pytest
+from codexlens.semantic.hierarchical_chunker import (
+    HierarchicalChunker, MacroChunker, MicroChunker
+)
+
+class TestMacroChunker:
+    """Tests for layer-1 chunking."""
+
+    def test_extract_functions(self):
+        """Function definitions are extracted."""
+        code = '''
+def calculate_total(items):
+    """Calculate total price."""
+    total = 0
+    for item in items:
+        total += item.price
+    return total
+
+def apply_discount(total, discount):
+    """Apply discount to total."""
+    return total * (1 - discount)
+'''
+        chunker = MacroChunker()
+        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        assert len(chunks) == 2
+        assert chunks[0].metadata.symbol_name == 'calculate_total'
+        assert chunks[1].metadata.symbol_name == 'apply_discount'
+        assert chunks[0].metadata.level == 1
+
+    def test_extract_with_decorators(self):
+        """Decorated functions are extracted together with their decorators."""
+        code = '''
+@app.route('/api/users')
+@auth_required
+def get_users():
+    return User.query.all()
+'''
+        chunker = MacroChunker()
+        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        assert len(chunks) == 1
+        assert '@app.route' in chunks[0].content
+        assert '@auth_required' in chunks[0].content
+
+class TestMicroChunker:
+    """Tests for layer-2 chunking."""
+
+    def test_extract_loop_blocks(self):
+        """Loop blocks are extracted."""
+        code = '''
+def process_items(items):
+    results = []
+    for item in items:
+        if item.active:
+            results.append(process(item))
+    return results
+'''
+        macro_chunker = MacroChunker()
+        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        micro_chunker = MicroChunker()
+        # max_lines=0 forces splitting; with the default threshold (50) this
+        # 6-line function would be skipped and the assertions below would fail
+        micro_chunks = micro_chunker.chunk_logic_blocks(
+            macro_chunks[0], code, max_lines=0
+        )
+
+        # The for loop (and the nested if block) should be extracted
+        assert len(micro_chunks) >= 1
+        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)
+
+    def test_skip_small_functions(self):
+        """Small functions skip the second pass."""
+        code = '''
+def small_func(x):
+    return x * 2
+'''
+        macro_chunker = MacroChunker()
+        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')
+
+        micro_chunker = MicroChunker()
+        micro_chunks = micro_chunker.chunk_logic_blocks(
+            macro_chunks[0], code, max_lines=10
+        )
+
+        # A small function should not be split again
+        assert len(micro_chunks) == 0
+
+class TestHierarchicalChunker:
+    """Tests for the complete multi-level chunker."""
+
+    def test_full_hierarchical_chunking(self):
+        """Full hierarchical chunking flow."""
+        code = '''
+def complex_function(data):
+    """A complex function with multiple logic blocks."""
+
+    # Validation
+    if not data:
+        raise ValueError("Data is empty")
+
+    # Processing
+    results = []
+    for item in data:
+        try:
+            processed = process_item(item)
+            results.append(processed)
+        except Exception as e:
+            logger.error(f"Failed to process: {e}")
+            continue
+
+    # Aggregation
+    total = sum(r.value for r in results)
+    return total
+'''
+        chunker = HierarchicalChunker()
+        chunks = chunker.chunk_file(code, 'test.py', 'python')
+
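+        # Expect 1 macro chunk and several micro chunks.
+        # Note: micro chunks only appear if the split threshold is below this
+        # function's length (about 20 lines); the default max_lines=50 yields none.
+        macro_chunks = [c for c in chunks if c.metadata.level == 1]
+        micro_chunks = [c for c in chunks if c.metadata.level == 2]
+
+        assert len(macro_chunks) == 1
+        assert len(micro_chunks) > 0
+
+        # Verify the parent-child relationship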
+        for micro in micro_chunks:
+            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id
+```
+
+### 7.2 Integration Tests
+
+```python
+class TestHierarchicalIndexing:
+    """Tests for the complete indexing flow."""
+
+    def test_index_and_search(self):
+        """Layered indexing and retrieval."""
+
+        # 1. Chunk
+        chunker = HierarchicalChunker()
+        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')
+
+        # 2. LLM enhancement
+        enhancer = HierarchicalLLMEnhancer()
+        metadata = enhancer.enhance_hierarchical_chunks(chunks)
+
+        # 3. Embed
+        embedder = Embedder()
+        for chunk in chunks:
+            text = metadata[chunk.metadata.chunk_id].summary
+            chunk.embedding = embedder.embed_single(text)
+
+        # 4. Store
+        vector_store = HierarchicalVectorStore(Path('/tmp/test.db'))
+        for chunk in chunks:
+            vector_store.add_chunk(chunk)
+
+        # 5. Search
+        search_engine = ContextualSearchEngine(vector_store, embedder)
+        results = search_engine.search_with_context(
+            "find loop that processes items",
+            top_k=5
+        )
+
+        # Verify the results
+        assert len(results) > 0
+        assert any(r.metadata.get('parent_context') for r in results)
+```
+
+## 8. Performance Optimization
+
+### 8.1 Batch Processing
+
+```python
+class BatchHierarchicalProcessor:
+    """Runs hierarchical chunking over many files in batches."""
+
+    def process_files_batch(
+        self,
+        file_paths: List[Path],
+        batch_size: int = 10
+    ):
+        """Process in batches to reduce the number of LLM calls."""
+
+        all_chunks = []
+
+        # 1. Chunk all files
+        for file_path in file_paths:
+            content = file_path.read_text()
+            chunks = self.chunker.chunk_file(
+                content, str(file_path), self._detect_language(file_path)
+            )
+            all_chunks.extend(chunks)
+
+        # 2. Batched LLM enhancement (fewer API calls)
+        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
+        for i in range(0, len(macro_chunks), batch_size):
+            batch = macro_chunks[i:i+batch_size]
+            self.enhancer.enhance_batch(batch)
+
+        # 3. Batched embedding
+        all_texts = [c.content for c in all_chunks]
+        embeddings = self.embedder.embed_batch(all_texts)
+        for chunk, embedding in zip(all_chunks, embeddings):
+            chunk.embedding = embedding
+
+        # 4. Batched storage
+        self.vector_store.add_chunks_batch(all_chunks)
+```
+
+### 8.2 Incremental Updates
+
+```python
+class IncrementalIndexer:
+    """Incremental indexer: only processes files that changed."""
+
+    def update_file(self, file_path: Path):
+        """Incrementally update a single file."""
+
+        content = file_path.read_text()
+        content_hash = hashlib.sha256(content.encode()).hexdigest()
+
+        # Check whether the file changed.
+        # Assumes content_hash on level-1 rows holds the hash of the whole file.
+        cursor = self.conn.cursor()
+        cursor.execute("""
+            SELECT content_hash FROM chunks
+            WHERE file_path = ? AND level = 1
+            LIMIT 1
+        """, (str(file_path),))
+
+        row = cursor.fetchone()
+        if row and row[0] == content_hash:
+            logger.info(f"File {file_path} unchanged, skipping")
+            return
+
+        # Delete the old chunks
+        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))
+
+        # Re-index
+        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
+        # ... continue processing
+```
+
+## 9. Potential Problems and Solutions
+
+### 9.1 Problem: Too Many Micro Chunks for Very Large Functions
+
+**Symptom**: Some legacy functions exceed 1000 lines and can produce dozens of micro chunks.
+
+**Solution**:
+```python
+class AdaptiveMicroChunker:
+    """Adaptive micro chunking: adjusts the strategy to the function size."""
+
+    def chunk_logic_blocks(self, macro_chunk, content):
+        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
+
+        if total_lines > 500:
+            # Very large function: extract top-level logic blocks only, no recursion
+            return self._extract_top_level_blocks(macro_chunk, content)
+        elif total_lines > 100:
+            # Large function: limit the recursion depth to 2 levels
+            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
+        else:
+            # Normal function: skip micro chunking entirely
+            return []
+```
+
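+### 9.2 Problem: tree-sitter Parsing Failures
+
+**Symptom**: Parsing may fail or yield an unusable tree for code with syntax errors. Note that tree-sitter is error-tolerant and usually returns a tree containing ERROR nodes rather than raising, so the implementation may also need to inspect the tree for errors.
+
+**Solution**:
+```python
+def chunk_file_with_fallback(self, content, file_path, language):
+    """Chunk with a fallback strategy."""
+
+    try:
+        # Try hierarchical chunking first
+        return self.chunk_file(content, file_path, language)
+    except TreeSitterError as e:  # placeholder for the implementation's parse-failure exception
+        logger.warning(f"Tree-sitter parsing failed: {e}")
+
+        # Fall back to simple regex-based symbol extraction
+        return self._fallback_regex_chunking(content, file_path)
+    except Exception as e:
+        logger.error(f"Chunking failed completely: {e}")
+
+        # Last resort: sliding window
+        return self._fallback_sliding_window(content, file_path, language)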
+```
+
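+### 9.3 Problem: Vector Storage Footprint
+
+**Symptom**: Storing a vector for every chunk can take a lot of space.
+
+**Solution**:
+- **Selective vectorization**: generate vectors only for macro chunks and important micro chunks
+- **Vector compression**: reduce vector size with PCA or quantization
+- **Separate storage**: keep vectors in a dedicated vector index (e.g., Faiss) and store only metadata in SQLite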
+
+```python
+class SelectiveVectorization:
+    """Selective vectorization: reduces storage overhead."""
+
+    VECTORIZE_CHUNK_TYPES = {
+        'function_definition',   # always vectorized
+        'class_definition',      # always vectorized
+        'for_statement',         # loop blocks
+        'try_statement',         # exception handling
+        # 'if_statement' is usually not vectorized on its own; it relies on the parent chunk
+    }
+
+    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
+        """Decide whether a vector should be generated for the chunk."""
+
+        # Level 1 is always vectorized
+        if chunk.metadata.level == 1:
+            return True
+
+        # Level 2 depends on type and size
+        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
+            return False
+
+        # Blocks that are too small (<5 lines) are not vectorized
+        lines = chunk.metadata.end_line - chunk.metadata.start_line
+        if lines < 5:
+            return False
+
+        return True
+```
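+
+The "vector compression" option listed above can be illustrated with simple symmetric int8 quantization, which stores one byte per dimension plus a single scale factor. This is a minimal sketch of the general technique, not something the design prescribes; `quantize_int8` and `dequantize_int8` are hypothetical helpers.
+
+```python
+from typing import List, Tuple
+
+def quantize_int8(vec: List[float]) -> Tuple[bytes, float]:
+    """Map each value to a signed 8-bit integer in [-127, 127]."""
+    # The scale maps the largest magnitude to 127; fall back to 1.0 for all-zero vectors
+    scale = max((abs(x) for x in vec), default=0.0) / 127 or 1.0
+    # & 0xFF stores negative values in two's-complement form
+    return bytes(round(x / scale) & 0xFF for x in vec), scale
+
+def dequantize_int8(blob: bytes, scale: float) -> List[float]:
+    """Approximate inverse of quantize_int8."""
+    return [(b - 256 if b > 127 else b) * scale for b in blob]
+
+vec = [0.5, -1.0, 0.25]
+blob, scale = quantize_int8(vec)
+approx = dequantize_int8(blob, scale)
+max_error = max(abs(a - b) for a, b in zip(vec, approx))
+print(len(blob), max_error <= scale)
+```
+
+Compared with float32 storage this is a 4x size reduction, with a per-dimension error bounded by roughly half the scale factor.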
+
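+## 10. Implementation Roadmap
+
+### Phase 1: Core Architecture (2-3 weeks)
+- [x] Design the data structures (HierarchicalChunk, ChunkMetadata)
+- [ ] Implement MacroChunker (reusing the existing code_extractor)
+- [ ] Implement a basic MicroChunker
+- [ ] Database schema design and migration
+- [ ] Unit tests
+
+### Phase 2: LLM Integration (1-2 weeks)
+- [ ] Implement HierarchicalLLMEnhancer
+- [ ] Design the per-level prompt templates
+- [ ] Batch processing optimization
+- [ ] Integration tests
+
+### Phase 3: Vectorization and Retrieval (1-2 weeks)
+- [ ] Implement HierarchicalVectorStore
+- [ ] Implement ContextualSearchEngine
+- [ ] Context expansion logic
+- [ ] Retrieval performance tests
+
+### Phase 4: Optimization and Polish (2 weeks)
+- [ ] Performance optimization (batch processing, incremental updates)
+- [ ] Harden the fallback strategies
+- [ ] Selective vectorization
+- [ ] Comprehensive testing and documentation
+
+### Phase 5: Production Deployment (1 week)
+- [ ] CLI integration
+- [ ] Expose configuration options
+- [ ] Production environment testing
+- [ ] Release
+
+**Total estimated time**: 7-10 weeks
+
+## 11. Success Metrics
+
+1. **Coverage**: more than 95% of code is chunked correctly
+2. **Accuracy**: hierarchy relationships are more than 98% accurate
+3. **Retrieval quality**: retrieval relevance improves by 30%+ compared with single-layer chunking
+4. **Performance**: under 100 ms to chunk a single file; batch throughput above 100 files per minute
+5. **Storage efficiency**: 40%+ less space than vectorizing every chunk
+
+## 12. References
+
+- [Tree-sitter Documentation](https://tree-sitter.github.io/)
+- [AST-based Code Analysis](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
+- [Hierarchical Text Segmentation](https://arxiv.org/abs/2104.08836)
+- Existing code: `src/codexlens/semantic/chunker.py`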
--- a/codex-lens/docs/SEMANTIC_GRAPH_DESIGN.md
+++ b/codex-lens/docs/SEMANTIC_GRAPH_DESIGN.md