Mirror of https://github.com/catlog22/Claude-Code-Workflow.git (synced 2026-02-12 02:37:45 +08:00)

feat: add semantic graph design for static code analysis

- Introduced a comprehensive design document for a Code Semantic Graph aimed at enhancing static analysis capabilities.
- Defined the architecture, core components, and implementation steps for analyzing function calls, data flow, and dependencies.
- Included detailed specifications for nodes and edges in the graph, along with the database schema for storage.
- Outlined implementation phases, technical challenges, success metrics, and application scenarios.
@@ -43,4 +43,5 @@ Before implementation, always:
 - `exact`: Known exact pattern
 - `fuzzy`: Typo-tolerant search
 - `semantic`: Concept-based search
 - `graph`: Dependency analysis
@@ -45,3 +45,49 @@
 **Use semantic search** for exploratory tasks
 **Use indexed search** for large, stable codebases
 **Use Exa** for external/public knowledge
+
+## ⚡ Core Search Tools
+
+**rg (ripgrep)**: Fast content search with regex support
+**find**: File/directory location by name patterns
+**grep**: Built-in pattern matching (fallback when rg unavailable)
+**get_modules_by_depth**: Program architecture analysis (MANDATORY before planning)
+
+## 🔧 Quick Command Reference
+
+```bash
+# Semantic File Discovery (codebase-retrieval via CCW)
+ccw cli exec "
+PURPOSE: Discover files relevant to task/feature
+TASK: • List all files related to [task/feature description]
+MODE: analysis
+CONTEXT: @**/*
+EXPECTED: Relevant file paths with relevance explanation
+RULES: Focus on direct relevance to task requirements | analysis=READ-ONLY
+" --tool gemini --cd [directory]
+
+# Program Architecture (MANDATORY before planning)
+ccw tool exec get_modules_by_depth '{}'
+
+# Content Search (rg preferred)
+rg "pattern" --type js -n        # Search JS files with line numbers
+rg -i "case-insensitive"         # Ignore case
+rg -C 3 "context"                # Show 3 lines before/after
+
+# File Search
+find . -name "*.ts" -type f      # Find TypeScript files
+find . -path "*/node_modules" -prune -o -name "*.js" -print
+
+# Workflow Examples
+rg "IMPL-\d+" .workflow/ --type json               # Find task IDs
+find .workflow/ -name "*.json" -path "*/.task/*"   # Locate task files
+rg "status.*pending" .workflow/.task/              # Find pending tasks
+```
+
+## ⚡ Performance Tips
+
+- **rg > grep** for content search
+- **Use --type filters** to limit file types
+- **Exclude dirs**: `--glob '!node_modules'`
+- **Use -F** for literal strings (no regex)
@@ -13,7 +13,7 @@
 **rg (ripgrep)**: Fast content search with regex support
 **find**: File/directory location by name patterns
 **grep**: Built-in pattern matching (fallback when rg unavailable)
-**get_modules_by_depth.sh**: Program architecture analysis (MANDATORY before planning)
+**get_modules_by_depth**: Program architecture analysis (MANDATORY before planning)
.mcp.json (15 changed lines)
@@ -1,22 +1,11 @@
 {
   "mcpServers": {
-    "test-mcp-server": {
-      "command": "npx",
-      "args": [
-        "-y",
-        "@modelcontextprotocol/server-filesystem",
-        "D:/Claude_dms3"
-      ]
-    },
     "ccw-tools": {
       "command": "npx",
       "args": [
         "-y",
         "ccw-mcp"
-      ],
-      "env": {
-        "CCW_ENABLED_TOOLS": "write_file,edit_file,codex_lens,smart_search"
-      }
+      ]
     }
   }
 }
IMPLEMENTATION_SUMMARY.md (deleted file)
@@ -1,190 +0,0 @@

# Implementation Summary: Rules CLI Generation Feature

## Status: ✅ Complete

## Files Modified

### D:\Claude_dms3\ccw\src\core\routes\rules-routes.ts

**Changes:**
1. Added import for `executeCliTool` from cli-executor
2. Implemented the `generateRuleViaCLI()` function
3. Modified the POST `/api/rules/create` endpoint to support `mode: 'cli-generate'`

## Implementation Details

### 1. New Function: `generateRuleViaCLI()`

**Location:** lines 224-340

**Purpose:** Generate rule content using the Gemini CLI, based on different generation strategies

**Parameters:**
- `generationType`: 'description' | 'template' | 'extract'
- `description`: Natural language description of the rule
- `templateType`: Template category for structured generation
- `extractScope`: File pattern for code analysis (e.g., 'src/**/*.ts')
- `extractFocus`: Focus areas for extraction (e.g., 'error handling, naming')
- `fileName`: Target filename (must end with .md)
- `location`: 'project' or 'user'
- `subdirectory`: Optional subdirectory path
- `projectPath`: Project root directory

**Process Flow:**
1. Parse parameters and determine the generation type
2. Build the appropriate CLI prompt template for that type
3. Execute the Gemini CLI with:
   - Tool: 'gemini'
   - Mode: 'write' for description/template, 'analysis' for extract
   - Timeout: 10 minutes (600000 ms)
   - Working directory: projectPath
4. Validate the CLI execution result
5. Extract the generated content from stdout
6. Call `createRule()` to save the file
7. Return the result with an execution ID
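For orientation, the flow above corresponds roughly to the following shape (a minimal sketch, not the shipped source; the exact signature and result fields of `executeCliTool`, and the `buildPromptFor` helper, are assumptions inferred from this summary):

```typescript
// Hypothetical sketch of the documented flow; shapes are inferred, not copied
// from rules-routes.ts.
import { executeCliTool } from '../cli-executor';

async function generateRuleViaCLI(params: {
  generationType: 'description' | 'template' | 'extract';
  description?: string;
  templateType?: string;
  extractScope?: string;
  fileName: string;
  location: 'project' | 'user';
  projectPath: string;
}) {
  // Steps 1-2: choose mode and prompt from the generation type
  const mode = params.generationType === 'extract' ? 'analysis' : 'write';
  const prompt = buildPromptFor(params); // hypothetical helper

  // Step 3: run the Gemini CLI with a 10-minute timeout in the project dir
  const result = await executeCliTool({
    tool: 'gemini',
    mode,
    prompt,
    timeout: 600000,
    cwd: params.projectPath,
  });

  // Steps 4-5: validate and pull the generated markdown from stdout
  if (!result.success) throw new Error(`CLI execution failed: ${result.stderr}`);
  const content = result.stdout?.trim();
  if (!content) throw new Error('CLI execution returned empty content');

  // Steps 6-7: persist via the existing createRule() and report back
  const saved = await createRule(params.fileName, content, params.location);
  return { ...saved, executionId: result.executionId };
}
```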
### 2. Prompt Templates

#### Description Mode (write)
```
PURPOSE: Generate Claude Code memory rule from description to guide Claude's behavior
TASK: • Analyze rule requirements • Generate markdown content with clear instructions
MODE: write
EXPECTED: Complete rule content in markdown format
RULES: $(cat ~/.claude/workflows/cli-templates/prompts/universal/00-universal-rigorous-style.txt)
```

#### Template Mode (write)
```
PURPOSE: Generate Claude Code rule from template type
TASK: • Create rule based on {templateType} • Generate structured markdown content
MODE: write
EXPECTED: Complete rule content in markdown format following template structure
RULES: $(cat ~/.claude/workflows/cli-templates/prompts/universal/00-universal-rigorous-style.txt)
```

#### Extract Mode (analysis)
```
PURPOSE: Extract coding rules from existing codebase to document patterns and conventions
TASK: • Analyze code patterns • Extract common conventions • Identify best practices
MODE: analysis
CONTEXT: @{extractScope || '**/*'}
EXPECTED: Rule content based on codebase analysis with examples
RULES: $(cat ~/.claude/workflows/cli-templates/prompts/analysis/02-analyze-code-patterns.txt)
```
### 3. API Endpoint Modification

**Endpoint:** POST `/api/rules/create`

**Enhanced Request Body:**
```json
{
  "mode": "cli-generate",            // NEW: triggers CLI generation
  "generationType": "description",   // NEW: 'description' | 'template' | 'extract'
  "description": "...",              // NEW: for description mode
  "templateType": "...",             // NEW: for template mode
  "extractScope": "src/**/*.ts",     // NEW: for extract mode
  "extractFocus": "...",             // NEW: for extract mode
  "fileName": "rule-name.md",        // REQUIRED
  "location": "project",             // REQUIRED: 'project' | 'user'
  "subdirectory": "",                // OPTIONAL
  "projectPath": "..."               // OPTIONAL: defaults to initialPath
}
```

**Backward Compatibility:** Existing manual creation still works:
```json
{
  "fileName": "rule-name.md",
  "content": "# Rule Content\n...",
  "location": "project",
  "paths": [],
  "subdirectory": ""
}
```

**Response Format:**
```json
{
  "success": true,
  "fileName": "rule-name.md",
  "location": "project",
  "path": "/absolute/path/to/rule-name.md",
  "subdirectory": null,
  "generatedContent": "# Generated Content\n...",
  "executionId": "1734168000000-gemini"
}
```
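For reference, a CLI-generation call from a client might look like this (a minimal sketch against the request and response shapes above; the host, port, and rule text are placeholders):

```typescript
// Hypothetical client call; URL and payload values are illustrative only.
const res = await fetch('http://localhost:3000/api/rules/create', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    mode: 'cli-generate',
    generationType: 'description',
    description: 'Always prefer rg over grep for content search',
    fileName: 'search-tools.md',
    location: 'project',
  }),
});
const data = await res.json();
if (data.success) console.log('Rule written to', data.path);
```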
## Error Handling

### Validation Errors
- Missing `fileName`: "File name is required"
- Missing `location`: "Location is required (project or user)"
- Missing `generationType` in CLI mode: "generationType is required for CLI generation mode"
- Missing `description` for description mode: "description is required for description-based generation"
- Missing `templateType` for template mode: "templateType is required for template-based generation"
- Unknown `generationType`: "Unknown generation type: {type}"

### CLI Execution Errors
- CLI tool failure: Returns `{ error: "CLI execution failed: ...", stderr: "..." }`
- Empty content: Returns `{ error: "CLI execution returned empty content", stdout: "...", stderr: "..." }`
- Timeout: The CLI executor times out after 10 minutes
- File exists: "Rule '{fileName}' already exists in {location} location"

## Testing

### Test Document
Created: `D:\Claude_dms3\test-rules-cli-generation.md`

Contains:
- API usage examples for all 3 generation types
- Request/response format examples
- Error handling scenarios
- Integration details

### Compilation Test
✅ TypeScript compilation successful (`npm run build`)

## Integration Points

### Dependencies
- **cli-executor.ts**: Provides `executeCliTool()` for Gemini execution
- **createRule()**: Existing function for file creation
- **handlePostRequest()**: Existing request handler from RouteContext

### CLI Tool
- **Tool**: Gemini (via `executeCliTool()`)
- **Timeout**: 10 minutes (600000 ms)
- **Mode**: 'write' for generation, 'analysis' for extraction
- **Working Directory**: Project path, for context access

## Next Steps (Not Implemented)

1. **UI Integration**: Add a frontend interface in the Rules Manager dashboard
2. **Streaming Output**: Display CLI execution progress in real time
3. **Preview**: Show generated content before saving
4. **Refinement**: Allow iterative refinement of generated rules
5. **Templates Library**: Add predefined template types
6. **History**: Track generation history and allow regeneration

## Verification Checklist

- [x] Import cli-executor functions
- [x] Implement `generateRuleViaCLI()` with 3 generation types
- [x] Build appropriate prompts for each type
- [x] Use correct MODE (analysis vs write)
- [x] Set timeout to at least 10 minutes
- [x] Integrate with `createRule()` for file creation
- [x] Modify POST endpoint to support `mode: 'cli-generate'`
- [x] Validate required parameters
- [x] Return unified result format
- [x] Handle errors appropriately
- [x] Maintain backward compatibility
- [x] Verify TypeScript compilation
- [x] Create test documentation

## Files Created
- `D:\Claude_dms3\test-rules-cli-generation.md`: Test documentation
- `D:\Claude_dms3\IMPLEMENTATION_SUMMARY.md`: This file
@@ -77,7 +77,7 @@ function getMcpServersFromFile(filePath) {
  */
 function addMcpServerToMcpJson(projectPath, serverName, serverConfig) {
   try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizePathForFileSystem(projectPath);
     const mcpJsonPath = join(normalizedPath, '.mcp.json');

     // Read existing .mcp.json or create new structure
@@ -115,7 +115,7 @@ function addMcpServerToMcpJson(projectPath, serverName, serverConfig) {
  */
 function removeMcpServerFromMcpJson(projectPath, serverName) {
   try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizePathForFileSystem(projectPath);
     const mcpJsonPath = join(normalizedPath, '.mcp.json');

     if (!existsSync(mcpJsonPath)) {
@@ -238,22 +238,43 @@ function getMcpConfig() {
 }

 /**
- * Normalize project path for .claude.json (Windows backslash format)
+ * Normalize path to filesystem format (for accessing .mcp.json files)
+ * Always uses forward slashes for cross-platform compatibility
  * @param {string} path
  * @returns {string}
  */
-function normalizeProjectPathForConfig(path) {
-  // Convert forward slashes to backslashes for Windows .claude.json format
-  let normalized = path.replace(/\//g, '\\');
+function normalizePathForFileSystem(path) {
+  let normalized = path.replace(/\\/g, '/');

-  // Handle /d/path format -> D:\path
-  if (normalized.match(/^\\[a-zA-Z]\\/)) {
+  // Handle /d/path format -> D:/path
+  if (normalized.match(/^\/[a-zA-Z]\//)) {
     normalized = normalized.charAt(1).toUpperCase() + ':' + normalized.slice(2);
   }

   return normalized;
 }

+/**
+ * Normalize project path to match existing format in .claude.json
+ * Checks both forward slash and backslash formats to find existing entry
+ * @param {string} path
+ * @param {Object} claudeConfig - Optional existing config to check format
+ * @returns {string}
+ */
+function normalizeProjectPathForConfig(path, claudeConfig = null) {
+  // IMPORTANT: Always normalize to forward slashes to prevent duplicate entries
+  // (e.g., prevents both "D:/Claude_dms3" and "D:\\Claude_dms3")
+  let normalizedForward = path.replace(/\\/g, '/');
+
+  // Handle /d/path format -> D:/path
+  if (normalizedForward.match(/^\/[a-zA-Z]\//)) {
+    normalizedForward = normalizedForward.charAt(1).toUpperCase() + ':' + normalizedForward.slice(2);
+  }
+
+  // ALWAYS return forward slash format to prevent duplicates
+  return normalizedForward;
+}
+
 /**
  * Toggle MCP server enabled/disabled
  * @param {string} projectPath
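The practical effect of the split, shown on representative inputs (worked illustrations, not tests from the repo):

```typescript
// Illustrative only: expected behavior of the two helpers above.
normalizePathForFileSystem('D:\\Claude_dms3');     // -> 'D:/Claude_dms3'
normalizePathForFileSystem('/d/Claude_dms3');      // -> 'D:/Claude_dms3'
// The config-side helper now produces the same canonical forward-slash form,
// so .claude.json can no longer accumulate duplicate project entries:
normalizeProjectPathForConfig('D:\\Claude_dms3');  // -> 'D:/Claude_dms3'
```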
@@ -270,7 +291,7 @@ function toggleMcpServerEnabled(projectPath, serverName, enable) {
     const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
     const config = JSON.parse(content);

-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizeProjectPathForConfig(projectPath, config);

     if (!config.projects || !config.projects[normalizedPath]) {
       return { error: `Project not found: ${normalizedPath}` };
@@ -332,7 +353,7 @@ function addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyC
     const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
     const config = JSON.parse(content);

-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
+    const normalizedPath = normalizeProjectPathForConfig(projectPath, config);

     // Create project entry if it doesn't exist
     if (!config.projects) {
@@ -387,8 +408,8 @@ function addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyC
  */
 function removeMcpServerFromProject(projectPath, serverName) {
   try {
-    const normalizedPath = normalizeProjectPathForConfig(projectPath);
-    const mcpJsonPath = join(normalizedPath, '.mcp.json');
+    const normalizedPathForFile = normalizePathForFileSystem(projectPath);
+    const mcpJsonPath = join(normalizedPathForFile, '.mcp.json');

     let removedFromMcpJson = false;
     let removedFromClaudeJson = false;
@@ -409,6 +430,9 @@ function removeMcpServerFromProject(projectPath, serverName) {
       const content = readFileSync(CLAUDE_CONFIG_PATH, 'utf8');
       const config = JSON.parse(content);

+      // Get normalized path that matches existing config format
+      const normalizedPath = normalizeProjectPathForConfig(projectPath, config);
+
       if (config.projects && config.projects[normalizedPath]) {
         const projectConfig = config.projects[normalizedPath];

@@ -597,11 +621,13 @@ export async function handleMcpRoutes(ctx: RouteContext): Promise<boolean> {
   // API: Copy MCP server to project
   if (pathname === '/api/mcp-copy-server' && req.method === 'POST') {
     handlePostRequest(req, res, async (body) => {
-      const { projectPath, serverName, serverConfig } = body;
+      const { projectPath, serverName, serverConfig, configType } = body;
       if (!projectPath || !serverName || !serverConfig) {
         return { error: 'projectPath, serverName, and serverConfig are required', status: 400 };
       }
-      return addMcpServerToProject(projectPath, serverName, serverConfig);
+      // configType: 'mcp' = use .mcp.json (default), 'claude' = use .claude.json
+      const useLegacyConfig = configType === 'claude';
+      return addMcpServerToProject(projectPath, serverName, serverConfig, useLegacyConfig);
     });
     return true;
   }
@@ -733,7 +733,7 @@ Return ONLY valid JSON in this exact format (no markdown, no code blocks, just p
   }

   try {
-    const configPath = join(projectPath, '.claude', 'rules', 'active_memory.md');
+    const configPath = join(projectPath, '.claude', 'CLAUDE.md');
     const configJsonPath = join(projectPath, '.claude', 'active_memory_config.json');
     const enabled = existsSync(configPath);
     let lastSync: string | null = null;
@@ -784,16 +784,12 @@
       return;
     }

-    const rulesDir = join(projectPath, '.claude', 'rules');
     const claudeDir = join(projectPath, '.claude');
-    const configPath = join(rulesDir, 'active_memory.md');
+    const configPath = join(claudeDir, 'CLAUDE.md');
     const configJsonPath = join(claudeDir, 'active_memory_config.json');

     if (enabled) {
       // Enable: Create directories and initial file
-      if (!existsSync(rulesDir)) {
-        mkdirSync(rulesDir, { recursive: true });
-      }
       if (!existsSync(claudeDir)) {
         mkdirSync(claudeDir, { recursive: true });
       }
@@ -803,8 +799,8 @@
         writeFileSync(configJsonPath, JSON.stringify(config, null, 2), 'utf-8');
       }

-      // Create initial active_memory.md with header
-      const initialContent = `# Active Memory
+      // Create initial CLAUDE.md with header
+      const initialContent = `# CLAUDE.md - Project Memory

 > Auto-generated understanding of frequently accessed files.
 > Last updated: ${new Date().toISOString()}
@@ -867,7 +863,7 @@
     return true;
   }

-  // API: Active Memory - Sync (analyze hot files using CLI and update active_memory.md)
+  // API: Active Memory - Sync (analyze hot files using CLI and update CLAUDE.md)
   if (pathname === '/api/memory/active/sync' && req.method === 'POST') {
     let body = '';
     req.on('data', (chunk: Buffer) => { body += chunk.toString(); });
@@ -882,8 +878,8 @@
       return;
     }

-    const rulesDir = join(projectPath, '.claude', 'rules');
-    const configPath = join(rulesDir, 'active_memory.md');
+    const claudeDir = join(projectPath, '.claude');
+    const configPath = join(claudeDir, 'CLAUDE.md');

     // Get hot files from memory store - with fallback
     let hotFiles: any[] = [];
@@ -903,8 +899,8 @@
       return isAbsolute(filePath) ? filePath : join(projectPath, filePath);
     }).filter((p: string) => existsSync(p));

-    // Build the active memory content header
-    let content = `# Active Memory
+    // Build the CLAUDE.md content header
+    let content = `# CLAUDE.md - Project Memory

 > Auto-generated understanding of frequently accessed files using ${tool.toUpperCase()}.
 > Last updated: ${new Date().toISOString()}
@@ -942,14 +938,29 @@ RULES: Be concise. Focus on practical understanding. Include function signatures
         });

         if (result.success && result.execution?.output) {
-          // Extract stdout from output object
-          cliOutput = typeof result.execution.output === 'string'
-            ? result.execution.output
-            : result.execution.output.stdout || '';
+          // Extract stdout from output object with proper serialization
+          const output = result.execution.output;
+          if (typeof output === 'string') {
+            cliOutput = output;
+          } else if (output && typeof output === 'object') {
+            // Handle object output - extract stdout or serialize the object
+            if (output.stdout && typeof output.stdout === 'string') {
+              cliOutput = output.stdout;
+            } else if (output.stderr && typeof output.stderr === 'string') {
+              cliOutput = output.stderr;
+            } else {
+              // Last resort: serialize the entire object as JSON
+              cliOutput = JSON.stringify(output, null, 2);
+            }
+          } else {
+            cliOutput = '';
+          }
         }

-        // Add CLI output to content
-        content += cliOutput + '\n\n---\n\n';
+        // Add CLI output to content (only if not empty)
+        if (cliOutput && cliOutput.trim()) {
+          content += cliOutput + '\n\n---\n\n';
+        }

       } catch (cliErr) {
         // Fallback to basic analysis if CLI fails
@@ -1007,8 +1018,8 @@
       }

       // Ensure directory exists
-      if (!existsSync(rulesDir)) {
-        mkdirSync(rulesDir, { recursive: true });
+      if (!existsSync(claudeDir)) {
+        mkdirSync(claudeDir, { recursive: true });
       }

       // Write the file
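The branching added above is equivalent to a small standalone helper (a sketch restating the diff, not code that exists in the repo):

```typescript
// Hypothetical helper equivalent to the inline logic added in the hunk above.
function extractCliOutput(output: unknown): string {
  if (typeof output === 'string') return output;
  if (output && typeof output === 'object') {
    const o = output as { stdout?: unknown; stderr?: unknown };
    if (typeof o.stdout === 'string' && o.stdout) return o.stdout;
    if (typeof o.stderr === 'string' && o.stderr) return o.stderr;
    // Last resort: serialize the whole object as readable JSON
    return JSON.stringify(output, null, 2);
  }
  return '';
}
```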
@@ -87,15 +87,23 @@ async function toggleMcpServer(serverName, enable) {
   }
 }

-async function copyMcpServerToProject(serverName, serverConfig) {
+async function copyMcpServerToProject(serverName, serverConfig, configType = null) {
   try {
+    // If configType not specified, ask user to choose
+    if (!configType) {
+      const choice = await showConfigTypeDialog();
+      if (!choice) return null; // User cancelled
+      configType = choice;
+    }
+
     const response = await fetch('/api/mcp-copy-server', {
       method: 'POST',
       headers: { 'Content-Type': 'application/json' },
       body: JSON.stringify({
         projectPath: projectPath,
         serverName: serverName,
-        serverConfig: serverConfig
+        serverConfig: serverConfig,
+        configType: configType // 'claude' for .claude.json, 'mcp' for .mcp.json
       })
     });
@@ -105,7 +113,8 @@ async function copyMcpServerToProject(serverName, serverConfig) {
     if (result.success) {
       await loadMcpConfig();
       renderMcpManager();
-      showRefreshToast(`MCP server "${serverName}" added to project`, 'success');
+      const location = configType === 'mcp' ? '.mcp.json' : '.claude.json';
+      showRefreshToast(`MCP server "${serverName}" added to project (${location})`, 'success');
     }
     return result;
   } catch (err) {
@@ -115,6 +124,53 @@ async function copyMcpServerToProject(serverName, serverConfig) {
   }
 }

+// Show dialog to let user choose config type
+function showConfigTypeDialog() {
+  return new Promise((resolve) => {
+    const dialog = document.createElement('div');
+    dialog.className = 'fixed inset-0 bg-black/50 flex items-center justify-center z-50';
+    dialog.innerHTML = `
+      <div class="bg-card border border-border rounded-lg shadow-lg p-6 max-w-md w-full mx-4">
+        <h3 class="text-lg font-semibold mb-4">${t('mcp.chooseInstallLocation')}</h3>
+        <div class="space-y-3 mb-6">
+          <button class="config-type-option w-full text-left px-4 py-3 border border-border rounded-lg hover:bg-accent hover:border-primary transition-all" data-type="claude">
+            <div class="font-medium">${t('mcp.installToClaudeJson')}</div>
+            <div class="text-sm text-muted-foreground mt-1">${t('mcp.claudeJsonDesc')}</div>
+          </button>
+          <button class="config-type-option w-full text-left px-4 py-3 border border-border rounded-lg hover:bg-accent hover:border-primary transition-all" data-type="mcp">
+            <div class="font-medium">${t('mcp.installToMcpJson')}</div>
+            <div class="text-sm text-muted-foreground mt-1">${t('mcp.mcpJsonDesc')}</div>
+          </button>
+        </div>
+        <button class="cancel-btn w-full px-4 py-2 border border-border rounded-lg hover:bg-accent transition-colors">${t('common.cancel')}</button>
+      </div>
+    `;
+    document.body.appendChild(dialog);
+
+    const options = dialog.querySelectorAll('.config-type-option');
+    options.forEach(btn => {
+      btn.addEventListener('click', () => {
+        resolve(btn.dataset.type);
+        document.body.removeChild(dialog);
+      });
+    });
+
+    const cancelBtn = dialog.querySelector('.cancel-btn');
+    cancelBtn.addEventListener('click', () => {
+      resolve(null);
+      document.body.removeChild(dialog);
+    });
+
+    // Close on backdrop click
+    dialog.addEventListener('click', (e) => {
+      if (e.target === dialog) {
+        resolve(null);
+        document.body.removeChild(dialog);
+      }
+    });
+  });
+}
+
 async function removeMcpServerFromProject(serverName) {
   try {
     const response = await fetch('/api/mcp-remove-server', {
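With the new optional argument, callers can either let the dialog run or pin the target file directly (usage illustration; `serverConfig` stands for a previously loaded server config object):

```typescript
// Prompt the user for an install location via the dialog:
await copyMcpServerToProject('ccw-tools', serverConfig);
// Or install straight into the project's .mcp.json, skipping the dialog:
await copyMcpServerToProject('ccw-tools', serverConfig, 'mcp');
```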
@@ -431,7 +431,31 @@ const i18n = {
     'mcp.jsonFormatsHint': 'Supports {"servers": {...}}, {"mcpServers": {...}}, and direct server config formats.',
     'mcp.previewServers': 'Preview (servers to be added):',
     'mcp.create': 'Create',
+    'mcp.chooseInstallLocation': 'Choose Installation Location',
+    'mcp.installToClaudeJson': 'Install to .claude.json',
+    'mcp.installToMcpJson': 'Install to .mcp.json (Recommended)',
+    'mcp.claudeJsonDesc': 'Save in root .claude.json projects section (shared config)',
+    'mcp.mcpJsonDesc': 'Save in project .mcp.json file (recommended for version control)',
+
+    // MCP Templates
+    'mcp.templates': 'MCP Templates',
+    'mcp.savedTemplates': 'saved templates',
+    'mcp.saveAsTemplate': 'Save as Template',
+    'mcp.enterTemplateName': 'Enter template name',
+    'mcp.enterTemplateDesc': 'Enter template description (optional)',
+    'mcp.enterServerName': 'Enter server name',
+    'mcp.templateSaved': 'Template "{name}" saved successfully',
+    'mcp.templateSaveFailed': 'Failed to save template: {error}',
+    'mcp.templateNotFound': 'Template "{name}" not found',
+    'mcp.templateInstalled': 'Server "{name}" installed successfully',
+    'mcp.templateInstallFailed': 'Failed to install template: {error}',
+    'mcp.deleteTemplate': 'Delete Template',
+    'mcp.deleteTemplateConfirm': 'Delete template "{name}"?',
+    'mcp.templateDeleted': 'Template "{name}" deleted successfully',
+    'mcp.templateDeleteFailed': 'Failed to delete template: {error}',
+    'mcp.toProject': 'To Project',
+    'mcp.toGlobal': 'To Global',

     // Hook Manager
     'hook.projectHooks': 'Project Hooks',
     'hook.projectFile': '.claude/settings.json',
@@ -1346,6 +1370,11 @@ const i18n = {
     'mcp.jsonFormatsHint': '支持 {"servers": {...}}、{"mcpServers": {...}} 和直接服务器配置格式。',
     'mcp.previewServers': '预览(将添加的服务器):',
     'mcp.create': '创建',
+    'mcp.chooseInstallLocation': '选择安装位置',
+    'mcp.installToClaudeJson': '安装到 .claude.json',
+    'mcp.installToMcpJson': '安装到 .mcp.json(推荐)',
+    'mcp.claudeJsonDesc': '保存在根目录 .claude.json projects 字段下(共享配置)',
+    'mcp.mcpJsonDesc': '保存在项目 .mcp.json 文件中(推荐用于版本控制)',

     // Hook Manager
     'hook.projectHooks': '项目钩子',
@@ -43,6 +43,9 @@ async function renderMcpManager() {
     await loadMcpConfig();
   }

+  // Load MCP templates
+  await loadMcpTemplates();
+
   const currentPath = projectPath.replace(/\//g, '\\');
   const projectData = mcpAllProjects[currentPath] || {};
   const projectServers = projectData.mcpServers || {};
@@ -269,6 +272,77 @@ async function renderMcpManager() {
     `}
   </div>

+  <!-- MCP Templates Section -->
+  ${mcpTemplates.length > 0 ? `
+  <div class="mcp-section mt-6">
+    <div class="flex items-center justify-between mb-4">
+      <h3 class="text-lg font-semibold text-foreground flex items-center gap-2">
+        <i data-lucide="layout-template" class="w-5 h-5"></i>
+        ${t('mcp.templates')}
+      </h3>
+      <span class="text-sm text-muted-foreground">${mcpTemplates.length} ${t('mcp.savedTemplates')}</span>
+    </div>
+
+    <div class="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
+      ${mcpTemplates.map(template => `
+      <div class="mcp-template-card bg-card border border-border rounded-lg p-4 hover:shadow-md transition-all">
+        <div class="flex items-start justify-between mb-3">
+          <div class="flex-1 min-w-0">
+            <h4 class="font-semibold text-foreground truncate flex items-center gap-2">
+              <i data-lucide="layout-template" class="w-4 h-4 shrink-0"></i>
+              <span class="truncate">${escapeHtml(template.name)}</span>
+            </h4>
+            ${template.description ? `
+            <p class="text-xs text-muted-foreground mt-1 line-clamp-2">${escapeHtml(template.description)}</p>
+            ` : ''}
+          </div>
+        </div>
+
+        <div class="mcp-server-details text-sm space-y-1 mb-3">
+          <div class="flex items-center gap-2 text-muted-foreground">
+            <span class="font-mono text-xs bg-muted px-1.5 py-0.5 rounded">cmd</span>
+            <span class="truncate text-xs" title="${escapeHtml(template.serverConfig.command)}">${escapeHtml(template.serverConfig.command)}</span>
+          </div>
+          ${template.serverConfig.args && template.serverConfig.args.length > 0 ? `
+          <div class="flex items-start gap-2 text-muted-foreground">
+            <span class="font-mono text-xs bg-muted px-1.5 py-0.5 rounded shrink-0">args</span>
+            <span class="text-xs font-mono truncate" title="${escapeHtml(template.serverConfig.args.join(' '))}">${escapeHtml(template.serverConfig.args.slice(0, 2).join(' '))}${template.serverConfig.args.length > 2 ? '...' : ''}</span>
+          </div>
+          ` : ''}
+        </div>
+
+        <div class="mt-3 pt-3 border-t border-border flex items-center justify-between gap-2">
+          <div class="flex items-center gap-2">
+            <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
+                    data-template-name="${escapeHtml(template.name)}"
+                    data-scope="project"
+                    data-action="install-template"
+                    title="${t('mcp.installToProject')}">
+              <i data-lucide="download" class="w-3 h-3"></i>
+              ${t('mcp.toProject')}
+            </button>
+            <button class="text-xs text-success hover:text-success/80 transition-colors flex items-center gap-1"
+                    data-template-name="${escapeHtml(template.name)}"
+                    data-scope="global"
+                    data-action="install-template"
+                    title="${t('mcp.installToGlobal')}">
+              <i data-lucide="globe" class="w-3 h-3"></i>
+              ${t('mcp.toGlobal')}
+            </button>
+          </div>
+          <button class="text-xs text-destructive hover:text-destructive/80 transition-colors"
+                  data-template-name="${escapeHtml(template.name)}"
+                  data-action="delete-template"
+                  title="${t('mcp.deleteTemplate')}">
+            <i data-lucide="trash-2" class="w-3 h-3"></i>
+          </button>
+        </div>
+      </div>
+      `).join('')}
+    </div>
+  </div>
+  ` : ''}
+
   <!-- All Projects MCP Overview Table -->
   <div class="mcp-section mt-6">
     <div class="flex items-center justify-between mb-4">
@@ -402,15 +476,25 @@ function renderProjectAvailableServerCard(entry) {
     ` : ''}
   </div>

-  <div class="mt-3 pt-3 border-t border-border flex items-center justify-between">
-    <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
-            data-server-name="${escapeHtml(name)}"
-            data-server-config="${escapeHtml(JSON.stringify(config))}"
-            data-scope="${source === 'global' ? 'global' : 'project'}"
-            data-action="copy-install-cmd">
-      <i data-lucide="copy" class="w-3 h-3"></i>
-      ${t('mcp.copyInstallCmd')}
-    </button>
+  <div class="mt-3 pt-3 border-t border-border flex items-center justify-between gap-2">
+    <div class="flex items-center gap-2">
+      <button class="text-xs text-primary hover:text-primary/80 transition-colors flex items-center gap-1"
+              data-server-name="${escapeHtml(name)}"
+              data-server-config="${escapeHtml(JSON.stringify(config))}"
+              data-scope="${source === 'global' ? 'global' : 'project'}"
+              data-action="copy-install-cmd">
+        <i data-lucide="copy" class="w-3 h-3"></i>
+        ${t('mcp.copyInstallCmd')}
+      </button>
+      <button class="text-xs text-success hover:text-success/80 transition-colors flex items-center gap-1"
+              data-server-name="${escapeHtml(name)}"
+              data-server-config="${escapeHtml(JSON.stringify(config))}"
+              data-action="save-as-template"
+              title="${t('mcp.saveAsTemplate')}">
+        <i data-lucide="save" class="w-3 h-3"></i>
+        ${t('mcp.saveAsTemplate')}
+      </button>
+    </div>
     ${canRemove ? `
     <button class="text-xs text-destructive hover:text-destructive/80 transition-colors"
             data-server-name="${escapeHtml(name)}"
@@ -617,4 +701,156 @@ function attachMcpEventListeners() {
       await copyMcpInstallCommand(serverName, serverConfig, scope);
     });
   });
+
+  // Save as template buttons
+  document.querySelectorAll('.mcp-server-card button[data-action="save-as-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const serverName = btn.dataset.serverName;
+      const serverConfig = JSON.parse(btn.dataset.serverConfig);
+      await saveMcpAsTemplate(serverName, serverConfig);
+    });
+  });
+
+  // Install from template buttons
+  document.querySelectorAll('.mcp-template-card button[data-action="install-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const templateName = btn.dataset.templateName;
+      const scope = btn.dataset.scope || 'project';
+      await installFromTemplate(templateName, scope);
+    });
+  });
+
+  // Delete template buttons
+  document.querySelectorAll('.mcp-template-card button[data-action="delete-template"]').forEach(btn => {
+    btn.addEventListener('click', async (e) => {
+      const templateName = btn.dataset.templateName;
+      if (confirm(t('mcp.deleteTemplateConfirm', { name: templateName }))) {
+        await deleteMcpTemplate(templateName);
+      }
+    });
+  });
+}
+
+// ========================================
+// MCP Template Management Functions
+// ========================================
+
+let mcpTemplates = [];
+
+/**
+ * Load all MCP templates from API
+ */
+async function loadMcpTemplates() {
+  try {
+    const response = await fetch('/api/mcp-templates');
+    const data = await response.json();
+
+    if (data.success) {
+      mcpTemplates = data.templates || [];
+      console.log('[MCP Templates] Loaded', mcpTemplates.length, 'templates');
+    } else {
+      console.error('[MCP Templates] Failed to load:', data.error);
+      mcpTemplates = [];
+    }
+
+    return mcpTemplates;
+  } catch (error) {
+    console.error('[MCP Templates] Error loading templates:', error);
+    mcpTemplates = [];
+    return [];
+  }
+}
+
+/**
+ * Save MCP server configuration as a template
+ */
+async function saveMcpAsTemplate(serverName, serverConfig) {
+  try {
+    // Prompt for template name and description
+    const templateName = prompt(t('mcp.enterTemplateName'), serverName);
+    if (!templateName) return;
+
+    const description = prompt(t('mcp.enterTemplateDesc'), `Template for ${serverName}`);
+
+    const payload = {
+      name: templateName,
+      description: description || '',
+      serverConfig: serverConfig,
+      category: 'user'
+    };
+
+    const response = await fetch('/api/mcp-templates', {
+      method: 'POST',
+      headers: { 'Content-Type': 'application/json' },
+      body: JSON.stringify(payload)
+    });
+
+    const data = await response.json();
+
+    if (data.success) {
+      showNotification(t('mcp.templateSaved', { name: templateName }), 'success');
+      await loadMcpTemplates();
+      await renderMcpManager(); // Refresh view
+    } else {
+      showNotification(t('mcp.templateSaveFailed', { error: data.error }), 'error');
+    }
+  } catch (error) {
+    console.error('[MCP] Save template error:', error);
+    showNotification(t('mcp.templateSaveFailed', { error: error.message }), 'error');
+  }
+}
+
+/**
+ * Install MCP server from template
+ */
+async function installFromTemplate(templateName, scope = 'project') {
+  try {
+    // Find template
+    const template = mcpTemplates.find(t => t.name === templateName);
+    if (!template) {
+      showNotification(t('mcp.templateNotFound', { name: templateName }), 'error');
+      return;
+    }
+
+    // Prompt for server name (default to template name)
+    const serverName = prompt(t('mcp.enterServerName'), templateName);
+    if (!serverName) return;
+
+    // Install based on scope
+    if (scope === 'project') {
+      await installMcpToProject(serverName, template.serverConfig);
+    } else if (scope === 'global') {
+      await addGlobalMcpServer(serverName, template.serverConfig);
+    }
+
+    showNotification(t('mcp.templateInstalled', { name: serverName }), 'success');
+    await renderMcpManager();
+  } catch (error) {
+    console.error('[MCP] Install from template error:', error);
+    showNotification(t('mcp.templateInstallFailed', { error: error.message }), 'error');
+  }
+}
+
+/**
+ * Delete MCP template
+ */
+async function deleteMcpTemplate(templateName) {
+  try {
+    const response = await fetch(`/api/mcp-templates/${encodeURIComponent(templateName)}`, {
+      method: 'DELETE'
+    });
+
+    const data = await response.json();
+
+    if (data.success) {
+      showNotification(t('mcp.templateDeleted', { name: templateName }), 'success');
+      await loadMcpTemplates();
+      await renderMcpManager();
+    } else {
+      showNotification(t('mcp.templateDeleteFailed', { error: data.error }), 'error');
+    }
+  } catch (error) {
+    console.error('[MCP] Delete template error:', error);
+    showNotification(t('mcp.templateDeleteFailed', { error: error.message }), 'error');
+  }
 }
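The payload built in `saveMcpAsTemplate` implies a template record of roughly this shape (an assumption inferred from the payload fields; the server-side storage format is not shown in this diff):

```typescript
// Hypothetical template record, inferred from the saveMcpAsTemplate payload.
const exampleTemplate = {
  name: 'filesystem',
  description: 'Template for filesystem',
  serverConfig: {
    command: 'npx',
    args: ['-y', '@modelcontextprotocol/server-filesystem', 'D:/Claude_dms3'],
  },
  category: 'user',
};
```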
@@ -588,9 +588,21 @@ function closeRuleCreateModal(event) {

 function selectRuleLocation(location) {
   ruleCreateState.location = location;
-  // Re-render modal
-  closeRuleCreateModal();
-  openRuleCreateModal();
+  // Update button styles without re-rendering modal
+  const buttons = document.querySelectorAll('.location-btn');
+  buttons.forEach(btn => {
+    const isProject = btn.querySelector('.font-medium')?.textContent?.includes(t('rules.projectRules'));
+    const isUser = btn.querySelector('.font-medium')?.textContent?.includes(t('rules.userRules'));
+
+    if ((isProject && location === 'project') || (isUser && location === 'user')) {
+      btn.classList.remove('border-border', 'hover:border-primary/50');
+      btn.classList.add('border-primary', 'bg-primary/10');
+    } else {
+      btn.classList.remove('border-primary', 'bg-primary/10');
+      btn.classList.add('border-border', 'hover:border-primary/50');
+    }
+  });
 }

 function toggleRuleConditional() {
@@ -569,9 +569,21 @@ function closeSkillCreateModal(event) {

 function selectSkillLocation(location) {
   skillCreateState.location = location;
-  // Re-render modal
-  closeSkillCreateModal();
-  openSkillCreateModal();
+  // Update button styles without re-rendering modal
+  const buttons = document.querySelectorAll('.location-btn');
+  buttons.forEach(btn => {
+    const isProject = btn.querySelector('.font-medium')?.textContent?.includes(t('skills.projectSkills'));
+    const isUser = btn.querySelector('.font-medium')?.textContent?.includes(t('skills.userSkills'));
+
+    if ((isProject && location === 'project') || (isUser && location === 'user')) {
+      btn.classList.remove('border-border', 'hover:border-primary/50');
+      btn.classList.add('border-primary', 'bg-primary/10');
+    } else {
+      btn.classList.remove('border-primary', 'bg-primary/10');
+      btn.classList.add('border-border', 'hover:border-primary/50');
+    }
+  });
 }

 function switchSkillCreateMode(mode) {
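The rule and skill hunks above are near-identical apart from the i18n keys; if this pattern spreads further, a shared helper could absorb the duplication (a hypothetical refactor sketch, not code from the repo):

```typescript
// Hypothetical shared helper; projectKey/userKey would be e.g.
// 'rules.projectRules'/'rules.userRules' or 'skills.projectSkills'/'skills.userSkills'.
function highlightLocationButtons(location: string, projectKey: string, userKey: string) {
  document.querySelectorAll('.location-btn').forEach(btn => {
    const label = btn.querySelector('.font-medium')?.textContent ?? '';
    const selected =
      (label.includes(t(projectKey)) && location === 'project') ||
      (label.includes(t(userKey)) && location === 'user');
    // Toggle the selected/unselected style pairs in one place
    btn.classList.toggle('border-primary', selected);
    btn.classList.toggle('bg-primary/10', selected);
    btn.classList.toggle('border-border', !selected);
    btn.classList.toggle('hover:border-primary/50', !selected);
  });
}
```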
codex-lens/docs/DESIGN_EVALUATION_REPORT.md (new file, 1010 lines; diff suppressed because it is too large)
codex-lens/docs/DOCSTRING_LLM_HYBRID_DESIGN.md (new file, 972 lines):
@@ -0,0 +1,972 @@
|
|||||||
|
# Docstring与LLM混合策略设计方案
|
||||||
|
|
||||||
|
## 1. 背景与目标
|
||||||
|
|
||||||
|
### 1.1 当前问题
|
||||||
|
|
||||||
|
现有 `llm_enhancer.py` 的实现存在以下问题:
|
||||||
|
|
||||||
|
1. **忽略已有文档**:对所有代码无差别调用LLM,即使已有高质量的docstring
|
||||||
|
2. **成本浪费**:重复生成已有信息,增加API调用费用和时间
|
||||||
|
3. **信息质量不一致**:LLM生成的内容可能不如作者编写的docstring准确
|
||||||
|
4. **缺少作者意图**:丢失了docstring中的设计决策、使用示例等关键信息
|
||||||
|
|
||||||
|
### 1.2 设计目标
|
||||||
|
|
||||||
|
实现**智能混合策略**,结合docstring和LLM的优势:
|
||||||
|
|
||||||
|
1. **优先使用docstring**:作为最权威的信息源
|
||||||
|
2. **LLM作为补充**:填补docstring缺失或质量不足的部分
|
||||||
|
3. **智能质量评估**:自动判断docstring质量,决定是否需要LLM增强
|
||||||
|
4. **成本优化**:减少不必要的LLM调用,降低API费用
|
||||||
|
5. **信息融合**:将docstring和LLM生成的内容有机结合
|
||||||
|
|
||||||
|
## 2. 技术架构
|
||||||
|
|
||||||
|
### 2.1 整体流程
|
||||||
|
|
||||||
|
```
|
||||||
|
Code Symbol
|
||||||
|
↓
|
||||||
|
[Docstring Extractor] ← 提取docstring
|
||||||
|
↓
|
||||||
|
[Quality Evaluator] ← 评估docstring质量
|
||||||
|
↓
|
||||||
|
├─ High Quality → Use Docstring Directly
|
||||||
|
│ + LLM Generate Keywords Only
|
||||||
|
│
|
||||||
|
├─ Medium Quality → LLM Refine & Enhance
|
||||||
|
│ (docstring作为base)
|
||||||
|
│
|
||||||
|
└─ Low/No Docstring → LLM Full Generation
|
||||||
|
(现有流程)
|
||||||
|
↓
|
||||||
|
[Metadata Merger] ← 合并docstring和LLM内容
|
||||||
|
↓
|
||||||
|
Final SemanticMetadata
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2.2 核心组件
|
||||||
|
|
||||||
|
```python
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from enum import Enum
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
class DocstringQuality(Enum):
|
||||||
|
"""Docstring质量等级"""
|
||||||
|
MISSING = "missing" # 无docstring
|
||||||
|
LOW = "low" # 质量低:<10字符或纯占位符
|
||||||
|
MEDIUM = "medium" # 质量中:有基本描述但不完整
|
||||||
|
HIGH = "high" # 质量高:详细且结构化
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DocstringMetadata:
|
||||||
|
"""从docstring提取的元数据"""
|
||||||
|
raw_text: str
|
||||||
|
quality: DocstringQuality
|
||||||
|
summary: Optional[str] = None # 提取的摘要
|
||||||
|
parameters: Optional[dict] = None # 参数说明
|
||||||
|
returns: Optional[str] = None # 返回值说明
|
||||||
|
examples: Optional[str] = None # 使用示例
|
||||||
|
notes: Optional[str] = None # 注意事项
|
||||||
|
```
|
||||||
|
|
||||||
|
## 3. 详细实现步骤
|
||||||
|
|
### 3.1 Docstring Extraction and Parsing

```python
import re
from typing import List, Optional


class DocstringExtractor:
    """Docstring extractor."""

    # Docstring style detection patterns
    GOOGLE_STYLE_PATTERN = re.compile(
        r'Args:|Returns:|Raises:|Examples:|Note:',
        re.MULTILINE
    )

    NUMPY_STYLE_PATTERN = re.compile(
        r'Parameters\n-+|Returns\n-+|Examples\n-+',
        re.MULTILINE
    )

    def extract_from_code(self, content: str, symbol: Symbol) -> Optional[str]:
        """Extract the docstring for a symbol from source code."""
        lines = content.splitlines()
        start_line = symbol.range[0] - 1  # 0-indexed

        # Look for the first string literal after the definition,
        # usually within the next few lines.
        for i in range(start_line + 1, min(start_line + 10, len(lines))):
            line = lines[i].strip()

            # Python triple-quoted string
            if line.startswith('"""') or line.startswith("'''"):
                return self._extract_multiline_docstring(lines, i)

        return None

    def _extract_multiline_docstring(
        self,
        lines: List[str],
        start_idx: int
    ) -> str:
        """Extract a (possibly multi-line) docstring."""
        quote_char = '"""' if lines[start_idx].strip().startswith('"""') else "'''"
        docstring_lines = []

        # Single-line docstring?
        first_line = lines[start_idx].strip()
        if first_line.count(quote_char) == 2:
            # Single line: """This is a docstring."""
            return first_line.strip(quote_char).strip()

        # Multi-line docstring
        for i in range(start_idx, len(lines)):
            line = lines[i]

            if i == start_idx:
                # First line: strip the opening quotes
                docstring_lines.append(line.strip().lstrip(quote_char))
            elif quote_char in line:
                # Closing line: strip the closing quotes
                docstring_lines.append(line.strip().rstrip(quote_char))
                break
            else:
                docstring_lines.append(line.strip())

        return '\n'.join(docstring_lines).strip()

    def parse_docstring(self, raw_docstring: str) -> DocstringMetadata:
        """Parse a docstring and extract structured information."""
        if not raw_docstring:
            return DocstringMetadata(
                raw_text="",
                quality=DocstringQuality.MISSING
            )

        # Assess quality
        quality = self._evaluate_quality(raw_docstring)

        # Build the metadata object
        metadata = DocstringMetadata(
            raw_text=raw_docstring,
            quality=quality,
        )

        # Extract the summary (first line or first paragraph)
        metadata.summary = self._extract_summary(raw_docstring)

        # For Google or NumPy style, extract structured sections
        if self.GOOGLE_STYLE_PATTERN.search(raw_docstring):
            self._parse_google_style(raw_docstring, metadata)
        elif self.NUMPY_STYLE_PATTERN.search(raw_docstring):
            self._parse_numpy_style(raw_docstring, metadata)

        return metadata

    def _evaluate_quality(self, docstring: str) -> DocstringQuality:
        """Assess docstring quality."""
        if not docstring or len(docstring.strip()) == 0:
            return DocstringQuality.MISSING

        # Placeholder docstrings count as low quality
        placeholders = ['todo', 'fixme', 'tbd', 'placeholder', '...']
        if any(p in docstring.lower() for p in placeholders):
            return DocstringQuality.LOW

        # Length check
        if len(docstring.strip()) < 10:
            return DocstringQuality.LOW

        # Does it contain structured sections?
        has_structure = (
            self.GOOGLE_STYLE_PATTERN.search(docstring) or
            self.NUMPY_STYLE_PATTERN.search(docstring)
        )

        # Is there enough descriptive text?
        word_count = len(docstring.split())

        if has_structure and word_count >= 20:
            return DocstringQuality.HIGH
        elif word_count >= 10:
            return DocstringQuality.MEDIUM
        else:
            return DocstringQuality.LOW

    def _extract_summary(self, docstring: str) -> str:
        """Extract the summary (first non-empty line)."""
        lines = docstring.split('\n')
        for line in lines:
            if line.strip():
                return line.strip()

        return ""

    def _parse_google_style(self, docstring: str, metadata: DocstringMetadata):
        """Parse a Google-style docstring."""
        # Extract Args
        args_match = re.search(r'Args:(.*?)(?=Returns:|Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
        if args_match:
            metadata.parameters = self._parse_args_section(args_match.group(1))

        # Extract Returns
        returns_match = re.search(r'Returns:(.*?)(?=Raises:|Examples:|Note:|\Z)', docstring, re.DOTALL)
        if returns_match:
            metadata.returns = returns_match.group(1).strip()

        # Extract Examples
        examples_match = re.search(r'Examples:(.*?)(?=Note:|\Z)', docstring, re.DOTALL)
        if examples_match:
            metadata.examples = examples_match.group(1).strip()

    def _parse_args_section(self, args_text: str) -> dict:
        """Parse the parameter list."""
        params = {}
        # Match "param_name (type): description" or "param_name: description"
        pattern = re.compile(r'(\w+)\s*(?:\(([^)]+)\))?\s*:\s*(.+)')

        for line in args_text.split('\n'):
            match = pattern.search(line.strip())
            if match:
                param_name, param_type, description = match.groups()
                params[param_name] = {
                    'type': param_type,
                    'description': description.strip()
                }

        return params
```
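
A minimal usage sketch of the extractor, assuming the `Symbol`, `DocstringMetadata`, and `DocstringQuality` structures defined earlier in this document (the sample code is invented for illustration):

```python
# Illustration only; Symbol/DocstringQuality are the dataclasses this
# design defines earlier, not a published API.
code = '''
def add(a, b):
    """Add two integers and return their sum.

    Args:
        a (int): First operand to add.
        b (int): Second operand to add.

    Returns:
        int: The arithmetic sum of the two operands.
    """
    return a + b
'''

extractor = DocstringExtractor()
symbol = Symbol(name='add', kind='function', range=(2, 12))
metadata = extractor.parse_docstring(extractor.extract_from_code(code, symbol) or "")

assert metadata.quality == DocstringQuality.HIGH
assert 'a' in metadata.parameters and 'b' in metadata.parameters
```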

### 3.2 Hybrid Strategy Engine

````python
import json
import re
from typing import Dict, List


class HybridEnhancer:
    """Hybrid enhancer combining docstrings with LLM generation."""

    def __init__(
        self,
        llm_enhancer: LLMEnhancer,
        docstring_extractor: DocstringExtractor
    ):
        self.llm_enhancer = llm_enhancer
        self.docstring_extractor = docstring_extractor

    def enhance_with_strategy(
        self,
        file_data: FileData,
        symbols: List[Symbol]
    ) -> Dict[str, SemanticMetadata]:
        """Select an enhancement strategy per symbol based on docstring quality."""
        results = {}

        for symbol in symbols:
            # 1. Extract and parse the docstring
            raw_docstring = self.docstring_extractor.extract_from_code(
                file_data.content, symbol
            )
            doc_metadata = self.docstring_extractor.parse_docstring(raw_docstring or "")

            # 2. Pick a strategy based on quality
            semantic_metadata = self._apply_strategy(
                file_data, symbol, doc_metadata
            )

            results[symbol.name] = semantic_metadata

        return results

    def _apply_strategy(
        self,
        file_data: FileData,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Apply the hybrid strategy."""
        quality = doc_metadata.quality

        if quality == DocstringQuality.HIGH:
            # High quality: use the docstring as-is; the LLM only generates keywords
            return self._use_docstring_with_llm_keywords(symbol, doc_metadata)

        elif quality == DocstringQuality.MEDIUM:
            # Medium quality: let the LLM refine and enhance it
            return self._refine_with_llm(file_data, symbol, doc_metadata)

        else:  # LOW or MISSING
            # Low quality or missing: generate everything with the LLM
            return self._full_llm_generation(file_data, symbol)

    def _use_docstring_with_llm_keywords(
        self,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Strategy 1: use the docstring; the LLM only generates keywords."""
        # Use the docstring summary directly
        summary = doc_metadata.summary or doc_metadata.raw_text[:200]

        # Generate keywords with the LLM
        keywords = self._generate_keywords_only(summary, symbol.name)

        # Infer the purpose from the docstring
        purpose = self._infer_purpose_from_docstring(doc_metadata)

        return SemanticMetadata(
            summary=summary,
            keywords=keywords,
            purpose=purpose,
            file_path=symbol.file_path if hasattr(symbol, 'file_path') else None,
            symbol_name=symbol.name,
            llm_tool="hybrid_docstring_primary",
        )

    def _refine_with_llm(
        self,
        file_data: FileData,
        symbol: Symbol,
        doc_metadata: DocstringMetadata
    ) -> SemanticMetadata:
        """Strategy 2: let the LLM refine and enhance the docstring."""
        prompt = f"""
PURPOSE: Refine and enhance an existing docstring for better semantic search
TASK:
- Review the existing docstring
- Generate a concise summary (1-2 sentences) that captures the core purpose
- Extract 8-12 relevant keywords for search
- Identify the functional category/purpose

EXISTING DOCSTRING:
{doc_metadata.raw_text}

CODE CONTEXT:
Function: {symbol.name}
```{file_data.language}
{self._get_symbol_code(file_data.content, symbol)}
```

OUTPUT: JSON format
{{
  "summary": "refined summary based on docstring and code",
  "keywords": ["keyword1", "keyword2", ...],
  "purpose": "category"
}}
"""

        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
        if response['success']:
            data = json.loads(self.llm_enhancer._extract_json(response['stdout']))
            return SemanticMetadata(
                summary=data.get('summary', doc_metadata.summary),
                keywords=data.get('keywords', []),
                purpose=data.get('purpose', 'unknown'),
                file_path=file_data.path,
                symbol_name=symbol.name,
                llm_tool="hybrid_llm_refined",
            )

        # Fallback: use the docstring directly
        return self._use_docstring_with_llm_keywords(symbol, doc_metadata)

    def _full_llm_generation(
        self,
        file_data: FileData,
        symbol: Symbol
    ) -> SemanticMetadata:
        """Strategy 3: full LLM generation (the existing pipeline)."""
        # Reuse the existing LLM enhancer
        code_snippet = self._get_symbol_code(file_data.content, symbol)

        results = self.llm_enhancer.enhance_files([
            FileData(
                path=f"{file_data.path}:{symbol.name}",
                content=code_snippet,
                language=file_data.language
            )
        ])

        return results.get(f"{file_data.path}:{symbol.name}", SemanticMetadata(
            summary="",
            keywords=[],
            purpose="unknown",
            file_path=file_data.path,
            symbol_name=symbol.name,
            llm_tool="hybrid_llm_full",
        ))

    def _generate_keywords_only(self, summary: str, symbol_name: str) -> List[str]:
        """Generate keywords only (a cheap, fast LLM call)."""
        prompt = f"""
PURPOSE: Generate search keywords for a code function
TASK: Extract 5-8 relevant keywords from the summary

Summary: {summary}
Function Name: {symbol_name}

OUTPUT: Comma-separated keywords
"""

        response = self.llm_enhancer._invoke_ccw_cli(prompt, tool='gemini')
        if response['success']:
            keywords_str = response['stdout'].strip()
            return [k.strip() for k in keywords_str.split(',')]

        # Fallback: heuristic keyword extraction from the summary
        return self._extract_keywords_heuristic(summary)

    def _extract_keywords_heuristic(self, text: str) -> List[str]:
        """Heuristic keyword extraction (no LLM required)."""
        # Naive implementation: pull out word-like tokens
        words = re.findall(r'\b[a-z]{4,}\b', text.lower())

        # Filter common stopwords
        stopwords = {'this', 'that', 'with', 'from', 'have', 'will', 'your', 'their'}
        keywords = [w for w in words if w not in stopwords]

        return list(set(keywords))[:8]

    def _infer_purpose_from_docstring(self, doc_metadata: DocstringMetadata) -> str:
        """Infer the purpose from the docstring (no LLM required)."""
        summary = doc_metadata.summary.lower()

        # Simple rule matching
        if 'authenticate' in summary or 'login' in summary:
            return 'auth'
        elif 'validate' in summary or 'check' in summary:
            return 'validation'
        elif 'parse' in summary or 'format' in summary:
            return 'data_processing'
        elif 'api' in summary or 'endpoint' in summary:
            return 'api'
        elif 'database' in summary or 'query' in summary:
            return 'data'
        elif 'test' in summary:
            return 'test'

        return 'util'

    def _get_symbol_code(self, content: str, symbol: Symbol) -> str:
        """Extract the symbol's source code."""
        lines = content.splitlines()
        start, end = symbol.range
        return '\n'.join(lines[start-1:end])
````
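
A minimal wiring sketch, assuming the `LLMEnhancer`, `LLMConfig`, and `FileData` types used elsewhere in this design, and given some `source_code` plus previously extracted `symbols`:

```python
# Hypothetical wiring; types and variables here are this design's own
# assumptions, not a published API.
extractor = DocstringExtractor()
enhancer = HybridEnhancer(LLMEnhancer(LLMConfig(enabled=True)), extractor)

file_data = FileData(path='auth.py', content=source_code, language='python')
metadata_by_symbol = enhancer.enhance_with_strategy(file_data, symbols)

for name, meta in metadata_by_symbol.items():
    print(name, meta.llm_tool, meta.summary[:60])
```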

### 3.3 Cost-Optimization Statistics

```python
from dataclasses import dataclass


@dataclass
class EnhancementStats:
    """Enhancement statistics."""
    total_symbols: int = 0
    used_docstring_only: int = 0         # docstring used as-is
    llm_keywords_only: int = 0           # LLM generated keywords only
    llm_refined: int = 0                 # LLM refined an existing docstring
    llm_full_generation: int = 0         # LLM generated everything
    total_llm_calls: int = 0
    estimated_cost_savings: float = 0.0  # savings versus a pure-LLM baseline


class CostOptimizedEnhancer(HybridEnhancer):
    """Enhancer with cost accounting."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = EnhancementStats()

    def enhance_with_strategy(
        self,
        file_data: FileData,
        symbols: List[Symbol]
    ) -> Dict[str, SemanticMetadata]:
        """Enhance and track cost."""
        self.stats.total_symbols += len(symbols)
        results = super().enhance_with_strategy(file_data, symbols)

        # Count how often each strategy was used
        for metadata in results.values():
            if metadata.llm_tool == "hybrid_docstring_primary":
                self.stats.used_docstring_only += 1
                self.stats.llm_keywords_only += 1
                self.stats.total_llm_calls += 1
            elif metadata.llm_tool == "hybrid_llm_refined":
                self.stats.llm_refined += 1
                self.stats.total_llm_calls += 1
            elif metadata.llm_tool == "hybrid_llm_full":
                self.stats.llm_full_generation += 1
                self.stats.total_llm_calls += 1

        # Estimate savings (assumption: a keywords-only call costs 20% of a
        # full call, so each one saves 0.8 units against a baseline of 1 unit
        # per symbol under pure-LLM mode)
        keywords_only_savings = self.stats.llm_keywords_only * 0.8
        self.stats.estimated_cost_savings = (
            keywords_only_savings / self.stats.total_symbols
            if self.stats.total_symbols > 0 else 0.0
        )

        return results

    def print_stats(self):
        """Print the collected statistics."""
        print("=== Enhancement Statistics ===")
        print(f"Total Symbols: {self.stats.total_symbols}")
        print(f"Used Docstring (with LLM keywords): {self.stats.used_docstring_only} ({self.stats.used_docstring_only/self.stats.total_symbols*100:.1f}%)")
        print(f"LLM Refined Docstring: {self.stats.llm_refined} ({self.stats.llm_refined/self.stats.total_symbols*100:.1f}%)")
        print(f"LLM Full Generation: {self.stats.llm_full_generation} ({self.stats.llm_full_generation/self.stats.total_symbols*100:.1f}%)")
        print(f"Total LLM Calls: {self.stats.total_llm_calls}")
        print(f"Estimated Cost Savings: {self.stats.estimated_cost_savings*100:.1f}%")
```

## 4. Configuration Options

```python
from dataclasses import dataclass


@dataclass
class HybridEnhancementConfig:
    """Hybrid-enhancement configuration."""

    # Enable the hybrid strategy (False falls back to pure-LLM mode)
    enable_hybrid: bool = True

    # Quality thresholds
    use_docstring_threshold: DocstringQuality = DocstringQuality.HIGH
    refine_docstring_threshold: DocstringQuality = DocstringQuality.MEDIUM

    # Generate keywords even for high-quality docstrings
    generate_keywords_for_docstring: bool = True

    # LLM settings
    llm_tool: str = "gemini"
    llm_timeout: int = 300000

    # Cost optimization
    batch_size: int = 5             # batch size for LLM calls
    skip_test_files: bool = True    # skip test files (docstrings are usually sparse there)

    # Debugging
    log_strategy_decisions: bool = False  # log each strategy decision
```

## 5. Testing Strategy

### 5.1 Unit Tests

```python
import pytest


class TestDocstringExtractor:
    """Tests for docstring extraction."""

    def test_extract_google_style(self):
        """Google-style docstrings are extracted and parsed."""
        code = '''
def calculate_total(items, discount=0):
    """Calculate total price with optional discount.

    This function processes a list of items and applies
    a discount if specified.

    Args:
        items (list): List of item objects with price attribute.
        discount (float): Discount percentage (0-1). Defaults to 0.

    Returns:
        float: Total price after discount.

    Examples:
        >>> calculate_total([item1, item2], discount=0.1)
        90.0
    """
    total = sum(item.price for item in items)
    return total * (1 - discount)
'''
        extractor = DocstringExtractor()
        symbol = Symbol(name='calculate_total', kind='function', range=(1, 18))
        docstring = extractor.extract_from_code(code, symbol)

        assert docstring is not None
        metadata = extractor.parse_docstring(docstring)

        assert metadata.quality == DocstringQuality.HIGH
        assert 'Calculate total price' in metadata.summary
        assert metadata.parameters is not None
        assert 'items' in metadata.parameters
        assert metadata.returns is not None
        assert metadata.examples is not None

    def test_extract_low_quality_docstring(self):
        """Low-quality docstrings are detected."""
        code = '''
def process():
    """TODO"""
    pass
'''
        extractor = DocstringExtractor()
        symbol = Symbol(name='process', kind='function', range=(1, 3))
        docstring = extractor.extract_from_code(code, symbol)

        metadata = extractor.parse_docstring(docstring)
        assert metadata.quality == DocstringQuality.LOW


class TestHybridEnhancer:
    """Tests for the hybrid enhancer."""

    def test_high_quality_docstring_strategy(self):
        """High-quality docstrings follow the docstring-primary strategy."""
        extractor = DocstringExtractor()
        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
        hybrid = HybridEnhancer(llm_enhancer, extractor)

        # Simulate a high-quality docstring
        doc_metadata = DocstringMetadata(
            raw_text="Validate user credentials against database.",
            quality=DocstringQuality.HIGH,
            summary="Validate user credentials against database."
        )

        symbol = Symbol(name='validate_user', kind='function', range=(1, 10))

        result = hybrid._use_docstring_with_llm_keywords(symbol, doc_metadata)

        # The docstring summary should be used as-is
        assert result.summary == doc_metadata.summary
        # Keywords should be present (LLM- or heuristic-generated)
        assert len(result.keywords) > 0

    def test_cost_optimization(self):
        """Cost-optimization bookkeeping."""
        enhancer = CostOptimizedEnhancer(
            llm_enhancer=LLMEnhancer(LLMConfig(enabled=False)),  # Mock
            docstring_extractor=DocstringExtractor()
        )

        # Simulate processing 10 symbols, 5 with high-quality docstrings.
        # Expectation: 5 keywords-only calls, 5 full LLM generations.
        # Ten calls total, but lower cost (keywords calls are cheaper).

        # A real test would mock the LLM calls.
        pass
```

### 5.2 Integration Tests

```python
class TestHybridEnhancementPipeline:
    """End-to-end hybrid-enhancement pipeline tests."""

    def test_full_pipeline(self):
        """Full flow: code -> docstring extraction -> quality check -> strategy -> enhancement."""
        code = '''
def authenticate_user(username, password):
    """Authenticate user with username and password.

    Args:
        username (str): User's username
        password (str): User's password

    Returns:
        bool: True if authenticated, False otherwise
    """
    # ... implementation
    pass

def helper_func(x):
    # No docstring
    return x * 2
'''

        file_data = FileData(path='auth.py', content=code, language='python')
        symbols = [
            Symbol(name='authenticate_user', kind='function', range=(1, 11)),
            Symbol(name='helper_func', kind='function', range=(13, 15)),
        ]

        extractor = DocstringExtractor()
        llm_enhancer = LLMEnhancer(LLMConfig(enabled=True))
        hybrid = CostOptimizedEnhancer(llm_enhancer, extractor)

        results = hybrid.enhance_with_strategy(file_data, symbols)

        # authenticate_user should use its docstring
        assert results['authenticate_user'].llm_tool == "hybrid_docstring_primary"

        # helper_func should be fully LLM-generated
        assert results['helper_func'].llm_tool == "hybrid_llm_full"

        # Statistics
        assert hybrid.stats.total_symbols == 2
        assert hybrid.stats.used_docstring_only >= 1
        assert hybrid.stats.llm_full_generation >= 1
```

## 6. Implementation Roadmap

### Phase 1: Infrastructure (1 week)
- [x] Design data structures (DocstringMetadata, DocstringQuality)
- [ ] Implement DocstringExtractor (extraction and parsing)
- [ ] Support Python docstrings (Google/NumPy/reStructuredText styles)
- [ ] Unit tests

### Phase 2: Quality Assessment (1 week)
- [ ] Implement the quality-assessment algorithm
- [ ] Tune the heuristic rules
- [ ] Test docstrings of varying quality
- [ ] Adjust threshold parameters

### Phase 3: Hybrid Strategy (1-2 weeks)
- [ ] Implement HybridEnhancer
- [ ] Implement the three strategies (docstring-only, refine, full-llm)
- [ ] Strategy-selection logic
- [ ] Integration tests

### Phase 4: Cost Optimization (1 week)
- [ ] Implement CostOptimizedEnhancer
- [ ] Statistics and monitoring
- [ ] Batch-processing optimization
- [ ] Performance tests

### Phase 5: Multi-Language Support (1-2 weeks)
- [ ] JavaScript/TypeScript JSDoc
- [ ] Java Javadoc
- [ ] Docstring formats for other languages

### Phase 6: Integration and Deployment (1 week)
- [ ] Integrate into the existing llm_enhancer
- [ ] Expose CLI options
- [ ] Configuration-file support
- [ ] Documentation and examples

**Total estimated time**: 6-8 weeks

## 7. Performance and Cost Analysis

### 7.1 Expected Cost Savings

Scenario: analyzing 1000 functions.

| Docstring quality distribution | Share | LLM strategy | Relative cost |
|--------------------------------|-------|--------------|---------------|
| High (detailed docstring)      | 30%   | Keywords only | 20% |
| Medium (basic docstring)       | 40%   | Refine and enhance | 60% |
| Low/Missing                    | 30%   | Full generation | 100% |

**Total cost**:
- Pure-LLM mode: 1000 * 100% = 1000 units
- Hybrid mode: 300*20% + 400*60% + 300*100% = 60 + 240 + 300 = 600 units
- **Savings**: 40%

### 7.2 Quality Comparison

| Metric | Pure-LLM mode | Hybrid mode |
|--------|---------------|-------------|
| Accuracy | Medium (may hallucinate) | **High** (docstrings are authoritative) |
| Consistency | Medium (prompt-dependent) | **High** (preserves author style) |
| Coverage | **High** (everything) | High (98%+) |
| Cost | High | **Low** (40% savings) |
| Speed | Slow (all files) | **Fast** (fewer LLM calls) |

## 8. Potential Problems and Solutions

### 8.1 Problem: Stale Docstrings

**Symptom**: the code has changed but the docstring has not, so the documented behavior is inaccurate.

**Solution**:
```python
class DocstringFreshnessChecker:
    """Check consistency between a docstring and the code it documents."""

    def check_freshness(
        self,
        symbol: Symbol,
        code: str,
        doc_metadata: DocstringMetadata
    ) -> bool:
        """Return True if the docstring still matches the code."""
        # Check 1: does the parameter list match?
        if doc_metadata.parameters:
            actual_params = self._extract_actual_parameters(code)
            documented_params = set(doc_metadata.parameters.keys())

            if actual_params != documented_params:
                logger.warning(
                    f"Parameter mismatch in {symbol.name}: "
                    f"code has {actual_params}, doc has {documented_params}"
                )
                return False

        # Check 2: verify consistency with an LLM
        # TODO: build the verification prompt

        return True
```
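
The design leaves `_extract_actual_parameters` undefined. For Python code, one possible sketch uses the standard-library `ast` module; how `*args`/`**kwargs` should count as "parameters" is an assumption here:

```python
import ast

def _extract_actual_parameters(self, code: str) -> set:
    """Sketch: collect parameter names of the first function definition
    found in `code` using the standard-library ast module."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
            if args.vararg:
                names.append(args.vararg.arg)
            if args.kwarg:
                names.append(args.kwarg.arg)
            # self/cls are rarely documented, so exclude them
            return {n for n in names if n not in ('self', 'cls')}
    return set()
```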

### 8.2 Problem: Mixed Docstring Styles

**Symptom**: a single project uses several docstring styles (Google, NumPy, custom).

**Solution**:
```python
class MultiStyleDocstringParser:
    """Parser that supports multiple docstring styles."""

    def parse(self, docstring: str) -> DocstringMetadata:
        """Auto-detect the style and parse accordingly."""
        # Try each parser in turn
        for parser in [
            GoogleStyleParser(),
            NumpyStyleParser(),
            ReStructuredTextParser(),
            SimpleParser(),  # Fallback
        ]:
            try:
                metadata = parser.parse(docstring)
                if metadata.quality != DocstringQuality.LOW:
                    return metadata
            except Exception:
                continue

        # If every parser fails, return the simple parser's result
        return SimpleParser().parse(docstring)
```

### 8.3 Problem: Per-Language Extraction Differences

**Symptom**: docstring formats and positions differ across languages.

**Solution**:
```python
class LanguageSpecificExtractor:
    """Language-specific docstring extraction."""

    def extract(self, language: str, code: str, symbol: Symbol) -> Optional[str]:
        """Pick the right extractor for the language."""
        extractors = {
            'python': PythonDocstringExtractor(),
            'javascript': JSDocExtractor(),
            'typescript': TSDocExtractor(),
            'java': JavadocExtractor(),
        }

        extractor = extractors.get(language, GenericExtractor())
        return extractor.extract(code, symbol)


class JSDocExtractor:
    """JSDoc extractor for JavaScript/TypeScript."""

    def extract(self, code: str, symbol: Symbol) -> Optional[str]:
        """Extract a JSDoc comment."""
        lines = code.splitlines()
        start_line = symbol.range[0] - 1

        # Scan upward for a /** ... */ comment
        for i in range(start_line - 1, max(0, start_line - 20), -1):
            if '*/' in lines[i]:
                # Found the closing marker; extract upward from here
                return self._extract_jsdoc_block(lines, i)

        return None
```
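
`_extract_jsdoc_block` is left undefined above. A possible sketch walks upward from the closing `*/` to the opening `/**`, then strips the comment decoration (the exact normalization is an assumption):

```python
from typing import List, Optional

def _extract_jsdoc_block(self, lines: List[str], end_idx: int) -> Optional[str]:
    """Sketch: recover the /** ... */ block ending at end_idx and strip
    the comment markers and leading '*' decoration."""
    start_idx = None
    for i in range(end_idx, -1, -1):
        if '/**' in lines[i]:
            start_idx = i
            break
    if start_idx is None:
        return None

    block = []
    for line in lines[start_idx:end_idx + 1]:
        text = line.strip()
        text = text.removeprefix('/**').removesuffix('*/').strip()
        text = text.lstrip('*').strip()
        if text:
            block.append(text)
    return '\n'.join(block) or None
```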

## 9. Configuration Examples

### 9.1 Configuration File

```yaml
# .codexlens/hybrid_enhancement.yaml

hybrid_enhancement:
  enabled: true

  # Quality thresholds
  quality_thresholds:
    use_docstring: high       # high/medium/low
    refine_docstring: medium

  # LLM options
  llm:
    tool: gemini
    fallback: qwen
    timeout_ms: 300000
    batch_size: 5

  # Cost optimization
  cost_optimization:
    generate_keywords_for_docstring: true
    skip_test_files: true
    skip_private_methods: false

  # Language support
  languages:
    python:
      styles: [google, numpy, sphinx]
    javascript:
      styles: [jsdoc]
    java:
      styles: [javadoc]

  # Monitoring
  logging:
    log_strategy_decisions: false
    log_cost_savings: true
```
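
A minimal sketch of loading this file into the `HybridEnhancementConfig` from Section 4; PyYAML availability, the enum member names, and the key mapping are all assumptions:

```python
# Sketch only: assumes PyYAML is installed and that keys map one-to-one
# onto HybridEnhancementConfig fields as shown in Section 4.
import yaml

def load_hybrid_config(path: str) -> HybridEnhancementConfig:
    with open(path, encoding='utf-8') as fh:
        raw = yaml.safe_load(fh)['hybrid_enhancement']

    thresholds = raw.get('quality_thresholds', {})
    llm = raw.get('llm', {})
    cost = raw.get('cost_optimization', {})

    return HybridEnhancementConfig(
        enable_hybrid=raw.get('enabled', True),
        # Assumes DocstringQuality members are named HIGH/MEDIUM/LOW
        use_docstring_threshold=DocstringQuality[thresholds.get('use_docstring', 'high').upper()],
        refine_docstring_threshold=DocstringQuality[thresholds.get('refine_docstring', 'medium').upper()],
        generate_keywords_for_docstring=cost.get('generate_keywords_for_docstring', True),
        llm_tool=llm.get('tool', 'gemini'),
        llm_timeout=llm.get('timeout_ms', 300000),
        batch_size=llm.get('batch_size', 5),
        skip_test_files=cost.get('skip_test_files', True),
    )
```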

### 9.2 CLI Usage

```bash
# Enhance with the hybrid strategy
codex-lens enhance . --hybrid --tool gemini

# Show cost statistics
codex-lens enhance . --hybrid --show-stats

# Only generate keywords for high-quality docstrings
codex-lens enhance . --hybrid --keywords-only

# Disable hybrid mode and fall back to pure LLM
codex-lens enhance . --no-hybrid --tool gemini
```

## 10. Success Metrics

1. **Cost savings**: 40%+ lower API-call cost compared with pure-LLM mode
2. **Accuracy**: >95% metadata accuracy for symbols enhanced from docstrings
3. **Coverage**: 98%+ of symbols carry semantic metadata (docstring-derived or LLM-generated)
4. **Speed**: 30%+ faster overall processing (fewer LLM calls)
5. **Developer satisfaction**: docstring content is preserved, so authors recognize their own documentation

## 11. References

- [PEP 257 - Docstring Conventions](https://peps.python.org/pep-0257/)
- [Google Python Style Guide - Docstrings](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings)
- [NumPy Docstring Standard](https://numpydoc.readthedocs.io/en/latest/format.html)
- [JSDoc Documentation](https://jsdoc.app/)
- [Javadoc Tool](https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html)

---

**File: codex-lens/docs/MULTILEVEL_CHUNKER_DESIGN.md** (new file, 973 lines)

# Multi-Level Chunker Design

## 1. Background and Goals

### 1.1 Current Problems

The two chunking strategies in the current `chunker.py` have clear weaknesses:

**symbol-based strategy**:
- ✅ Pros: preserves logical integrity; each chunk is a complete function/class
- ❌ Cons: uneven granularity; very large functions can run to hundreds of lines, hurting LLM processing and search precision

**sliding-window strategy**:
- ✅ Pros: uniform chunk sizes, complete coverage
- ❌ Cons: breaks logical structure; complete loop/conditional blocks may be cut in half

### 1.2 Design Goals

Implement a multi-level chunker that satisfies all of the following:
1. **Semantic integrity**: respect logical code boundaries
2. **Controllable granularity**: flexible splitting from coarse (function level) to fine (logic-block level)
3. **Hierarchy**: preserve parent-child relationships between chunks to support contextual retrieval
4. **Efficient indexing**: optimized embedding and retrieval performance
## 2. Technical Architecture

### 2.1 Two-Layer Chunking Architecture

```
Source Code
    ↓
[Layer 1: Symbol-Level Chunking]   ← tree-sitter AST
    ↓
MacroChunks (Functions/Classes)
    ↓
[Layer 2: Logic-Block Chunking]    ← deep AST traversal
    ↓
MicroChunks (Loops/Conditionals/Blocks)
    ↓
Vector Embedding + Indexing
```

### 2.2 Core Components

```python
from dataclasses import dataclass, field
from typing import List, Optional

# New data structures
@dataclass
class ChunkMetadata:
    """Chunk metadata."""
    chunk_id: str
    parent_id: Optional[str]               # Parent chunk ID
    level: int                             # Level: 1=macro, 2=micro
    chunk_type: str                        # function/class/loop/conditional/try_except
    file_path: str
    start_line: int
    end_line: int
    symbol_name: Optional[str]
    context_summary: Optional[str] = None  # Context inherited from the parent chunk

@dataclass
class HierarchicalChunk:
    """A hierarchical code chunk."""
    metadata: ChunkMetadata
    content: str
    embedding: Optional[List[float]] = None
    children: List['HierarchicalChunk'] = field(default_factory=list)
```

## 3. Detailed Implementation Steps

### 3.1 Layer 1: Symbol-Level Chunking (Macro-Chunking)

**Approach**: reuse the existing `code_extractor.py` logic, with richer metadata extraction.

```python
from tree_sitter import Parser


class MacroChunker:
    """Layer-1 chunker: extract top-level symbols."""

    def __init__(self):
        self.parser = Parser()
        # Load the language grammar here

    def chunk_by_symbols(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Extract top-level function and class definitions."""
        tree = self.parser.parse(bytes(content, 'utf-8'))
        root_node = tree.root_node

        chunks = []
        for node in root_node.children:
            if node.type in ['function_definition', 'class_definition',
                             'method_definition']:
                chunk = self._create_macro_chunk(node, content, file_path)
                chunks.append(chunk)

        return chunks

    def _create_macro_chunk(
        self,
        node,
        content: str,
        file_path: str
    ) -> HierarchicalChunk:
        """Create a macro chunk from an AST node."""
        start_line = node.start_point[0] + 1
        end_line = node.end_point[0] + 1

        # Extract the symbol name
        name_node = node.child_by_field_name('name')
        symbol_name = content[name_node.start_byte:name_node.end_byte]

        # Extract the full code (including docstring and decorators)
        chunk_content = self._extract_with_context(node, content)

        metadata = ChunkMetadata(
            chunk_id=f"{file_path}:{start_line}",
            parent_id=None,
            level=1,
            chunk_type=node.type,
            file_path=file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=symbol_name,
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )

    def _extract_with_context(self, node, content: str) -> str:
        """Extract code including any decorators above the definition."""
        # Walk upward over decorator siblings
        start_byte = node.start_byte
        prev_sibling = node.prev_sibling
        while prev_sibling and prev_sibling.type == 'decorator':
            start_byte = prev_sibling.start_byte
            prev_sibling = prev_sibling.prev_sibling

        return content[start_byte:node.end_byte]
```
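
The grammar-loading step is only hinted at above. One possible sketch uses the `tree-sitter` Python bindings plus the `tree-sitter-languages` convenience package; the package choice, and the `set_language` API (which has changed across binding versions), are assumptions, and codex-lens may wire grammars differently:

```python
# Sketch only: assumes the tree-sitter and tree-sitter-languages packages.
from tree_sitter import Parser
from tree_sitter_languages import get_language

def make_parser(language: str) -> Parser:
    parser = Parser()
    parser.set_language(get_language(language))  # e.g. 'python'
    return parser
```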

### 3.2 Layer 2: Logic-Block Chunking (Micro-Chunking)

**Approach**: within each macro chunk, split further along logical structure.

```python
class MicroChunker:
    """Layer-2 chunker: extract logic blocks."""

    # Logic-block node types worth splitting out
    LOGIC_BLOCK_TYPES = {
        'for_statement',
        'while_statement',
        'if_statement',
        'try_statement',
        'with_statement',
    }

    def chunk_logic_blocks(
        self,
        macro_chunk: HierarchicalChunk,
        content: str,
        max_lines: int = 50  # only macro chunks longer than this get re-split
    ) -> List[HierarchicalChunk]:
        """Extract logic blocks inside a macro chunk."""
        # Small functions need no second pass
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line
        if total_lines <= max_lines:
            return []

        tree = self.parser.parse(bytes(macro_chunk.content, 'utf-8'))
        root_node = tree.root_node

        micro_chunks = []
        self._traverse_logic_blocks(
            root_node,
            macro_chunk,
            content,
            micro_chunks
        )

        return micro_chunks

    def _traverse_logic_blocks(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str,
        result: List[HierarchicalChunk]
    ):
        """Recursively traverse the AST and extract logic blocks."""
        if node.type in self.LOGIC_BLOCK_TYPES:
            micro_chunk = self._create_micro_chunk(
                node,
                parent_chunk,
                content
            )
            result.append(micro_chunk)
            parent_chunk.children.append(micro_chunk)

        # Keep traversing child nodes
        for child in node.children:
            self._traverse_logic_blocks(child, parent_chunk, content, result)

    def _create_micro_chunk(
        self,
        node,
        parent_chunk: HierarchicalChunk,
        content: str
    ) -> HierarchicalChunk:
        """Create a micro chunk."""
        # Convert to file-relative line numbers
        start_line = parent_chunk.metadata.start_line + node.start_point[0]
        end_line = parent_chunk.metadata.start_line + node.end_point[0]

        # Byte offsets are relative to the parsed macro-chunk text,
        # so slice the parent content, not the whole file
        chunk_content = parent_chunk.content[node.start_byte:node.end_byte]

        metadata = ChunkMetadata(
            chunk_id=f"{parent_chunk.metadata.chunk_id}:L{start_line}",
            parent_id=parent_chunk.metadata.chunk_id,
            level=2,
            chunk_type=node.type,
            file_path=parent_chunk.metadata.file_path,
            start_line=start_line,
            end_line=end_line,
            symbol_name=parent_chunk.metadata.symbol_name,  # inherit the parent symbol name
            context_summary=None,  # filled in later by the LLM
        )

        return HierarchicalChunk(
            metadata=metadata,
            content=chunk_content,
        )
```

### 3.3 Unified Interface: the Multi-Level Chunker

```python
class HierarchicalChunker:
    """Unified multi-level chunker interface."""

    def __init__(self, config: ChunkConfig = None):
        self.config = config or ChunkConfig()
        self.macro_chunker = MacroChunker()
        self.micro_chunker = MicroChunker()

    def chunk_file(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Chunk a file at multiple levels."""
        # Layer 1: symbol-level chunking
        macro_chunks = self.macro_chunker.chunk_by_symbols(
            content, file_path, language
        )

        # Layer 2: logic-block chunking
        all_chunks = []
        for macro_chunk in macro_chunks:
            all_chunks.append(macro_chunk)

            # Re-split large functions
            micro_chunks = self.micro_chunker.chunk_logic_blocks(
                macro_chunk, content
            )
            all_chunks.extend(micro_chunks)

        return all_chunks

    def chunk_file_with_fallback(
        self,
        content: str,
        file_path: str,
        language: str
    ) -> List[HierarchicalChunk]:
        """Chunking with a fallback strategy."""
        try:
            return self.chunk_file(content, file_path, language)
        except Exception as e:
            logger.warning(f"Hierarchical chunking failed: {e}, falling back to sliding window")
            # Fall back to the sliding-window strategy
            return self._fallback_sliding_window(content, file_path, language)
```
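
A minimal end-to-end usage sketch; the file path and print format are invented for illustration:

```python
# Illustration only; 'sample.py' is a hypothetical input file.
source = open('sample.py', encoding='utf-8').read()

chunker = HierarchicalChunker()
chunks = chunker.chunk_file_with_fallback(source, 'sample.py', 'python')

for chunk in chunks:
    indent = '  ' * (chunk.metadata.level - 1)
    print(f"{indent}[L{chunk.metadata.level}] {chunk.metadata.chunk_type} "
          f"{chunk.metadata.start_line}-{chunk.metadata.end_line}")
```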

## 4. Data Storage Design

### 4.1 Database Schema

```sql
-- chunks table: stores chunks at every level
CREATE TABLE chunks (
    chunk_id TEXT PRIMARY KEY,
    parent_id TEXT,              -- parent chunk ID; NULL means top level
    level INTEGER NOT NULL,      -- 1=macro, 2=micro
    chunk_type TEXT NOT NULL,    -- function/class/loop/if/try, etc.
    file_path TEXT NOT NULL,
    start_line INTEGER NOT NULL,
    end_line INTEGER NOT NULL,
    symbol_name TEXT,
    content TEXT NOT NULL,
    content_hash TEXT,           -- used to detect content changes

    -- Semantic metadata (LLM-generated)
    summary TEXT,
    keywords TEXT,               -- JSON array
    purpose TEXT,

    -- Vector embedding
    embedding BLOB,              -- serialized vector

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (parent_id) REFERENCES chunks(chunk_id) ON DELETE CASCADE
);

-- Index optimization
CREATE INDEX idx_chunks_file_path ON chunks(file_path);
CREATE INDEX idx_chunks_parent_id ON chunks(parent_id);
CREATE INDEX idx_chunks_level ON chunks(level);
CREATE INDEX idx_chunks_symbol_name ON chunks(symbol_name);
```
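
Given this schema, a search result's parent context is one self-join away. A minimal sketch (the database and file paths are invented):

```python
import sqlite3

# Sketch: fetch each level-2 chunk joined to its parent macro chunk, so
# search results can display the enclosing function's summary.
conn = sqlite3.connect('index.db')  # hypothetical database path
rows = conn.execute("""
    SELECT c.chunk_id, c.chunk_type, c.start_line, c.end_line,
           p.symbol_name AS parent_symbol, p.summary AS parent_summary
    FROM chunks AS c
    JOIN chunks AS p ON p.chunk_id = c.parent_id
    WHERE c.level = 2 AND c.file_path = ?
""", ('src/codexlens/semantic/chunker.py',)).fetchall()

for chunk_id, chunk_type, start, end, parent_symbol, parent_summary in rows:
    print(f"{chunk_id} ({chunk_type}) lines {start}-{end} in {parent_symbol}")
```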

### 4.2 Vector Indexing

A layered indexing strategy is used:

```python
import sqlite3
from pathlib import Path
from typing import Dict, List, Optional, Tuple


class HierarchicalVectorStore:
    """Hierarchical vector store."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)

    def add_chunk(self, chunk: HierarchicalChunk):
        """Insert a chunk and its embedding."""
        cursor = self.conn.cursor()
        cursor.execute("""
            INSERT INTO chunks (
                chunk_id, parent_id, level, chunk_type,
                file_path, start_line, end_line, symbol_name,
                content, embedding
            ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            chunk.metadata.chunk_id,
            chunk.metadata.parent_id,
            chunk.metadata.level,
            chunk.metadata.chunk_type,
            chunk.metadata.file_path,
            chunk.metadata.start_line,
            chunk.metadata.end_line,
            chunk.metadata.symbol_name,
            chunk.content,
            self._serialize_embedding(chunk.embedding),
        ))

        self.conn.commit()

    def search_hierarchical(
        self,
        query_embedding: List[float],
        top_k: int = 10,
        level_weights: Dict[int, float] = None
    ) -> List[Tuple[HierarchicalChunk, float]]:
        """Hierarchical retrieval."""
        # Default weights: macro chunks score higher
        if level_weights is None:
            level_weights = {1: 1.0, 2: 0.8}

        # Scan every chunk that has an embedding
        cursor = self.conn.cursor()
        cursor.execute("SELECT * FROM chunks WHERE embedding IS NOT NULL")

        results = []
        for row in cursor.fetchall():
            chunk = self._row_to_chunk(row)
            similarity = self._cosine_similarity(
                query_embedding,
                chunk.embedding
            )

            # Apply the per-level weight
            weighted_score = similarity * level_weights.get(chunk.metadata.level, 1.0)
            results.append((chunk, weighted_score))

        # Sort by score
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

    def get_chunk_with_context(
        self,
        chunk_id: str
    ) -> Tuple[HierarchicalChunk, Optional[HierarchicalChunk]]:
        """Fetch a chunk plus its parent chunk (for context)."""
        cursor = self.conn.cursor()

        # Fetch the chunk itself
        cursor.execute("SELECT * FROM chunks WHERE chunk_id = ?", (chunk_id,))
        chunk_row = cursor.fetchone()
        chunk = self._row_to_chunk(chunk_row)

        # Fetch the parent chunk
        parent = None
        if chunk.metadata.parent_id:
            cursor.execute(
                "SELECT * FROM chunks WHERE chunk_id = ?",
                (chunk.metadata.parent_id,)
            )
            parent_row = cursor.fetchone()
            if parent_row:
                parent = self._row_to_chunk(parent_row)

        return chunk, parent
```
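
`_serialize_embedding`, `_row_to_chunk`, and `_cosine_similarity` are left undefined above. A possible sketch for the serialization and similarity helpers; plain `struct` packing and pure-Python cosine are assumptions, and a real deployment might use numpy or a dedicated vector database instead:

```python
import math
import struct
from typing import List, Optional

def _serialize_embedding(self, embedding: Optional[List[float]]) -> Optional[bytes]:
    """Sketch: pack floats as little-endian float32 for the BLOB column."""
    if embedding is None:
        return None
    return struct.pack(f'<{len(embedding)}f', *embedding)

def _deserialize_embedding(self, blob: bytes) -> List[float]:
    """Inverse of _serialize_embedding (4 bytes per float32)."""
    return list(struct.unpack(f'<{len(blob) // 4}f', blob))

def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
    """Standard cosine similarity; returns 0.0 for zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```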

## 5. LLM Integration Strategy

### 5.1 Layered Generation of Semantic Metadata

````python
class HierarchicalLLMEnhancer:
    """Generate semantic metadata for hierarchical chunks."""

    def enhance_hierarchical_chunks(
        self,
        chunks: List[HierarchicalChunk]
    ) -> Dict[str, SemanticMetadata]:
        """
        Layered processing strategy:
        1. Process all level=1 macro chunks first and generate detailed summaries
        2. Then process level=2 micro chunks, using each parent chunk's summary as context
        """
        results = {}

        # Pass 1: macro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        macro_metadata = self.llm_enhancer.enhance_files([
            FileData(
                path=c.metadata.chunk_id,
                content=c.content,
                language=self._detect_language(c.metadata.file_path)
            )
            for c in macro_chunks
        ])
        results.update(macro_metadata)

        # Pass 2: micro chunks (with parent context)
        micro_chunks = [c for c in chunks if c.metadata.level == 2]
        for micro_chunk in micro_chunks:
            parent_id = micro_chunk.metadata.parent_id
            parent_meta = macro_metadata.get(parent_id)
            parent_summary = parent_meta.summary if parent_meta else ''

            # Build a prompt that carries the parent context
            enhanced_prompt = f"""
Parent Function: {micro_chunk.metadata.symbol_name}
Parent Summary: {parent_summary}

Code Block ({micro_chunk.metadata.chunk_type}):
```
{micro_chunk.content}
```

Generate a concise summary (1 sentence) and keywords for this specific code block.
"""

            metadata = self._call_llm_with_context(enhanced_prompt)
            results[micro_chunk.metadata.chunk_id] = metadata

        return results
````

### 5.2 Prompt Optimization

Different prompt templates are used for each level:

**Macro Chunk Prompt (Level 1)**:
````
PURPOSE: Generate comprehensive semantic metadata for a complete function/class
TASK:
- Provide a detailed summary (2-3 sentences) covering what the code does and why
- Extract 8-12 relevant keywords including technical terms and domain concepts
- Identify the primary purpose/category
MODE: analysis

CODE:
```{language}
{content}
```

OUTPUT: JSON with summary, keywords, purpose
````

**Micro Chunk Prompt (Level 2)**:
````
PURPOSE: Summarize a specific logic block within a larger function
CONTEXT:
- Parent Function: {symbol_name}
- Parent Purpose: {parent_summary}

TASK:
- Provide a brief summary (1 sentence) of this specific block's role in the parent function
- Extract 3-5 keywords specific to this block's logic
MODE: analysis

CODE BLOCK ({chunk_type}):
```{language}
{content}
```

OUTPUT: JSON with summary, keywords
````

## 6. Retrieval Enhancement

### 6.1 Context-Expanding Retrieval

```python
class ContextualSearchEngine:
    """Retrieval engine with context expansion."""

    def search_with_context(
        self,
        query: str,
        top_k: int = 10,
        expand_context: bool = True
    ) -> List[SearchResult]:
        """
        Search and automatically expand context.

        If a micro chunk matches, its parent macro chunk is attached
        as context.
        """
        # Embed the query
        query_embedding = self.embedder.embed_single(query)

        # Hierarchical retrieval
        raw_results = self.vector_store.search_hierarchical(
            query_embedding,
            top_k=top_k
        )

        # Expand context
        enriched_results = []
        for chunk, score in raw_results:
            result = SearchResult(
                path=chunk.metadata.file_path,
                score=score,
                content=chunk.content,
                start_line=chunk.metadata.start_line,
                end_line=chunk.metadata.end_line,
                symbol_name=chunk.metadata.symbol_name,
            )

            # For a micro chunk, fetch its parent chunk as context
            # (get_chunk_with_context returns (chunk, parent))
            if expand_context and chunk.metadata.level == 2:
                _, parent_chunk = self.vector_store.get_chunk_with_context(
                    chunk.metadata.chunk_id
                )
                if parent_chunk:
                    result.metadata['parent_context'] = {
                        'summary': parent_chunk.metadata.context_summary,
                        'symbol_name': parent_chunk.metadata.symbol_name,
                        'content': parent_chunk.content,
                    }

            enriched_results.append(result)

        return enriched_results
```
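
A short usage sketch; the wiring follows the classes above, and the query string is illustrative:

```python
# Illustration; vector_store and embedder follow the classes sketched above.
engine = ContextualSearchEngine(vector_store, embedder)

results = engine.search_with_context("retry logic around network calls", top_k=5)
for r in results:
    parent = r.metadata.get('parent_context')
    context = f" (inside {parent['symbol_name']})" if parent else ""
    print(f"{r.path}:{r.start_line}-{r.end_line}{context}")
```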

## 7. Testing Strategy

### 7.1 Unit Tests

```python
import pytest
from codexlens.semantic.hierarchical_chunker import (
    HierarchicalChunker, MacroChunker, MicroChunker
)


class TestMacroChunker:
    """Layer-1 chunking tests."""

    def test_extract_functions(self):
        """Function definitions are extracted."""
        code = '''
def calculate_total(items):
    """Calculate total price."""
    total = 0
    for item in items:
        total += item.price
    return total

def apply_discount(total, discount):
    """Apply discount to total."""
    return total * (1 - discount)
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 2
        assert chunks[0].metadata.symbol_name == 'calculate_total'
        assert chunks[1].metadata.symbol_name == 'apply_discount'
        assert chunks[0].metadata.level == 1

    def test_extract_with_decorators(self):
        """Decorated functions keep their decorators."""
        code = '''
@app.route('/api/users')
@auth_required
def get_users():
    return User.query.all()
'''
        chunker = MacroChunker()
        chunks = chunker.chunk_by_symbols(code, 'test.py', 'python')

        assert len(chunks) == 1
        assert '@app.route' in chunks[0].content
        assert '@auth_required' in chunks[0].content


class TestMicroChunker:
    """Layer-2 chunking tests."""

    def test_extract_loop_blocks(self):
        """Loop blocks are extracted."""
        code = '''
def process_items(items):
    results = []
    for item in items:
        if item.active:
            results.append(process(item))
    return results
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code
        )

        # The for loop and if block should be extracted
        assert len(micro_chunks) >= 1
        assert any(c.metadata.chunk_type == 'for_statement' for c in micro_chunks)

    def test_skip_small_functions(self):
        """Small functions are not re-split."""
        code = '''
def small_func(x):
    return x * 2
'''
        macro_chunker = MacroChunker()
        macro_chunks = macro_chunker.chunk_by_symbols(code, 'test.py', 'python')

        micro_chunker = MicroChunker()
        micro_chunks = micro_chunker.chunk_logic_blocks(
            macro_chunks[0], code, max_lines=10
        )

        # Small functions should be left intact
        assert len(micro_chunks) == 0


class TestHierarchicalChunker:
    """Full multi-level chunking tests."""

    def test_full_hierarchical_chunking(self):
        """Complete hierarchical chunking flow."""
        code = '''
def complex_function(data):
    """A complex function with multiple logic blocks."""

    # Validation
    if not data:
        raise ValueError("Data is empty")

    # Processing
    results = []
    for item in data:
        try:
            processed = process_item(item)
            results.append(processed)
        except Exception as e:
            logger.error(f"Failed to process: {e}")
            continue

    # Aggregation
    total = sum(r.value for r in results)
    return total
'''
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(code, 'test.py', 'python')

        # Expect 1 macro chunk and several micro chunks
        macro_chunks = [c for c in chunks if c.metadata.level == 1]
        micro_chunks = [c for c in chunks if c.metadata.level == 2]

        assert len(macro_chunks) == 1
        assert len(micro_chunks) > 0

        # Verify parent-child relationships
        for micro in micro_chunks:
            assert micro.metadata.parent_id == macro_chunks[0].metadata.chunk_id
```

### 7.2 Integration Tests

```python
class TestHierarchicalIndexing:
    """Full indexing pipeline tests."""

    def test_index_and_search(self):
        """Layered indexing and retrieval."""
        # 1. Chunk
        chunker = HierarchicalChunker()
        chunks = chunker.chunk_file(sample_code, 'sample.py', 'python')

        # 2. LLM enhancement
        enhancer = HierarchicalLLMEnhancer()
        metadata = enhancer.enhance_hierarchical_chunks(chunks)

        # 3. Embedding
        embedder = Embedder()
        for chunk in chunks:
            text = metadata[chunk.metadata.chunk_id].summary
            chunk.embedding = embedder.embed_single(text)

        # 4. Storage
        vector_store = HierarchicalVectorStore(Path('/tmp/test.db'))
        for chunk in chunks:
            vector_store.add_chunk(chunk)

        # 5. Retrieval
        search_engine = ContextualSearchEngine(vector_store, embedder)
        results = search_engine.search_with_context(
            "find loop that processes items",
            top_k=5
        )

        # Verify results
        assert len(results) > 0
        assert any(r.metadata.get('parent_context') for r in results)
```

## 8. Performance Optimization

### 8.1 Batch Processing

```python
class BatchHierarchicalProcessor:
    """Batch hierarchical chunking across many files."""

    def process_files_batch(
        self,
        file_paths: List[Path],
        batch_size: int = 10
    ):
        """Process in batches to optimize LLM calls."""
        all_chunks = []

        # 1. Chunk all files
        for file_path in file_paths:
            content = file_path.read_text()
            chunks = self.chunker.chunk_file(
                content, str(file_path), self._detect_language(file_path)
            )
            all_chunks.extend(chunks)

        # 2. Batch LLM enhancement (fewer API calls)
        macro_chunks = [c for c in all_chunks if c.metadata.level == 1]
        for i in range(0, len(macro_chunks), batch_size):
            batch = macro_chunks[i:i+batch_size]
            self.enhancer.enhance_batch(batch)

        # 3. Batch embedding
        all_texts = [c.content for c in all_chunks]
        embeddings = self.embedder.embed_batch(all_texts)
        for chunk, embedding in zip(all_chunks, embeddings):
            chunk.embedding = embedding

        # 4. Batch storage
        self.vector_store.add_chunks_batch(all_chunks)
```

### 8.2 Incremental Updates

```python
import hashlib


class IncrementalIndexer:
    """Incremental indexer: only reprocess changed files."""

    def update_file(self, file_path: Path):
        """Incrementally update a single file."""
        content = file_path.read_text()
        content_hash = hashlib.sha256(content.encode()).hexdigest()

        # Has the file changed?
        cursor = self.conn.cursor()
        cursor.execute("""
            SELECT content_hash FROM chunks
            WHERE file_path = ? AND level = 1
            LIMIT 1
        """, (str(file_path),))

        row = cursor.fetchone()
        if row and row[0] == content_hash:
            logger.info(f"File {file_path} unchanged, skipping")
            return

        # Delete the stale chunks
        cursor.execute("DELETE FROM chunks WHERE file_path = ?", (str(file_path),))

        # Re-index
        chunks = self.chunker.chunk_file(content, str(file_path), 'python')
        # ... continue with the rest of the pipeline
```

## 9. Potential Problems and Solutions

### 9.1 Problem: Too Many Micro Chunks from Huge Functions

**Symptom**: some legacy functions exceed 1000 lines and may produce dozens of micro chunks.

**Solution**:
```python
class AdaptiveMicroChunker:
    """Adaptive micro-chunking: adjust the strategy to function size."""

    def chunk_logic_blocks(self, macro_chunk, content):
        total_lines = macro_chunk.metadata.end_line - macro_chunk.metadata.start_line

        if total_lines > 500:
            # Huge function: extract only top-level logic blocks, no recursion
            return self._extract_top_level_blocks(macro_chunk, content)
        elif total_lines > 100:
            # Large function: limit recursion depth to 2
            return self._extract_blocks_with_depth_limit(macro_chunk, content, max_depth=2)
        else:
            # Normal function: skip micro chunking entirely
            return []
```

### 9.2 Problem: tree-sitter Parse Failures

**Symptom**: tree-sitter may fail to parse code containing syntax errors.

**Solution**:
```python
def chunk_file_with_fallback(self, content, file_path, language):
    """Chunking with staged fallbacks."""
    try:
        # Try hierarchical chunking
        return self.chunk_file(content, file_path, language)
    except TreeSitterError as e:
        logger.warning(f"Tree-sitter parsing failed: {e}")

        # Fall back to simple regex-based symbol extraction
        return self._fallback_regex_chunking(content, file_path)
    except Exception as e:
        logger.error(f"Chunking failed completely: {e}")

        # Final fallback: sliding window
        return self._fallback_sliding_window(content, file_path, language)
```

### 9.3 Problem: Vector Storage Footprint

**Symptom**: storing a vector for every chunk can consume a lot of space.

**Solutions**:
- **Selective vectorization**: only embed macro chunks and the important micro chunks
- **Vector compression**: reduce dimensionality with PCA or quantization
- **Separate storage**: keep vectors in a dedicated vector database (e.g. Faiss); SQLite stores only metadata

```python
class SelectiveVectorization:
    """Selective vectorization to cut storage overhead."""

    VECTORIZE_CHUNK_TYPES = {
        'function_definition',  # always vectorized
        'class_definition',     # always vectorized
        'for_statement',        # loop blocks
        'try_statement',        # exception handling
        # 'if_statement' is usually not vectorized on its own; rely on the parent chunk
    }

    def should_vectorize(self, chunk: HierarchicalChunk) -> bool:
        """Decide whether a chunk needs its own embedding."""
        # Level 1 is always vectorized
        if chunk.metadata.level == 1:
            return True

        # Level 2 depends on type and size
        if chunk.metadata.chunk_type not in self.VECTORIZE_CHUNK_TYPES:
            return False

        # Very small blocks (<5 lines) are skipped
        lines = chunk.metadata.end_line - chunk.metadata.start_line
        if lines < 5:
            return False

        return True
```

## 10. Implementation Roadmap

### Phase 1: Core Infrastructure (2-3 weeks)
- [x] Design data structures (HierarchicalChunk, ChunkMetadata)
- [ ] Implement MacroChunker (reusing the existing code_extractor)
- [ ] Implement a basic MicroChunker
- [ ] Database schema design and migration
- [ ] Unit tests

### Phase 2: LLM Integration (1-2 weeks)
- [ ] Implement HierarchicalLLMEnhancer
- [ ] Design per-level prompt templates
- [ ] Batch-processing optimization
- [ ] Integration tests

### Phase 3: Embedding and Retrieval (1-2 weeks)
- [ ] Implement HierarchicalVectorStore
- [ ] Implement ContextualSearchEngine
- [ ] Context-expansion logic
- [ ] Retrieval performance tests

### Phase 4: Optimization and Hardening (2 weeks)
- [ ] Performance optimization (batching, incremental updates)
- [ ] Complete the fallback strategies
- [ ] Selective vectorization
- [ ] Comprehensive tests and documentation

### Phase 5: Production Rollout (1 week)
- [ ] CLI integration
- [ ] Expose configuration options
- [ ] Production testing
- [ ] Release

**Total estimated time**: 7-10 weeks
## 11. Success Metrics

1. **Coverage**: 95%+ of code is chunked correctly
2. **Accuracy**: >98% correct parent-child relationships
3. **Retrieval quality**: 30%+ better retrieval relevance than single-level chunking
4. **Performance**: <100ms to chunk a single file; >100 files/minute in batch
5. **Storage efficiency**: 40%+ less space than vectorizing everything
## 12. References

- [Tree-sitter Documentation](https://tree-sitter.github.io/)
- [AST-based Code Analysis](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
- [Hierarchical Text Segmentation](https://arxiv.org/abs/2104.08836)
- Existing code: `src/codexlens/semantic/chunker.py`

---

**File: codex-lens/docs/SEMANTIC_GRAPH_DESIGN.md** (new file, 1113 lines; diff suppressed because it is too large)