feat: Add API indexer and enhance embedding management

- Add new API indexer script for document processing
- Update embedding manager with improved functionality
- Remove old cache files and update dependencies
- Modify workflow execute documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
catlog22
2025-09-23 19:40:22 +08:00
parent 984fa3a4f3
commit 410d0efd7b
8 changed files with 506 additions and 337 deletions


@@ -10,26 +10,80 @@ examples:
# Workflow Execute Command
## Overview
Orchestrates autonomous workflow execution through systematic task discovery, agent coordination, and progress tracking: it discovers plans, checks task statuses, executes ready tasks with complete context, and ensures proper flow control execution.
## Core Responsibilities
- **Session Discovery**: Identify and select active workflow sessions
- **Task Dependency Resolution**: Analyze task relationships and execution order
- **TodoWrite Progress Tracking**: Maintain real-time execution status
- **Agent Orchestration**: Coordinate specialized agents with complete context
- **Flow Control Execution**: Execute pre-analysis steps and context accumulation
- **Status Synchronization**: Update task JSON files and workflow state
## Execution Philosophy
- **Discovery-first**: Auto-discover existing plans and tasks
- **Status-aware**: Execute only ready tasks with resolved dependencies
- **Context-rich**: Provide complete task JSON and accumulated context to agents
- **Progress tracking**: Real-time TodoWrite updates and status synchronization
- **Flow control**: Sequential step execution with variable passing
## Flow Control Execution
**[FLOW_CONTROL]** marker indicates sequential step execution required for context gathering and preparation.
### Flow Control Rules
1. **Auto-trigger**: When `task.flow_control.pre_analysis` array exists in task JSON
2. **Sequential Processing**: Execute steps in order, accumulating context
3. **Variable Passing**: Use `[variable_name]` syntax to reference step outputs
4. **Error Handling**: Follow step-specific error strategies (`fail`, `skip_optional`, `retry_once`)
### Execution Pattern
```
Step 1: load_dependencies → dependency_context
Step 2: analyze_patterns [dependency_context] → pattern_analysis
Step 3: implement_solution [pattern_analysis] [dependency_context] → implementation
```
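The step chain above can be implemented with straightforward string substitution. A minimal sketch, assuming each step runs its `command` as a shell command and carries the `step`, `command`, `output_to`, and `on_error` fields shown in the task JSON later in this document (function and variable names are illustrative):
```python
import re
import subprocess

def run_pre_analysis(steps):
    """Execute pre_analysis steps in order, accumulating outputs by name."""
    context = {}  # step outputs keyed by output_to
    for step in steps:
        # Substitute [variable_name] references with earlier step outputs
        command = re.sub(
            r"\[(\w+)\]",
            lambda m: context.get(m.group(1), m.group(0)),
            step["command"],
        )
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            strategy = step.get("on_error", "fail")
            if strategy == "fail":
                raise RuntimeError(f"Step {step['step']} failed: {result.stderr}")
            if strategy == "retry_once":
                result = subprocess.run(command, shell=True, capture_output=True, text=True)
            # "skip_optional" falls through with whatever output is available
        context[step["output_to"]] = result.stdout.strip()
    return context
```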
### Context Accumulation Process
- **Load Dependencies**: Retrieve summaries from `context.depends_on` tasks
- **Execute Analysis**: Run CLI tools with accumulated context
- **Prepare Implementation**: Build comprehensive context for agent execution
- **Pass to Agent**: Provide all accumulated context for implementation
## Execution Lifecycle
### Phase 1: Discovery
1. **Check Active Sessions**: Find `.workflow/.active-*` markers
2. **Select Session**: If multiple found, prompt user selection
3. **Load Session State**: Read `workflow-session.json` and `IMPL_PLAN.md`
4. **Scan Tasks**: Analyze `.task/*.json` files for ready tasks
### Phase 2: Analysis
1. **Dependency Resolution**: Build execution order based on `depends_on`
2. **Status Validation**: Filter tasks with `status: "pending"` and met dependencies
3. **Agent Assignment**: Determine agent type from `meta.agent` or `meta.type` (see the sketch after this list)
4. **Context Preparation**: Load dependency summaries and inherited context
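Agent assignment (step 3 above) reduces to a lookup. A sketch using the type-to-agent mapping listed later under Agent Assignment Rules; the fallback to code-developer is an assumption, not documented behavior:
```python
TYPE_TO_AGENT = {
    "feature": "code-developer",
    "test": "code-review-test-agent",
    "review": "code-review-agent",
    "docs": "doc-generator",
}

def resolve_agent(task):
    """meta.agent wins; otherwise infer from meta.type."""
    meta = task.get("meta", {})
    if meta.get("agent"):
        return meta["agent"]
    return TYPE_TO_AGENT.get(meta.get("type"), "code-developer")  # assumed default
```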
### Phase 3: Planning
1. **Create TodoWrite List**: Generate task list with status markers
2. **Mark Initial Status**: Set first task as `in_progress`
3. **Prepare Session Context**: Inject workflow paths for agent use
4. **Validate Prerequisites**: Ensure all required context is available
### Phase 4: Execution
1. **Execute Flow Control**: Run `pre_analysis` steps if present
2. **Launch Agent**: Invoke specialized agent with complete context
3. **Monitor Progress**: Track agent execution and handle errors
4. **Collect Results**: Gather implementation results and outputs
### Phase 5: Completion
1. **Update Task Status**: Mark completed tasks in JSON files
2. **Generate Summary**: Create task summary in `.summaries/`
3. **Update TodoWrite**: Mark current task complete, advance to next
4. **Synchronize State**: Update session state and workflow status
## Task Discovery & Queue Building
### Session Discovery Process
```
├── Check for .active-* markers in .workflow/
├── If multiple active sessions found → Prompt user to select
└── Build execution queue of ready tasks from selected session
```
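A sketch of this discovery logic in Python, assuming the `.workflow/.active-*` marker convention described here (helper names are illustrative):
```python
from pathlib import Path

def discover_sessions(workflow_dir=".workflow"):
    """Return session slugs for every .active-* marker found."""
    root = Path(workflow_dir)
    if not root.is_dir():
        return []
    return [p.name.removeprefix(".active-") for p in root.glob(".active-*")]

def select_session(sessions):
    """Auto-select a single session; multiple matches require a user choice."""
    if not sessions:
        raise SystemExit('No active workflow session. Use: /workflow:plan "project"')
    if len(sessions) > 1:
        for i, slug in enumerate(sessions, 1):
            print(f"{i}. {slug}")
        return sessions[int(input("Select session: ")) - 1]
    return sessions[0]
```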
### 2. TodoWrite Coordination
Create comprehensive TodoWrite based on discovered tasks:
```markdown
# Workflow Execute Coordination
*Session: WFS-[topic-slug]*
- [ ] **TASK-001**: [Agent: code-developer] [FLOW_CONTROL] Design auth schema (IMPL-1.1)
- [ ] **TASK-002**: [Agent: code-developer] [FLOW_CONTROL] Implement auth logic (IMPL-1.2)
- [ ] **TASK-003**: [Agent: code-review-agent] Review implementations
- [ ] **TASK-004**: Update task statuses and session state
**Marker Legend**:
- [FLOW_CONTROL] = Agent must execute flow control steps with context accumulation
```
### 3. Agent Context Assignment
**Task JSON Structure**:
```json
{
"id": "IMPL-1.1",
"title": "Design auth schema",
"status": "pending",
"meta": { "type": "feature", "agent": "code-developer" },
"context": {
"requirements": ["JWT authentication", "User model design"],
"focus_paths": ["src/auth/models", "tests/auth"],
"acceptance": ["Schema validates JWT tokens"],
"depends_on": [],
"inherited": { "from": "IMPL-1", "context": ["..."] }
},
"flow_control": {
"pre_analysis": [
{
"step": "analyze_patterns",
"action": "Analyze existing auth patterns",
"command": "~/.claude/scripts/gemini-wrapper -p '@{src/auth/**/*} analyze patterns'",
"output_to": "pattern_analysis",
"on_error": "fail"
}
],
"implementation_approach": "Design flexible user schema",
"target_files": ["src/auth/models/User.ts:UserSchema:10-50"]
}
}
```
**Context Assignment Rules**:
- Use complete task JSON including flow_control
- Load dependency summaries from context.depends_on
- Execute flow_control.pre_analysis steps sequentially
- Direct agents to context.focus_paths
- Auto-add [FLOW_CONTROL] marker when pre_analysis exists
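These rules reduce to a small amount of glue code. A hedged sketch, assuming the `.summaries/<task-id>-summary.md` naming used elsewhere in this document:
```python
import json
from pathlib import Path

def prepare_task_context(task_path, summaries_dir):
    """Load a task JSON, its dependency summaries, and the flow-control marker."""
    task = json.loads(Path(task_path).read_text(encoding="utf-8"))
    summaries = {}
    for dep in task.get("context", {}).get("depends_on", []):
        summary_file = Path(summaries_dir) / f"{dep}-summary.md"
        summaries[dep] = (summary_file.read_text(encoding="utf-8")
                          if summary_file.exists() else f"No summary for {dep}")
    # Auto-add [FLOW_CONTROL] when pre_analysis steps exist
    has_flow = bool(task.get("flow_control", {}).get("pre_analysis"))
    return task, summaries, ("[FLOW_CONTROL] " if has_flow else "")
```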
### 4. Agent Execution Pattern
```bash
Task(subagent_type="code-developer",
prompt="[FLOW_CONTROL] Execute IMPL-1.2: Implement JWT authentication system with flow control
Task Context: IMPL-1.2 - Flow control managed execution
FLOW CONTROL EXECUTION:
Execute the following steps sequentially with context accumulation:
Step 1 (gather_context): Load dependency summaries
Command: for dep in ${depends_on}; do cat .summaries/$dep-summary.md 2>/dev/null || echo "No summary for $dep"; done
Output: dependency_context
Step 2 (analyze_patterns): Analyze existing auth patterns
Command: ~/.claude/scripts/gemini-wrapper -p '@{src/auth/**/*} analyze authentication patterns with context: [dependency_context]'
Output: pattern_analysis
Step 3 (implement): Implement JWT based on analysis
Command: codex --full-auto exec 'Implement JWT using analysis: [pattern_analysis] and context: [dependency_context]' -s danger-full-access
Session Context:
- Workflow Directory: .workflow/WFS-user-auth/
- TODO_LIST Location: .workflow/WFS-user-auth/TODO_LIST.md
- Summaries Directory: .workflow/WFS-user-auth/.summaries/
- Task JSON Location: .workflow/WFS-user-auth/.task/IMPL-1.2.json
Implementation Guidance:
- Approach: Design flexible user schema supporting JWT and OAuth authentication
- Target Files: src/auth/models/User.ts:UserSchema:10-50
- Focus Paths: src/auth/models, tests/auth
- Dependencies: From context.depends_on
- Inherited Context: [context.inherited]
IMPORTANT:
1. Execute flow control steps in sequence with error handling
2. Accumulate context through step chain
3. Provide detailed completion report for summary generation
4. Mark task as completed - system will auto-generate summary and update TODO_LIST.md",
description="Execute task with flow control step processing")
```
**Execution Protocol**:
- Sequential execution respecting dependencies
- Progress tracking through TodoWrite updates
- Status updates after completion
- Cross-agent result coordination
## File Structure & Analysis
### Workflow Structure
```
.workflow/WFS-[topic-slug]/
├── workflow-session.json # Session state
├── IMPL_PLAN.md # Requirements
├── .task/ # Task definitions
│ ├── IMPL-1.json
│ └── IMPL-1.1.json
└── .summaries/ # Completion summaries
```
### Task Status Logic
```
pending + dependencies_met → executable
completed → skip
blocked → skip until dependencies clear
```
## TodoWrite Coordination
**Real-time progress tracking** with immediate status updates:
#### TodoWrite Workflow Rules
1. **Initial Creation**: Generate TodoWrite from discovered pending tasks
2. **Single In-Progress**: Mark ONLY ONE task as `in_progress` at a time
3. **Immediate Updates**: Update status after each task completion
4. **Status Synchronization**: Sync with JSON task files after updates
#### TodoWrite Template
```markdown
# Workflow Execute Progress
*Session: WFS-[topic-slug]*
- [⚠️] **IMPL-1.1**: [code-developer] [FLOW_CONTROL] Design auth schema
- [ ] **IMPL-1.2**: [code-developer] [FLOW_CONTROL] Implement auth logic
- [ ] **IMPL-2**: [code-review-agent] Review implementations
**Status Legend**:
- [ ] = Pending task
- [⚠️] = Currently in progress
- [✅] = Completed task
- [FLOW_CONTROL] = Requires pre-analysis step execution
```
#### Update Timing
- **Before Agent Launch**: Mark task as `in_progress` (⚠️)
- **After Task Complete**: Mark as `completed` (✅), advance to next
- **On Error**: Keep as `in_progress`, add error note
- **Session End**: Sync all statuses with JSON files
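Since TODO_LIST.md is a plain Markdown file, the marker transitions above amount to a line rewrite. A minimal sketch (the markers match the status legend; the helper name is illustrative):
```python
from pathlib import Path

MARKERS = {"pending": "[ ]", "in_progress": "[⚠️]", "completed": "[✅]"}

def set_todo_status(todo_path, task_id, status):
    """Rewrite the marker on the TODO_LIST.md line mentioning task_id."""
    path = Path(todo_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines):
        if f"**{task_id}**" in line:
            for marker in MARKERS.values():
                line = line.replace(marker, MARKERS[status], 1)
            lines[i] = line
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```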
### 3. Agent Context Management
**Comprehensive context preparation** for autonomous agent execution:
#### Context Sources (Priority Order)
1. **Complete Task JSON**: Full task definition including all fields
2. **Flow Control Context**: Accumulated outputs from pre_analysis steps
3. **Dependency Summaries**: Previous task completion summaries
4. **Session Context**: Workflow paths and session metadata
5. **Inherited Context**: Parent task context and shared variables
#### Context Assembly Process
```
1. Load Task JSON → Base context
2. Execute Flow Control → Accumulated context
3. Load Dependencies → Dependency context
4. Prepare Session Paths → Session context
5. Combine All → Complete agent context
```
#### Agent Context Package Structure
```json
{
"task": { /* Complete task JSON */ },
"flow_context": {
"step_outputs": { "pattern_analysis": "...", "dependency_context": "..." }
},
"session": {
"workflow_dir": ".workflow/WFS-session/",
"todo_list_path": ".workflow/WFS-session/TODO_LIST.md",
"summaries_dir": ".workflow/WFS-session/.summaries/",
"task_json_path": ".workflow/WFS-session/.task/IMPL-1.1.json"
},
"dependencies": [ /* Task summaries from depends_on */ ],
"inherited": { /* Parent task context */ }
}
```
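Assembling that package is a pure function over the pieces gathered earlier. A sketch, reusing the illustrative helpers from the previous sections:
```python
def assemble_agent_context(task, flow_outputs, dep_summaries, workflow_dir):
    """Combine task JSON, flow-control outputs, and session paths."""
    return {
        "task": task,
        "flow_context": {"step_outputs": flow_outputs},
        "session": {
            "workflow_dir": f"{workflow_dir}/",
            "todo_list_path": f"{workflow_dir}/TODO_LIST.md",
            "summaries_dir": f"{workflow_dir}/.summaries/",
            "task_json_path": f"{workflow_dir}/.task/{task['id']}.json",
        },
        "dependencies": list(dep_summaries.values()),
        "inherited": task.get("context", {}).get("inherited", {}),
    }
```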
#### Context Validation Rules
- **Task JSON Complete**: All 5 fields present and valid
- **Flow Control Ready**: All pre_analysis steps completed if present
- **Dependencies Loaded**: All depends_on summaries available
- **Session Paths Valid**: All workflow paths exist and accessible
- **Agent Assignment**: Valid agent type specified in meta.agent
### 4. Agent Execution Pattern
**Structured agent invocation** with complete context and clear instructions:
#### Agent Prompt Template
```bash
Task(subagent_type="{agent_type}",
prompt="Execute {task_id}: {task_title}
## Task Definition
**ID**: {task_id}
**Type**: {task_type}
**Agent**: {assigned_agent}
## Execution Instructions
{flow_control_marker}
### Flow Control Steps (if [FLOW_CONTROL] present)
Execute sequentially with context accumulation:
{pre_analysis_steps}
### Implementation Context
**Requirements**: {context.requirements}
**Focus Paths**: {context.focus_paths}
**Acceptance Criteria**: {context.acceptance}
**Target Files**: {flow_control.target_files}
### Session Context
**Workflow Directory**: {session.workflow_dir}
**TODO List Path**: {session.todo_list_path}
**Summaries Directory**: {session.summaries_dir}
**Task JSON Path**: {session.task_json_path}
### Dependencies & Context
**Dependencies**: {context.depends_on}
**Inherited Context**: {context.inherited}
**Previous Outputs**: {flow_context.step_outputs}
## Completion Requirements
1. Execute all flow control steps if present
2. Implement according to acceptance criteria
3. Update TODO_LIST.md at provided path
4. Generate summary in summaries directory
5. Mark task as completed in task JSON",
description="{task_description}")
```
#### Execution Flow
1. **Prepare Agent Context**: Assemble complete context package
2. **Generate Prompt**: Fill template with task and context data (sketched below)
3. **Launch Agent**: Invoke specialized agent with structured prompt
4. **Monitor Execution**: Track progress and handle errors
5. **Collect Results**: Process agent outputs and update status
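Step 2 is plain template filling. A hedged sketch over the context package structure above; field access assumes the package was assembled as shown earlier:
```python
def render_agent_prompt(pkg):
    """Fill the agent prompt template from an assembled context package."""
    task = pkg["task"]
    meta = task.get("meta", {})
    tctx = task.get("context", {})
    lines = [f"Execute {task['id']}: {task['title']}",
             f"**Type**: {meta.get('type', '')}",
             f"**Agent**: {meta.get('agent', '')}"]
    if task.get("flow_control", {}).get("pre_analysis"):
        lines.append("[FLOW_CONTROL]")
    lines += [f"**Requirements**: {tctx.get('requirements', [])}",
              f"**Focus Paths**: {tctx.get('focus_paths', [])}",
              f"**Workflow Directory**: {pkg['session']['workflow_dir']}",
              f"**Previous Outputs**: {pkg['flow_context']['step_outputs']}"]
    return "\n".join(lines)
```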
#### Agent Assignment Rules
```
meta.agent specified → Use specified agent
meta.agent missing → Infer from meta.type:
  - "feature" → code-developer
  - "test" → code-review-test-agent
  - "review" → code-review-agent
  - "docs" → doc-generator
```
## Status Management & Coordination
### Task Status Updates
```json
// Before execution
{ "id": "IMPL-1.2", "status": "pending", "execution": { "attempts": 0 } }
// After execution
{ "id": "IMPL-1.2", "status": "completed", "execution": { "attempts": 1, "last_attempt": "2025-09-08T14:30:00Z" } }
```
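Applying this before/after transition is a small JSON rewrite. A sketch; the timestamp format follows the example above, and atomic-write concerns are covered later under Error Prevention:
```python
import json
from datetime import datetime, timezone

def mark_task_completed(task_path):
    """Set status and execution metadata as in the before/after example above."""
    with open(task_path, "r", encoding="utf-8") as f:
        task = json.load(f)
    execution = task.setdefault("execution", {"attempts": 0})
    execution["attempts"] = execution.get("attempts", 0) + 1
    execution["last_attempt"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    task["status"] = "completed"
    with open(task_path, "w", encoding="utf-8") as f:
        json.dump(task, f, indent=2)
```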
### Coordination Strategies
- **Dependencies**: Execute in dependency order
- **Agent Handoffs**: Pass results between agents
- **Progress Updates**: Update TodoWrite and JSON files
- **Context Distribution**: Complete task JSON + workflow context
- **Focus Areas**: Direct agents to specific paths from task.context.focus_paths
#### Error Handling During Execution
- **Agent Failure**: Retry once with adjusted context
- **Flow Control Error**: Skip optional steps, fail on critical
- **Context Missing**: Reload from JSON files and retry
- **Timeout**: Mark as blocked, continue with next task
## Workflow File Structure Reference
```
.workflow/WFS-[topic-slug]/
├── workflow-session.json # Session state and metadata
├── IMPL_PLAN.md # Planning document and requirements
├── TODO_LIST.md # Progress tracking (auto-updated)
├── .task/ # Task definitions (JSON only)
│ ├── IMPL-1.json # Main task definitions
│ └── IMPL-1.1.json # Subtask definitions
├── .summaries/ # Task completion summaries
│ ├── IMPL-1-summary.md # Task completion details
│ └── IMPL-1.1-summary.md # Subtask completion details
└── .process/ # Planning artifacts
└── ANALYSIS_RESULTS.md # Planning analysis results
```
## Error Handling & Recovery
### Discovery Phase Errors
| Error | Cause | Resolution | Command |
|-------|-------|------------|---------|
| No active session | No `.active-*` markers found | Create or resume session | `/workflow:plan "project"` |
| Multiple sessions | Multiple `.active-*` markers | Select specific session | Manual choice prompt |
| Corrupted session | Invalid JSON files | Recreate session structure | `/workflow:status --validate` |
| Missing task files | Broken task references | Regenerate tasks | `/task:create` or repair |
### Execution Phase Errors
| Error | Cause | Recovery Strategy | Max Attempts |
|-------|-------|------------------|--------------|
| Agent failure | Agent crash/timeout | Retry with simplified context | 2 |
| Flow control error | Command failure | Skip optional, fail critical | 1 per step |
| Context loading error | Missing dependencies | Reload from JSON, use defaults | 3 |
| JSON file corruption | File system issues | Restore from backup/recreate | 1 |
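A generic retry wrapper covering the attempt limits in this table might look like the following sketch; recovery-specific adjustments, such as simplifying context between attempts, are left to the caller:
```python
import time

def with_retries(operation, max_attempts, delay_seconds=1.0):
    """Run operation(); retry up to max_attempts times on exception."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as e:
            last_error = e
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"Failed after {max_attempts} attempts") from last_error
```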
### Recovery Procedures
#### Session Recovery
```bash
# Check session integrity
find .workflow -name ".active-*" | while read marker; do
  session=$(basename "$marker" | sed 's/^\.active-//')
  if [ ! -d ".workflow/$session" ]; then
    echo "Removing orphaned marker: $marker"
    rm "$marker"
  fi
done

# Recreate corrupted session files (assumes $session is set, e.g. from the
# selection step; the while-read loop above runs in a subshell and does not
# export it)
if [ ! -f ".workflow/$session/workflow-session.json" ]; then
  echo '{"session_id":"'$session'","status":"active"}' > ".workflow/$session/workflow-session.json"
fi
```
#### Task Recovery
```bash
# Validate task JSON integrity
for task_file in .workflow/$session/.task/*.json; do
  if ! jq empty "$task_file" 2>/dev/null; then
    echo "Corrupted task file: $task_file"
    # Backup and regenerate or restore from backup
  fi
done

# Fix missing dependencies
missing_deps=$(jq -r '.context.depends_on[]?' .workflow/$session/.task/*.json | sort -u)
for dep in $missing_deps; do
  if [ ! -f ".workflow/$session/.task/$dep.json" ]; then
    echo "Missing dependency: $dep - creating placeholder"
  fi
done
```
#### Context Recovery
```bash
# Reload context from available sources
if [ -f ".workflow/$session/.process/ANALYSIS_RESULTS.md" ]; then
  echo "Reloading planning context..."
fi

# Restore from documentation if available
if [ -d ".workflow/docs/" ]; then
  echo "Using documentation context as fallback..."
fi
```
### Error Prevention
- **Pre-flight Checks**: Validate session integrity before execution
- **Backup Strategy**: Create task snapshots before major operations
- **Atomic Updates**: Update JSON files atomically to prevent corruption (see the sketch after this list)
- **Dependency Validation**: Check all depends_on references exist
- **Context Verification**: Ensure all required context is available
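The atomic-update rule is the one most worth spelling out: write to a temporary file in the same directory, then rename over the original, so a crash never leaves a half-written task JSON. A minimal sketch:
```python
import json
import os
import tempfile

def atomic_update_task(task_path, updates):
    """Apply updates to a task JSON file without risking partial writes."""
    with open(task_path, "r", encoding="utf-8") as f:
        task = json.load(f)
    task.update(updates)
    dir_name = os.path.dirname(os.path.abspath(task_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(task, f, indent=2)
        os.replace(tmp_path, task_path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```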
## Usage Examples & Integration
### Complete Execution Workflow
```bash
# 1. Check current status
/workflow:status
# 2. Execute workflow tasks
/workflow:execute
# 3. Monitor progress
/workflow:status --format=hierarchy
# 4. Continue with remaining tasks
/workflow:execute
# 5. Review when complete
/workflow:review
```
### Common Scenarios
#### Single Task Execution
```bash
/task:execute IMPL-1.2 # Execute specific task
```
#### Resume After Error
```bash
/workflow:status --validate # Check for issues
/workflow:execute # Resume execution
```
#### Multiple Session Management
```bash
# Will prompt for session selection if multiple active
/workflow:execute
# Or check status first
find .workflow -name ".active-*" -exec basename {} \; | sed 's/^\.active-//'
```
### Integration Points
- **Planning**: Use `/workflow:plan` to create session and tasks
- **Status**: Use `/workflow:status` for real-time progress views
- **Documentation**: Use `/workflow:docs` for context generation
- **Review**: Use `/workflow:review` for completion validation
### Key Benefits
- **Autonomous Execution**: Agents work independently with complete context
- **Progress Tracking**: Real-time TodoWrite updates and status synchronization
- **Error Recovery**: Comprehensive error handling and recovery procedures
- **Context Management**: Systematic context accumulation and distribution
- **Flow Control**: Sequential step execution with variable passing


@@ -0,0 +1,141 @@
#!/usr/bin/env python3
"""
API Documentation Indexer
Parses Markdown documentation to create a searchable index of classes and methods.
"""
import os
import re
import json
import logging
from pathlib import Path
from typing import Dict, Any
from core.file_indexer import FileIndexer


class ApiIndexer:
    def __init__(self, config: Dict, root_path: str = "."):
        self.config = config
        self.root_path = Path(root_path).resolve()
        self.file_indexer = FileIndexer(config, root_path)
        self.api_index_file = self.file_indexer.cache_dir / "api_index.json"
        self.logger = logging.getLogger(__name__)

    def build_index(self):
        """Builds the API index from Markdown files."""
        self.logger.info("Building API index...")
        file_index = self.file_indexer.load_index()
        if not file_index:
            self.logger.info("File index not found, building it first.")
            self.file_indexer.build_index()
            file_index = self.file_indexer.load_index()
        api_index = {}
        for file_info in file_index.values():
            if file_info.extension == ".md":
                self.logger.debug(f"Parsing {file_info.path}")
                try:
                    with open(file_info.path, "r", encoding="utf-8") as f:
                        content = f.read()
                    self._parse_markdown(content, file_info.relative_path, api_index)
                except Exception as e:
                    self.logger.error(f"Error parsing {file_info.path}: {e}")
        self._save_index(api_index)
        self.logger.info(f"API index built with {len(api_index)} classes.")

    def _parse_markdown(self, content: str, file_path: str, api_index: Dict):
        """Parses a single Markdown file for class and method info."""
        # The first level-1 heading is treated as the class name
        class_name_match = re.search(r"^#\s+([A-Za-z0-9_]+)", content)
        if not class_name_match:
            return
        class_name = class_name_match.group(1)
        api_index[class_name] = {
            "file_path": file_path,
            "description": "",
            "methods": {}
        }
        # Simple description extraction
        desc_match = re.search(r"\*\*Description:\*\*\s*(.+)", content)
        if desc_match:
            api_index[class_name]["description"] = desc_match.group(1).strip()
        # Method extraction: each "### `signature`" section describes one method
        method_sections = re.split(r"###\s+", content)[1:]
        for section in method_sections:
            method_signature_match = re.search(r"`(.+?)`", section)
            if not method_signature_match:
                continue
            signature = method_signature_match.group(1)
            method_name_match = re.search(r"([A-Za-z0-9_]+)\(", signature)
            if not method_name_match:
                continue
            method_name = method_name_match.group(1)
            method_description = ""
            method_desc_match = re.search(r"\*\*Description:\*\*\s*(.+)", section)
            if method_desc_match:
                method_description = method_desc_match.group(1).strip()
            # A simple way to get a line number approximation
            line_number = content.count("\n", 0, content.find(f"### `{signature}`")) + 1
            api_index[class_name]["methods"][method_name] = {
                "signature": signature,
                "description": method_description,
                "line_number": line_number
            }

    def _save_index(self, api_index: Dict):
        """Saves the API index to a file."""
        try:
            with open(self.api_index_file, "w", encoding="utf-8") as f:
                json.dump(api_index, f, indent=2)
        except IOError as e:
            self.logger.error(f"Could not save API index: {e}")

    def search(self, class_name: str, method_name: str = None) -> Any:
        """Searches the API index for a class or method."""
        if not self.api_index_file.exists():
            self.build_index()
        with open(self.api_index_file, "r", encoding="utf-8") as f:
            api_index = json.load(f)
        if class_name not in api_index:
            return None
        if method_name:
            return api_index[class_name]["methods"].get(method_name)
        return api_index[class_name]


if __name__ == "__main__":
    from core.config import get_config
    import argparse

    logging.basicConfig(level=logging.INFO)
    parser = argparse.ArgumentParser(description="API Documentation Indexer.")
    parser.add_argument("--build", action="store_true", help="Build the API index.")
    parser.add_argument("--search_class", help="Search for a class.")
    parser.add_argument("--search_method", help="Search for a method within a class (requires --search_class).")
    args = parser.parse_args()
    config = get_config()
    api_indexer = ApiIndexer(config.to_dict())
    if args.build:
        api_indexer.build_index()
    if args.search_class:
        result = api_indexer.search(args.search_class, args.search_method)
        if result:
            print(json.dumps(result, indent=2))
        else:
            print("Not found.")


@@ -1,156 +0,0 @@
{
"analyzer.py": {
"file_path": "analyzer.py",
"content_hash": "9a7665c34d5ac84634342f8b1425bb13",
"embedding_hash": "fb5b5a58ec8e070620747c7313b0b2b6",
"created_time": 1758175163.6748724,
"vector_size": 384
},
"config.yaml": {
"file_path": "config.yaml",
"content_hash": "fc0526eea28cf37d15425035d2dd17d9",
"embedding_hash": "4866d8bd2b14c16c448c34c0251d199e",
"created_time": 1758175163.6748896,
"vector_size": 384
},
"install.sh": {
"file_path": "install.sh",
"content_hash": "6649df913eadef34fa2f253aed541dfd",
"embedding_hash": "54af072da7c1139108c79b64bd1ee291",
"created_time": 1758175163.6748998,
"vector_size": 384
},
"requirements.txt": {
"file_path": "requirements.txt",
"content_hash": "e981a0aa103bdec4a99b75831967766d",
"embedding_hash": "37bc877ea041ad606234262423cf578a",
"created_time": 1758175163.6749053,
"vector_size": 384
},
"setup.py": {
"file_path": "setup.py",
"content_hash": "7b93af473bfe37284c6cf493458bc421",
"embedding_hash": "bdda9a6e8d3bd34465436b119a17e263",
"created_time": 1758175163.6749127,
"vector_size": 384
},
"__init__.py": {
"file_path": "__init__.py",
"content_hash": "c981c4ffc664bbd3c253d0dc82f48ac6",
"embedding_hash": "3ab1a0c5d0d4bd832108b7a6ade0ad9c",
"created_time": 1758175163.6749194,
"vector_size": 384
},
"cache\\file_index.json": {
"file_path": "cache\\file_index.json",
"content_hash": "6534fef14d12e39aff1dc0dcf5b91d1d",
"embedding_hash": "d76efa530f0d21e52f9d5b3a9ccc358c",
"created_time": 1758175163.6749268,
"vector_size": 384
},
"core\\config.py": {
"file_path": "core\\config.py",
"content_hash": "ee72a95cea7397db8dd25b10a4436eaa",
"embedding_hash": "65d1fca1cf59bcd36409c3b11f50aab1",
"created_time": 1758175163.6749349,
"vector_size": 384
},
"core\\context_analyzer.py": {
"file_path": "core\\context_analyzer.py",
"content_hash": "2e9ac2050e463c9d3f94bad23e65d4e5",
"embedding_hash": "dfb51c8eaafd96ac544b3d9c8dcd3f51",
"created_time": 1758175163.674943,
"vector_size": 384
},
"core\\embedding_manager.py": {
"file_path": "core\\embedding_manager.py",
"content_hash": "cafa24b0431c6463266dde8b37fc3ab7",
"embedding_hash": "531c3206f0caf9789873719cdd644e99",
"created_time": 1758175163.6749508,
"vector_size": 384
},
"core\\file_indexer.py": {
"file_path": "core\\file_indexer.py",
"content_hash": "0626c89c060d6022261ca094aed47093",
"embedding_hash": "93d5fc6e84334d3bd9be0f07f9823b20",
"created_time": 1758175163.6749592,
"vector_size": 384
},
"core\\gitignore_parser.py": {
"file_path": "core\\gitignore_parser.py",
"content_hash": "5f1d87fb03bc3b19833406be0fa5125f",
"embedding_hash": "784be673b6b428cce60ab5390bfc7f08",
"created_time": 1758175163.6749675,
"vector_size": 384
},
"core\\path_matcher.py": {
"file_path": "core\\path_matcher.py",
"content_hash": "89132273951a091610c1579ccc44f3a7",
"embedding_hash": "e01ca0180c2834a514ad6d8e62315ce0",
"created_time": 1758175163.6749754,
"vector_size": 384
},
"core\\__init__.py": {
"file_path": "core\\__init__.py",
"content_hash": "3a323be141f1ce6b9d9047aa444029b0",
"embedding_hash": "3fc5a5427067e59b054428083a5899ca",
"created_time": 1758175163.6749818,
"vector_size": 384
},
"tools\\module_analyzer.py": {
"file_path": "tools\\module_analyzer.py",
"content_hash": "926289c2fd8d681ed20c445d2ac34fa1",
"embedding_hash": "3378fcde062914859b765d8dfce1207f",
"created_time": 1758175163.67499,
"vector_size": 384
},
"tools\\tech_stack.py": {
"file_path": "tools\\tech_stack.py",
"content_hash": "eef6eabcbc8ba0ece0dfacb9314f3585",
"embedding_hash": "bc3aa5334ef17328490bc5a8162d776a",
"created_time": 1758175163.674997,
"vector_size": 384
},
"tools\\workflow_updater.py": {
"file_path": "tools\\workflow_updater.py",
"content_hash": "40d7d884e0db24eb45aa27739fef8210",
"embedding_hash": "00488f4acdb7fe1b5126da4da3bb9869",
"created_time": 1758175163.6750047,
"vector_size": 384
},
"tools\\__init__.py": {
"file_path": "tools\\__init__.py",
"content_hash": "41bf583571f4355e4af90842d0674b1f",
"embedding_hash": "fccd7745f9e1e242df3bace7cee9759c",
"created_time": 1758175163.6750097,
"vector_size": 384
},
"utils\\cache.py": {
"file_path": "utils\\cache.py",
"content_hash": "dc7c08bcd9af9ae465020997e4b9127e",
"embedding_hash": "68394bc0f57a0f66b83a57249b39957d",
"created_time": 1758175163.6750169,
"vector_size": 384
},
"utils\\colors.py": {
"file_path": "utils\\colors.py",
"content_hash": "8ce555a2dcf4057ee7adfb3286d47da2",
"embedding_hash": "1b18e22acb095e83ed291b6c5dc7a2ce",
"created_time": 1758175163.6750243,
"vector_size": 384
},
"utils\\io_helpers.py": {
"file_path": "utils\\io_helpers.py",
"content_hash": "fb276a0e46b28f80d5684368a8b15e57",
"embedding_hash": "f6ff8333b1afc5b98d4644f334c18cda",
"created_time": 1758175163.6750326,
"vector_size": 384
},
"utils\\__init__.py": {
"file_path": "utils\\__init__.py",
"content_hash": "f305ede9cbdec2f2e0189a4b89558b7e",
"embedding_hash": "7d3f10fe4210d40eafd3c065b8e0c8b7",
"created_time": 1758175163.6750393,
"vector_size": 384
}
}

Binary file not shown.


@@ -66,11 +66,12 @@ file_extensions:
# Embedding/RAG configuration
embedding:
enabled: true # Set to true to enable RAG features
model: "all-MiniLM-L6-v2" # Lightweight sentence transformer
model: "codesage/codesage-large-v2" # CodeSage V2 for code embeddings
cache_dir: "cache"
similarity_threshold: 0.3
max_context_length: 512
batch_size: 32
similarity_threshold: 0.6 # Higher threshold for better code similarity
max_context_length: 2048 # Increased for CodeSage V2 capabilities
batch_size: 8 # Reduced for larger model
trust_remote_code: true # Required for CodeSage V2
# Context analysis settings
context_analysis:


@@ -75,6 +75,7 @@ class EmbeddingManager:
self.similarity_threshold = config.get('embedding', {}).get('similarity_threshold', 0.6)
self.max_context_length = config.get('embedding', {}).get('max_context_length', 512)
self.batch_size = config.get('embedding', {}).get('batch_size', 32)
self.trust_remote_code = config.get('embedding', {}).get('trust_remote_code', False)
# Setup cache directories
self.cache_dir.mkdir(parents=True, exist_ok=True)
@@ -95,7 +96,11 @@ class EmbeddingManager:
if self._model is None:
try:
self.logger.info(f"Loading embedding model: {self.model_name}")
self._model = SentenceTransformer(self.model_name)
# Initialize with trust_remote_code for CodeSage V2
if self.trust_remote_code:
self._model = SentenceTransformer(self.model_name, trust_remote_code=True)
else:
self._model = SentenceTransformer(self.model_name)
self.logger.info(f"Model loaded successfully")
except Exception as e:
self.logger.error(f"Failed to load embedding model: {e}")
@@ -203,7 +208,7 @@ class EmbeddingManager:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
# Truncate content if too long
# Truncate content if too long (CodeSage V2 supports longer contexts)
if len(content) > self.max_context_length * 4: # Approximate token limit
content = content[:self.max_context_length * 4]


@@ -2,14 +2,18 @@
numpy>=1.21.0
scikit-learn>=1.0.0
# Sentence Transformers for advanced embeddings
sentence-transformers>=2.2.0
# Sentence Transformers for advanced embeddings (CodeSage V2 compatible)
sentence-transformers>=3.0.0
transformers>=4.40.0
# Optional: For better performance and additional models
torch>=1.9.0
# PyTorch for model execution (required for CodeSage V2)
torch>=2.0.0
# Development and testing
pytest>=6.0.0
# Data handling
pandas>=1.3.0
# Additional dependencies for CodeSage V2
accelerate>=0.26.0