mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-02-13 02:41:50 +08:00
Add scripts for inspecting LLM summaries and testing misleading comments
- Implement `inspect_llm_summaries.py` to display LLM-generated summaries from the semantic_chunks table in the database. - Create `show_llm_analysis.py` to demonstrate LLM analysis of misleading code examples, highlighting discrepancies between comments and actual functionality. - Develop `test_misleading_comments.py` to compare pure vector search with LLM-enhanced search, focusing on the impact of misleading or missing comments on search results. - Introduce `test_llm_enhanced_search.py` to provide a test suite for evaluating the effectiveness of LLM-enhanced vector search against pure vector search. - Ensure all new scripts are integrated with the existing codebase and follow the established coding standards.
This commit is contained in:
463
codex-lens/docs/LLM_ENHANCED_SEARCH_GUIDE.md
Normal file
463
codex-lens/docs/LLM_ENHANCED_SEARCH_GUIDE.md
Normal file
@@ -0,0 +1,463 @@
|
||||
# LLM-Enhanced Semantic Search Guide
|
||||
|
||||
**Last Updated**: 2025-12-16
|
||||
**Status**: Experimental Feature
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
CodexLens supports two approaches for semantic vector search:
|
||||
|
||||
| Approach | Pipeline | Best For |
|
||||
|----------|----------|----------|
|
||||
| **Pure Vector** | Code → fastembed → search | Code pattern matching, exact functionality |
|
||||
| **LLM-Enhanced** | Code → LLM summary → fastembed → search | Natural language queries, conceptual search |
|
||||
|
||||
### Why LLM Enhancement?
|
||||
|
||||
**Problem**: Raw code embeddings don't match natural language well.
|
||||
|
||||
```
|
||||
Query: "How do I hash passwords securely?"
|
||||
Raw code: def hash_password(password: str) -> str: ...
|
||||
Mismatch: Low semantic similarity
|
||||
```
|
||||
|
||||
**Solution**: LLM generates natural language summaries.
|
||||
|
||||
```
|
||||
Query: "How do I hash passwords securely?"
|
||||
LLM Summary: "Hash a password using bcrypt with specified salt rounds for secure storage"
|
||||
Match: High semantic similarity ✓
|
||||
```
|
||||
|
||||
## Architecture
|
||||
|
||||
### Pure Vector Search Flow
|
||||
|
||||
```
|
||||
1. Code File
|
||||
└→ "def hash_password(password: str): ..."
|
||||
|
||||
2. Chunking
|
||||
└→ Split into semantic chunks (500-2000 chars)
|
||||
|
||||
3. Embedding (fastembed)
|
||||
└→ Generate 768-dim vector from raw code
|
||||
|
||||
4. Storage
|
||||
└→ Store vector in semantic_chunks table
|
||||
|
||||
5. Query
|
||||
└→ "How to hash passwords"
|
||||
└→ Generate query vector
|
||||
└→ Find similar vectors (cosine similarity)
|
||||
```
|
||||
|
||||
**Pros**: Fast, no external dependencies, good for code patterns
|
||||
**Cons**: Poor semantic match for natural language queries
|
||||
|
||||
### LLM-Enhanced Search Flow
|
||||
|
||||
```
|
||||
1. Code File
|
||||
└→ "def hash_password(password: str): ..."
|
||||
|
||||
2. LLM Analysis (Gemini/Qwen via CCW)
|
||||
└→ Generate summary: "Hash a password using bcrypt..."
|
||||
└→ Extract keywords: ["password", "hash", "bcrypt", "security"]
|
||||
└→ Identify purpose: "auth"
|
||||
|
||||
3. Embeddable Text Creation
|
||||
└→ Combine: summary + keywords + purpose + filename
|
||||
|
||||
4. Embedding (fastembed)
|
||||
└→ Generate 768-dim vector from LLM text
|
||||
|
||||
5. Storage
|
||||
└→ Store vector with metadata
|
||||
|
||||
6. Query
|
||||
└→ "How to hash passwords"
|
||||
└→ Generate query vector
|
||||
└→ Find similar vectors → Better match! ✓
|
||||
```
|
||||
|
||||
**Pros**: Excellent semantic match for natural language
|
||||
**Cons**: Slower, requires CCW CLI and LLM access
|
||||
|
||||
## Setup Requirements
|
||||
|
||||
### 1. Install Dependencies
|
||||
|
||||
```bash
|
||||
# Install semantic search dependencies
|
||||
pip install codexlens[semantic]
|
||||
|
||||
# Install CCW CLI for LLM enhancement
|
||||
npm install -g ccw
|
||||
```
|
||||
|
||||
### 2. Configure LLM Tools
|
||||
|
||||
```bash
|
||||
# Set primary LLM tool (default: gemini)
|
||||
export CCW_CLI_SECONDARY_TOOL=gemini
|
||||
|
||||
# Set fallback tool (default: qwen)
|
||||
export CCW_CLI_FALLBACK_TOOL=qwen
|
||||
|
||||
# Configure API keys (see CCW documentation)
|
||||
ccw config set gemini.apiKey YOUR_API_KEY
|
||||
```
|
||||
|
||||
### 3. Verify Setup
|
||||
|
||||
```bash
|
||||
# Check CCW availability
|
||||
ccw --version
|
||||
|
||||
# Check semantic dependencies
|
||||
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
|
||||
```
|
||||
|
||||
## Running Comparison Tests
|
||||
|
||||
### Method 1: Standalone Script (Recommended)
|
||||
|
||||
```bash
|
||||
# Run full comparison (pure vector + LLM-enhanced)
|
||||
python scripts/compare_search_methods.py
|
||||
|
||||
# Use specific LLM tool
|
||||
python scripts/compare_search_methods.py --tool gemini
|
||||
python scripts/compare_search_methods.py --tool qwen
|
||||
|
||||
# Skip LLM test (only pure vector)
|
||||
python scripts/compare_search_methods.py --skip-llm
|
||||
```
|
||||
|
||||
**Output Example**:
|
||||
|
||||
```
|
||||
======================================================================
|
||||
SEMANTIC SEARCH COMPARISON TEST
|
||||
Pure Vector vs LLM-Enhanced Vector Search
|
||||
======================================================================
|
||||
|
||||
Test dataset: 5 Python files
|
||||
Test queries: 5 natural language questions
|
||||
|
||||
======================================================================
|
||||
PURE VECTOR SEARCH (Code → fastembed)
|
||||
======================================================================
|
||||
Setup: 5 files, 23 chunks in 2.3s
|
||||
|
||||
Query Top Result Score
|
||||
----------------------------------------------------------------------
|
||||
✓ How do I securely hash passwords? password_hasher.py 0.723
|
||||
✗ Generate JWT token for authentication user_endpoints.py 0.645
|
||||
✓ Create new user account via API user_endpoints.py 0.812
|
||||
✓ Validate email address format validation.py 0.756
|
||||
~ Connect to PostgreSQL database connection.py 0.689
|
||||
|
||||
======================================================================
|
||||
LLM-ENHANCED SEARCH (Code → GEMINI → fastembed)
|
||||
======================================================================
|
||||
Generating LLM summaries for 5 files...
|
||||
Setup: 5/5 files indexed in 8.7s
|
||||
|
||||
Query Top Result Score
|
||||
----------------------------------------------------------------------
|
||||
✓ How do I securely hash passwords? password_hasher.py 0.891
|
||||
✓ Generate JWT token for authentication jwt_handler.py 0.867
|
||||
✓ Create new user account via API user_endpoints.py 0.923
|
||||
✓ Validate email address format validation.py 0.845
|
||||
✓ Connect to PostgreSQL database connection.py 0.801
|
||||
|
||||
======================================================================
|
||||
COMPARISON SUMMARY
|
||||
======================================================================
|
||||
|
||||
Query Pure LLM
|
||||
----------------------------------------------------------------------
|
||||
How do I securely hash passwords? ✓ Rank 1 ✓ Rank 1
|
||||
Generate JWT token for authentication ✗ Miss ✓ Rank 1
|
||||
Create new user account via API ✓ Rank 1 ✓ Rank 1
|
||||
Validate email address format ✓ Rank 1 ✓ Rank 1
|
||||
Connect to PostgreSQL database ~ Rank 2 ✓ Rank 1
|
||||
----------------------------------------------------------------------
|
||||
TOTAL SCORE 11 15
|
||||
======================================================================
|
||||
|
||||
ANALYSIS:
|
||||
✓ LLM enhancement improves results by 36.4%
|
||||
Natural language summaries match queries better than raw code
|
||||
```
|
||||
|
||||
### Method 2: Pytest Test Suite
|
||||
|
||||
```bash
|
||||
# Run full test suite
|
||||
pytest tests/test_llm_enhanced_search.py -v -s
|
||||
|
||||
# Run specific test
|
||||
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s
|
||||
|
||||
# Skip LLM tests if CCW not available
|
||||
pytest tests/test_llm_enhanced_search.py -v -s -k "not llm_enhanced"
|
||||
```
|
||||
|
||||
## Using LLM Enhancement in Production
|
||||
|
||||
### Option 1: Enhanced Embeddings Generation (Recommended)
|
||||
|
||||
Create embeddings with LLM enhancement during indexing:
|
||||
|
||||
```python
|
||||
from pathlib import Path
|
||||
from codexlens.semantic.llm_enhancer import create_enhanced_indexer, FileData
|
||||
|
||||
# Create enhanced indexer
|
||||
indexer = create_enhanced_indexer(
|
||||
vector_store_path=Path("~/.codexlens/indexes/project/_index.db"),
|
||||
llm_tool="gemini",
|
||||
llm_enabled=True,
|
||||
)
|
||||
|
||||
# Prepare file data
|
||||
files = [
|
||||
FileData(
|
||||
path="auth/password_hasher.py",
|
||||
content=open("auth/password_hasher.py").read(),
|
||||
language="python"
|
||||
),
|
||||
# ... more files
|
||||
]
|
||||
|
||||
# Index with LLM enhancement
|
||||
indexed_count = indexer.index_files(files)
|
||||
print(f"Indexed {indexed_count} files with LLM enhancement")
|
||||
```
|
||||
|
||||
### Option 2: CLI Integration (Coming Soon)
|
||||
|
||||
```bash
|
||||
# Generate embeddings with LLM enhancement
|
||||
codexlens embeddings-generate ~/projects/my-app --llm-enhanced --tool gemini
|
||||
|
||||
# Check which strategy was used
|
||||
codexlens embeddings-status ~/projects/my-app --show-strategies
|
||||
```
|
||||
|
||||
**Note**: CLI integration is planned but not yet implemented. Currently use Option 1 (Python API).
|
||||
|
||||
### Option 3: Hybrid Approach
|
||||
|
||||
Combine both strategies for best results:
|
||||
|
||||
```python
|
||||
# Generate both pure and LLM-enhanced embeddings
|
||||
# 1. Pure vector for exact code matching
|
||||
generate_pure_embeddings(files)
|
||||
|
||||
# 2. LLM-enhanced for semantic matching
|
||||
generate_llm_embeddings(files)
|
||||
|
||||
# Search uses both and ranks by best match
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Speed Comparison
|
||||
|
||||
| Approach | Indexing Time (100 files) | Query Time | Cost |
|
||||
|----------|---------------------------|------------|------|
|
||||
| Pure Vector | ~30s | ~50ms | Free |
|
||||
| LLM-Enhanced | ~5-10 min | ~50ms | LLM API costs |
|
||||
|
||||
**LLM indexing is slower** because:
|
||||
- Calls external LLM API (gemini/qwen)
|
||||
- Processes files in batches (default: 5 files/batch)
|
||||
- Waits for LLM response (~2-5s per batch)
|
||||
|
||||
**Query speed is identical** because:
|
||||
- Both use fastembed for similarity search
|
||||
- Vector lookup is same speed
|
||||
- Difference is only in what was embedded
|
||||
|
||||
### Cost Estimation
|
||||
|
||||
**Gemini Flash (via CCW)**:
|
||||
- ~$0.10 per 1M input tokens
|
||||
- Average: ~500 tokens per file
|
||||
- 100 files = ~$0.005 (half a cent)
|
||||
|
||||
**Qwen (local)**:
|
||||
- Free if running locally
|
||||
- Slower than Gemini Flash
|
||||
|
||||
### When to Use Each Approach
|
||||
|
||||
| Use Case | Recommendation |
|
||||
|----------|----------------|
|
||||
| **Code pattern search** | Pure vector (e.g., "find all REST endpoints") |
|
||||
| **Natural language queries** | LLM-enhanced (e.g., "how to authenticate users") |
|
||||
| **Large codebase** | Pure vector first, LLM for important modules |
|
||||
| **Personal projects** | LLM-enhanced (cost is minimal) |
|
||||
| **Enterprise** | Hybrid approach |
|
||||
|
||||
## Configuration Options
|
||||
|
||||
### LLM Config
|
||||
|
||||
```python
|
||||
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer
|
||||
|
||||
config = LLMConfig(
|
||||
tool="gemini", # Primary LLM tool
|
||||
fallback_tool="qwen", # Fallback if primary fails
|
||||
timeout_ms=300000, # 5 minute timeout
|
||||
batch_size=5, # Files per batch
|
||||
max_content_chars=8000, # Max chars per file in prompt
|
||||
enabled=True, # Enable/disable LLM
|
||||
)
|
||||
|
||||
enhancer = LLMEnhancer(config)
|
||||
```
|
||||
|
||||
### Environment Variables
|
||||
|
||||
```bash
|
||||
# Override default LLM tool
|
||||
export CCW_CLI_SECONDARY_TOOL=gemini
|
||||
|
||||
# Override fallback tool
|
||||
export CCW_CLI_FALLBACK_TOOL=qwen
|
||||
|
||||
# Disable LLM enhancement (fall back to pure vector)
|
||||
export CODEXLENS_LLM_ENABLED=false
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue 1: CCW CLI Not Found
|
||||
|
||||
**Error**: `CCW CLI not found in PATH, LLM enhancement disabled`
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Install CCW globally
|
||||
npm install -g ccw
|
||||
|
||||
# Verify installation
|
||||
ccw --version
|
||||
|
||||
# Check PATH
|
||||
which ccw # Unix
|
||||
where ccw # Windows
|
||||
```
|
||||
|
||||
### Issue 2: LLM API Errors
|
||||
|
||||
**Error**: `LLM call failed: HTTP 429 Too Many Requests`
|
||||
|
||||
**Solution**:
|
||||
- Reduce batch size in LLMConfig
|
||||
- Add delay between batches
|
||||
- Check API quota/limits
|
||||
- Try fallback tool (qwen)
|
||||
|
||||
### Issue 3: Poor LLM Summaries
|
||||
|
||||
**Symptom**: LLM summaries are too generic or inaccurate
|
||||
|
||||
**Solution**:
|
||||
- Try different LLM tool (gemini vs qwen)
|
||||
- Increase max_content_chars (default 8000)
|
||||
- Manually review and refine summaries
|
||||
- Fall back to pure vector for code-heavy files
|
||||
|
||||
### Issue 4: Slow Indexing
|
||||
|
||||
**Symptom**: Indexing takes too long with LLM enhancement
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Reduce batch size for faster feedback
|
||||
config = LLMConfig(batch_size=2) # Default is 5
|
||||
|
||||
# Or use pure vector for large files
|
||||
if file_size > 10000:
|
||||
use_pure_vector()
|
||||
else:
|
||||
use_llm_enhanced()
|
||||
```
|
||||
|
||||
## Example Test Queries
|
||||
|
||||
### Good for LLM-Enhanced Search
|
||||
|
||||
```python
|
||||
# Natural language, conceptual queries
|
||||
"How do I authenticate users with JWT?"
|
||||
"Validate email addresses before saving to database"
|
||||
"Secure password storage with hashing"
|
||||
"Create REST API endpoint for user registration"
|
||||
"Connect to PostgreSQL with connection pooling"
|
||||
```
|
||||
|
||||
### Good for Pure Vector Search
|
||||
|
||||
```python
|
||||
# Code-specific, pattern-matching queries
|
||||
"bcrypt.hashpw"
|
||||
"jwt.encode"
|
||||
"@app.route POST"
|
||||
"re.match email"
|
||||
"psycopg2.pool.SimpleConnectionPool"
|
||||
```
|
||||
|
||||
### Best: Combine Both
|
||||
|
||||
Use LLM-enhanced for high-level search, then pure vector for refinement:
|
||||
|
||||
```python
|
||||
# Step 1: LLM-enhanced for semantic search
|
||||
results = search_llm_enhanced("user authentication with tokens")
|
||||
# Returns: jwt_handler.py, password_hasher.py, user_endpoints.py
|
||||
|
||||
# Step 2: Pure vector for exact code pattern
|
||||
results = search_pure_vector("jwt.encode")
|
||||
# Returns: jwt_handler.py (exact match)
|
||||
```
|
||||
|
||||
## Future Improvements
|
||||
|
||||
- [ ] CLI integration for `--llm-enhanced` flag
|
||||
- [ ] Incremental LLM summary updates
|
||||
- [ ] Caching LLM summaries to reduce API calls
|
||||
- [ ] Hybrid search combining both approaches
|
||||
- [ ] Custom prompt templates for specific domains
|
||||
- [ ] Local LLM support (ollama, llama.cpp)
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- `PURE_VECTOR_SEARCH_GUIDE.md` - Pure vector search usage
|
||||
- `IMPLEMENTATION_SUMMARY.md` - Technical implementation details
|
||||
- `scripts/compare_search_methods.py` - Comparison test script
|
||||
- `tests/test_llm_enhanced_search.py` - Test suite
|
||||
|
||||
## References
|
||||
|
||||
- **LLM Enhancer Implementation**: `src/codexlens/semantic/llm_enhancer.py`
|
||||
- **CCW CLI Documentation**: https://github.com/anthropics/ccw
|
||||
- **Fastembed**: https://github.com/qdrant/fastembed
|
||||
|
||||
---
|
||||
|
||||
**Questions?** Run the comparison script to see LLM enhancement in action:
|
||||
```bash
|
||||
python scripts/compare_search_methods.py
|
||||
```
|
||||
Reference in New Issue
Block a user