# LLM-Enhanced Semantic Search Guide
**Last Updated**: 2025-12-16
**Status**: Experimental Feature
---
## Overview
CodexLens supports two approaches for semantic vector search:
| Approach | Pipeline | Best For |
|----------|----------|----------|
| **Pure Vector** | Code → fastembed → search | Code pattern matching, exact functionality |
| **LLM-Enhanced** | Code → LLM summary → fastembed → search | Natural language queries, conceptual search |
### Why LLM Enhancement?
**Problem**: Raw code embeddings don't match natural language well.
```
Query: "How do I hash passwords securely?"
Raw code: def hash_password(password: str) -> str: ...
Mismatch: Low semantic similarity
```
**Solution**: LLM generates natural language summaries.
```
Query: "How do I hash passwords securely?"
LLM Summary: "Hash a password using bcrypt with specified salt rounds for secure storage"
Match: High semantic similarity ✓
```
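To make the gap concrete, you can embed all three strings and compare cosine similarities directly. This is a minimal sketch assuming fastembed's `TextEmbedding` with the 768-dim `BAAI/bge-base-en-v1.5` model (the model name is an assumption for illustration, not necessarily the one CodexLens ships with); exact scores will vary:
```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")  # assumed 768-dim model

query = "How do I hash passwords securely?"
raw_code = "def hash_password(password: str) -> str: ..."
llm_summary = "Hash a password using bcrypt with specified salt rounds for secure storage"

q, code_vec, summary_vec = list(model.embed([query, raw_code, llm_summary]))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"query vs raw code:    {cosine(q, code_vec):.3f}")     # typically lower
print(f"query vs LLM summary: {cosine(q, summary_vec):.3f}")  # typically higher
```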
## Architecture
### Pure Vector Search Flow
```
1. Code File
└→ "def hash_password(password: str): ..."
2. Chunking
└→ Split into semantic chunks (500-2000 chars)
3. Embedding (fastembed)
└→ Generate 768-dim vector from raw code
4. Storage
└→ Store vector in semantic_chunks table
5. Query
└→ "How to hash passwords"
└→ Generate query vector
└→ Find similar vectors (cosine similarity)
```
**Pros**: Fast, no external dependencies, good for code patterns
**Cons**: Poor semantic match for natural language queries
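For intuition, here is a rough end-to-end sketch of this flow under the same fastembed assumption as above. `chunk_text` is an illustrative stand-in for the actual CodexLens chunker, and results are held in memory rather than in the `semantic_chunks` table:
```python
from pathlib import Path

import numpy as np
from fastembed import TextEmbedding

def chunk_text(code: str, max_chars: int = 2000) -> list[str]:
    # Naive fixed-window chunking; CodexLens produces semantic chunks
    return [code[i:i + max_chars] for i in range(0, len(code), max_chars)]

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")

chunks = chunk_text(Path("auth/password_hasher.py").read_text())
chunk_vecs = np.array(list(model.embed(chunks)))             # one vector per chunk
query_vec = list(model.embed(["How to hash passwords"]))[0]

# Cosine similarity against every chunk, highest first
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(scores)[::-1][:3]:
    print(f"{scores[idx]:.3f}  {chunks[idx][:60]!r}")
```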
### LLM-Enhanced Search Flow
```
1. Code File
└→ "def hash_password(password: str): ..."
2. LLM Analysis (Gemini/Qwen via CCW)
└→ Generate summary: "Hash a password using bcrypt..."
└→ Extract keywords: ["password", "hash", "bcrypt", "security"]
└→ Identify purpose: "auth"
3. Embeddable Text Creation
└→ Combine: summary + keywords + purpose + filename
4. Embedding (fastembed)
└→ Generate 768-dim vector from LLM text
5. Storage
└→ Store vector with metadata
6. Query
└→ "How to hash passwords"
└→ Generate query vector
└→ Find similar vectors → Better match! ✓
```
**Pros**: Excellent semantic match for natural language
**Cons**: Slower, requires CCW CLI and LLM access
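Step 3 is the key difference: the vector is computed from natural language rather than code. A sketch of what the embeddable text might look like (the real template lives in `src/codexlens/semantic/llm_enhancer.py`; this field layout is illustrative):
```python
def build_embeddable_text(summary: str, keywords: list[str],
                          purpose: str, filename: str) -> str:
    # Plain natural language, so it embeds close to natural language queries
    return (
        f"{summary}\n"
        f"Keywords: {', '.join(keywords)}\n"
        f"Purpose: {purpose}\n"
        f"File: {filename}"
    )

text = build_embeddable_text(
    summary="Hash a password using bcrypt with specified salt rounds for secure storage",
    keywords=["password", "hash", "bcrypt", "security"],
    purpose="auth",
    filename="password_hasher.py",
)
# `text` is what gets passed to fastembed instead of the raw code
```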
## Setup Requirements
### 1. Install Dependencies
```bash
# Install semantic search dependencies
pip install codexlens[semantic]
# Install CCW CLI for LLM enhancement
npm install -g ccw
```
### 2. Configure LLM Tools
```bash
# Set primary LLM tool (default: gemini)
export CCW_CLI_SECONDARY_TOOL=gemini
# Set fallback tool (default: qwen)
export CCW_CLI_FALLBACK_TOOL=qwen
# Configure API keys (see CCW documentation)
ccw config set gemini.apiKey YOUR_API_KEY
```
### 3. Verify Setup
```bash
# Check CCW availability
ccw --version
# Check semantic dependencies
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
```
## Running Comparison Tests
### Method 1: Standalone Script (Recommended)
```bash
# Run full comparison (pure vector + LLM-enhanced)
python scripts/compare_search_methods.py
# Use specific LLM tool
python scripts/compare_search_methods.py --tool gemini
python scripts/compare_search_methods.py --tool qwen
# Skip LLM test (only pure vector)
python scripts/compare_search_methods.py --skip-llm
```
**Output Example**:
```
======================================================================
SEMANTIC SEARCH COMPARISON TEST
Pure Vector vs LLM-Enhanced Vector Search
======================================================================
Test dataset: 5 Python files
Test queries: 5 natural language questions
======================================================================
PURE VECTOR SEARCH (Code → fastembed)
======================================================================
Setup: 5 files, 23 chunks in 2.3s
Query Top Result Score
----------------------------------------------------------------------
✓ How do I securely hash passwords? password_hasher.py 0.723
✗ Generate JWT token for authentication user_endpoints.py 0.645
✓ Create new user account via API user_endpoints.py 0.812
✓ Validate email address format validation.py 0.756
~ Connect to PostgreSQL database connection.py 0.689
======================================================================
LLM-ENHANCED SEARCH (Code → GEMINI → fastembed)
======================================================================
Generating LLM summaries for 5 files...
Setup: 5/5 files indexed in 8.7s
Query Top Result Score
----------------------------------------------------------------------
✓ How do I securely hash passwords? password_hasher.py 0.891
✓ Generate JWT token for authentication jwt_handler.py 0.867
✓ Create new user account via API user_endpoints.py 0.923
✓ Validate email address format validation.py 0.845
✓ Connect to PostgreSQL database connection.py 0.801
======================================================================
COMPARISON SUMMARY
======================================================================
Query Pure LLM
----------------------------------------------------------------------
How do I securely hash passwords? ✓ Rank 1 ✓ Rank 1
Generate JWT token for authentication ✗ Miss ✓ Rank 1
Create new user account via API ✓ Rank 1 ✓ Rank 1
Validate email address format ✓ Rank 1 ✓ Rank 1
Connect to PostgreSQL database ~ Rank 2 ✓ Rank 1
----------------------------------------------------------------------
TOTAL SCORE 11 15
======================================================================
ANALYSIS:
✓ LLM enhancement improves results by 36.4%
Natural language summaries match queries better than raw code
```
### Method 2: Pytest Test Suite
```bash
# Run full test suite
pytest tests/test_llm_enhanced_search.py -v -s
# Run specific test
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s
# Skip LLM tests if CCW not available
pytest tests/test_llm_enhanced_search.py -v -s -k "not llm_enhanced"
```
## Using LLM Enhancement in Production
### Option 1: Enhanced Embeddings Generation (Recommended)
Create embeddings with LLM enhancement during indexing:
```python
from pathlib import Path
from codexlens.semantic.llm_enhancer import create_enhanced_indexer, FileData
# Create enhanced indexer
indexer = create_enhanced_indexer(
    vector_store_path=Path("~/.codexlens/indexes/project/_index.db").expanduser(),
    llm_tool="gemini",
    llm_enabled=True,
)
# Prepare file data
files = [
    FileData(
        path="auth/password_hasher.py",
        content=Path("auth/password_hasher.py").read_text(),
        language="python",
    ),
    # ... more files
]
# Index with LLM enhancement
indexed_count = indexer.index_files(files)
print(f"Indexed {indexed_count} files with LLM enhancement")
```
### Option 2: CLI Integration (Coming Soon)
```bash
# Generate embeddings with LLM enhancement
codexlens embeddings-generate ~/projects/my-app --llm-enhanced --tool gemini
# Check which strategy was used
codexlens embeddings-status ~/projects/my-app --show-strategies
```
**Note**: CLI integration is planned but not yet implemented. Until then, use Option 1 (the Python API).
### Option 3: Hybrid Approach
Combine both strategies for best results:
```python
# Generate both pure and LLM-enhanced embeddings
# 1. Pure vector for exact code matching
generate_pure_embeddings(files)
# 2. LLM-enhanced for semantic matching
generate_llm_embeddings(files)
# Search uses both and ranks by best match
```
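One simple way to implement the ranking step is to merge both result sets and keep the best score per file. A sketch, with the search callables left as placeholders for whichever search API you wire up:
```python
from typing import Callable

SearchFn = Callable[[str], list[tuple[str, float]]]  # query -> [(path, score)]

def hybrid_search(query: str, searchers: list[SearchFn],
                  top_k: int = 5) -> list[tuple[str, float]]:
    best: dict[str, float] = {}
    for search in searchers:
        for path, score in search(query):
            best[path] = max(best.get(path, 0.0), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Usage (placeholders for your two search functions):
# hybrid_search("user authentication", [search_pure_vector, search_llm_enhanced])
```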
## Performance Considerations
### Speed Comparison
| Approach | Indexing Time (100 files) | Query Time | Cost |
|----------|---------------------------|------------|------|
| Pure Vector | ~30s | ~50ms | Free |
| LLM-Enhanced | ~5-10 min | ~50ms | LLM API costs |
**LLM indexing is slower** because it:
- Calls an external LLM API (gemini/qwen)
- Processes files in batches (default: 5 files/batch)
- Waits for each LLM response (~2-5s per batch)
**Query speed is identical** because:
- Both approaches embed the query with fastembed
- Vector lookup against stored embeddings costs the same either way
- The only difference is what was embedded at indexing time
### Cost Estimation
**Gemini Flash (via CCW)**:
- ~$0.10 per 1M input tokens
- Average: ~500 tokens per file
- 100 files = ~$0.005 (half a cent)
**Qwen (local)**:
- Free if running locally
- Slower than Gemini Flash
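The arithmetic behind the Gemini estimate, using the figures above (guide estimates, not live pricing):
```python
price_per_token = 0.10 / 1_000_000  # ~$0.10 per 1M input tokens
tokens_per_file = 500               # average file size in tokens
files = 100

print(f"${files * tokens_per_file * price_per_token:.4f}")  # $0.0050, i.e. half a cent
```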
### When to Use Each Approach
| Use Case | Recommendation |
|----------|----------------|
| **Code pattern search** | Pure vector (e.g., "find all REST endpoints") |
| **Natural language queries** | LLM-enhanced (e.g., "how to authenticate users") |
| **Large codebase** | Pure vector first, LLM for important modules |
| **Personal projects** | LLM-enhanced (cost is minimal) |
| **Enterprise** | Hybrid approach |
## Configuration Options
### LLM Config
```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer
config = LLMConfig(
    tool="gemini",            # Primary LLM tool
    fallback_tool="qwen",     # Fallback if primary fails
    timeout_ms=300000,        # 5 minute timeout
    batch_size=5,             # Files per batch
    max_content_chars=8000,   # Max chars per file in prompt
    enabled=True,             # Enable/disable LLM
)
enhancer = LLMEnhancer(config)
```
### Environment Variables
```bash
# Override default LLM tool
export CCW_CLI_SECONDARY_TOOL=gemini
# Override fallback tool
export CCW_CLI_FALLBACK_TOOL=qwen
# Disable LLM enhancement (fall back to pure vector)
export CODEXLENS_LLM_ENABLED=false
```
## Troubleshooting
### Issue 1: CCW CLI Not Found
**Error**: `CCW CLI not found in PATH, LLM enhancement disabled`
**Solution**:
```bash
# Install CCW globally
npm install -g ccw
# Verify installation
ccw --version
# Check PATH
which ccw # Unix
where ccw # Windows
```
### Issue 2: LLM API Errors
**Error**: `LLM call failed: HTTP 429 Too Many Requests`
**Solution**:
- Reduce batch size in LLMConfig
- Add delay between batches
- Check API quota/limits
- Try fallback tool (qwen)
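A minimal sketch of the first two fixes, assuming the documented `LLMConfig` fields; `enhance_batch` is a hypothetical method name standing in for whichever `LLMEnhancer` call you actually make:
```python
import time

from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer

# Smaller batches spread requests out and reduce the chance of a 429
config = LLMConfig(tool="gemini", fallback_tool="qwen", batch_size=2)
enhancer = LLMEnhancer(config)

def call_with_backoff(fn, retries: int = 3, base_delay: float = 2.0):
    # Retry on failure with exponential backoff: 2s, 4s, 8s
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical call; substitute the actual LLMEnhancer method you use:
# result = call_with_backoff(lambda: enhancer.enhance_batch(files))
```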
### Issue 3: Poor LLM Summaries
**Symptom**: LLM summaries are too generic or inaccurate
**Solution**:
- Try different LLM tool (gemini vs qwen)
- Increase max_content_chars (default 8000)
- Manually review and refine summaries
- Fall back to pure vector for code-heavy files
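For example, switching tools and raising the per-file context is a one-line config change (fields as documented in the LLM Config section):
```python
# Send the LLM more of each file and try the alternate tool
config = LLMConfig(tool="qwen", max_content_chars=16000)
```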
### Issue 4: Slow Indexing
**Symptom**: Indexing takes too long with LLM enhancement
**Solution**:
```python
# Reduce batch size for faster feedback
config = LLMConfig(batch_size=2) # Default is 5
# Or use pure vector for large files
if file_size > 10000:   # use_pure_vector / use_llm_enhanced are illustrative
    use_pure_vector()
else:
    use_llm_enhanced()
```
## Example Test Queries
### Good for LLM-Enhanced Search
```python
# Natural language, conceptual queries
"How do I authenticate users with JWT?"
"Validate email addresses before saving to database"
"Secure password storage with hashing"
"Create REST API endpoint for user registration"
"Connect to PostgreSQL with connection pooling"
```
### Good for Pure Vector Search
```python
# Code-specific, pattern-matching queries
"bcrypt.hashpw"
"jwt.encode"
"@app.route POST"
"re.match email"
"psycopg2.pool.SimpleConnectionPool"
```
### Best: Combine Both
Use LLM-enhanced for high-level search, then pure vector for refinement:
```python
# Step 1: LLM-enhanced for semantic search
results = search_llm_enhanced("user authentication with tokens")
# Returns: jwt_handler.py, password_hasher.py, user_endpoints.py
# Step 2: Pure vector for exact code pattern
results = search_pure_vector("jwt.encode")
# Returns: jwt_handler.py (exact match)
```
## Future Improvements
- [ ] CLI integration for `--llm-enhanced` flag
- [ ] Incremental LLM summary updates
- [ ] Caching LLM summaries to reduce API calls
- [ ] Hybrid search combining both approaches
- [ ] Custom prompt templates for specific domains
- [ ] Local LLM support (ollama, llama.cpp)
## Related Documentation
- `PURE_VECTOR_SEARCH_GUIDE.md` - Pure vector search usage
- `IMPLEMENTATION_SUMMARY.md` - Technical implementation details
- `scripts/compare_search_methods.py` - Comparison test script
- `tests/test_llm_enhanced_search.py` - Test suite
## References
- **LLM Enhancer Implementation**: `src/codexlens/semantic/llm_enhancer.py`
- **CCW CLI Documentation**: https://github.com/anthropics/ccw
- **Fastembed**: https://github.com/qdrant/fastembed
---
**Questions?** Run the comparison script to see LLM enhancement in action:
```bash
python scripts/compare_search_methods.py
```