# LLM-Enhanced Semantic Search Guide

**Last Updated:** 2025-12-16 | **Status:** Experimental Feature

## Overview
CodexLens supports two approaches for semantic vector search:
| Approach | Pipeline | Best For |
|---|---|---|
| Pure Vector | Code → fastembed → search | Code pattern matching, exact functionality |
| LLM-Enhanced | Code → LLM summary → fastembed → search | Natural language queries, conceptual search |
## Why LLM Enhancement?

**Problem:** Raw code embeddings don't match natural language well.

```text
Query:    "How do I hash passwords securely?"
Raw code: def hash_password(password: str) -> str: ...
Mismatch: Low semantic similarity ✗
```

**Solution:** The LLM generates a natural language summary, and the summary is embedded instead.

```text
Query:       "How do I hash passwords securely?"
LLM Summary: "Hash a password using bcrypt with specified salt rounds for secure storage"
Match:       High semantic similarity ✓
```
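The gap is easy to measure. A minimal sketch, assuming `fastembed` and `numpy` are installed (exact scores depend on the embedding model):

```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding()  # default fastembed model

query = "How do I hash passwords securely?"
raw_code = "def hash_password(password: str) -> str: ..."
summary = "Hash a password using bcrypt with specified salt rounds for secure storage"

# Embed all three texts and compare each candidate against the query
q_vec, code_vec, summary_vec = list(model.embed([query, raw_code, summary]))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"query vs raw code:    {cosine(q_vec, code_vec):.3f}")
print(f"query vs LLM summary: {cosine(q_vec, summary_vec):.3f}")  # usually higher
```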
## Architecture

### Pure Vector Search Flow

```text
1. Code File
   └→ "def hash_password(password: str): ..."

2. Chunking
   └→ Split into semantic chunks (500-2000 chars)

3. Embedding (fastembed)
   └→ Generate 768-dim vector from raw code

4. Storage
   └→ Store vector in semantic_chunks table

5. Query
   └→ "How to hash passwords"
   └→ Generate query vector
   └→ Find similar vectors (cosine similarity)
```
**Pros:** Fast, no external dependencies, good for code patterns.
**Cons:** Poor semantic match for natural language queries.
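For intuition, the whole flow fits in a few lines. A minimal in-memory sketch (the real pipeline persists vectors in the `semantic_chunks` table; here we just rank two hard-coded chunks):

```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding()

# Steps 1-3: treat each snippet as one chunk and embed the raw code
chunks = [
    "def hash_password(password: str) -> str: ...",
    "def send_email(to: str, body: str) -> None: ...",
]
chunk_vecs = np.array(list(model.embed(chunks)))

# Step 5: embed the query and rank chunks by cosine similarity
query_vec = list(model.embed(["How to hash passwords"]))[0]
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = int(np.argmax(scores))
print(f"top result: {chunks[best]!r} (score {scores[best]:.3f})")
```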
### LLM-Enhanced Search Flow

```text
1. Code File
   └→ "def hash_password(password: str): ..."

2. LLM Analysis (Gemini/Qwen via CCW)
   └→ Generate summary: "Hash a password using bcrypt..."
   └→ Extract keywords: ["password", "hash", "bcrypt", "security"]
   └→ Identify purpose: "auth"

3. Embeddable Text Creation
   └→ Combine: summary + keywords + purpose + filename

4. Embedding (fastembed)
   └→ Generate 768-dim vector from LLM text

5. Storage
   └→ Store vector with metadata

6. Query
   └→ "How to hash passwords"
   └→ Generate query vector
   └→ Find similar vectors → Better match! ✓
```
**Pros:** Excellent semantic match for natural language.
**Cons:** Slower, requires CCW CLI and LLM access.
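Step 3 is the key difference: the LLM outputs are concatenated into one string, and that string is what gets embedded. A sketch of what such a combiner could look like (the exact format the enhancer uses may differ):

```python
def build_embeddable_text(summary: str, keywords: list[str],
                          purpose: str, filename: str) -> str:
    """Combine summary + keywords + purpose + filename (step 3 above)."""
    return "\n".join([
        summary,
        f"Keywords: {', '.join(keywords)}",
        f"Purpose: {purpose}",
        f"File: {filename}",
    ])

text = build_embeddable_text(
    summary="Hash a password using bcrypt with specified salt rounds",
    keywords=["password", "hash", "bcrypt", "security"],
    purpose="auth",
    filename="password_hasher.py",
)
# `text` is passed to fastembed in step 4 instead of the raw code.
```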
## Setup Requirements

### 1. Install Dependencies

```bash
# Install semantic search dependencies
pip install codexlens[semantic]

# Install CCW CLI for LLM enhancement
npm install -g ccw
```
### 2. Configure LLM Tools

```bash
# Set primary LLM tool (default: gemini)
export CCW_CLI_SECONDARY_TOOL=gemini

# Set fallback tool (default: qwen)
export CCW_CLI_FALLBACK_TOOL=qwen

# Configure API keys (see CCW documentation)
ccw config set gemini.apiKey YOUR_API_KEY
```
### 3. Verify Setup

```bash
# Check CCW availability
ccw --version

# Check semantic dependencies
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
```
## Running Comparison Tests

### Method 1: Standalone Script (Recommended)

```bash
# Run full comparison (pure vector + LLM-enhanced)
python scripts/compare_search_methods.py

# Use specific LLM tool
python scripts/compare_search_methods.py --tool gemini
python scripts/compare_search_methods.py --tool qwen

# Skip LLM test (only pure vector)
python scripts/compare_search_methods.py --skip-llm
```
**Output Example:**

```text
======================================================================
SEMANTIC SEARCH COMPARISON TEST
Pure Vector vs LLM-Enhanced Vector Search
======================================================================
Test dataset: 5 Python files
Test queries: 5 natural language questions

======================================================================
PURE VECTOR SEARCH (Code → fastembed)
======================================================================
Setup: 5 files, 23 chunks in 2.3s

Query                                    Top Result           Score
----------------------------------------------------------------------
✓ How do I securely hash passwords?      password_hasher.py   0.723
✗ Generate JWT token for authentication  user_endpoints.py    0.645
✓ Create new user account via API        user_endpoints.py    0.812
✓ Validate email address format          validation.py        0.756
~ Connect to PostgreSQL database         connection.py        0.689

======================================================================
LLM-ENHANCED SEARCH (Code → GEMINI → fastembed)
======================================================================
Generating LLM summaries for 5 files...
Setup: 5/5 files indexed in 8.7s

Query                                    Top Result           Score
----------------------------------------------------------------------
✓ How do I securely hash passwords?      password_hasher.py   0.891
✓ Generate JWT token for authentication  jwt_handler.py       0.867
✓ Create new user account via API        user_endpoints.py    0.923
✓ Validate email address format          validation.py        0.845
✓ Connect to PostgreSQL database         connection.py        0.801

======================================================================
COMPARISON SUMMARY
======================================================================
Query                                    Pure       LLM
----------------------------------------------------------------------
How do I securely hash passwords?        ✓ Rank 1   ✓ Rank 1
Generate JWT token for authentication    ✗ Miss     ✓ Rank 1
Create new user account via API          ✓ Rank 1   ✓ Rank 1
Validate email address format            ✓ Rank 1   ✓ Rank 1
Connect to PostgreSQL database           ~ Rank 2   ✓ Rank 1
----------------------------------------------------------------------
TOTAL SCORE                              11         15
======================================================================
ANALYSIS:
✓ LLM enhancement improves results by 36.4%
  Natural language summaries match queries better than raw code
```
### Method 2: Pytest Test Suite

```bash
# Run full test suite
pytest tests/test_llm_enhanced_search.py -v -s

# Run specific test
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s

# Skip LLM tests if CCW not available
pytest tests/test_llm_enhanced_search.py -v -s -k "not llm_enhanced"
```
## Using LLM Enhancement in Production

### Option 1: Enhanced Embeddings Generation (Recommended)
Create embeddings with LLM enhancement during indexing:
```python
from pathlib import Path

from codexlens.semantic.llm_enhancer import create_enhanced_indexer, FileData

# Create enhanced indexer (expanduser() resolves the leading "~")
indexer = create_enhanced_indexer(
    vector_store_path=Path("~/.codexlens/indexes/project/_index.db").expanduser(),
    llm_tool="gemini",
    llm_enabled=True,
)

# Prepare file data
files = [
    FileData(
        path="auth/password_hasher.py",
        content=Path("auth/password_hasher.py").read_text(),
        language="python",
    ),
    # ... more files
]

# Index with LLM enhancement
indexed_count = indexer.index_files(files)
print(f"Indexed {indexed_count} files with LLM enhancement")
```
### Option 2: CLI Integration (Coming Soon)

```bash
# Generate embeddings with LLM enhancement
codexlens embeddings-generate ~/projects/my-app --llm-enhanced --tool gemini

# Check which strategy was used
codexlens embeddings-status ~/projects/my-app --show-strategies
```

**Note:** CLI integration is planned but not yet implemented. Currently, use Option 1 (the Python API).
### Option 3: Hybrid Approach
Combine both strategies for best results:
```python
# Generate both pure and LLM-enhanced embeddings
# (generate_pure_embeddings / generate_llm_embeddings are illustrative
# placeholders for the two indexing paths described above)

# 1. Pure vector for exact code matching
generate_pure_embeddings(files)

# 2. LLM-enhanced for semantic matching
generate_llm_embeddings(files)

# Search then consults both sets and ranks by best match
```
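CodexLens does not ship a merge step for this today; a hedged sketch of one way to rank across both result sets, assuming each search returns `(path, score)` pairs:

```python
def merge_results(pure: list[tuple[str, float]],
                  enhanced: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep the best score per file across both strategies."""
    best: dict[str, float] = {}
    for path, score in [*pure, *enhanced]:
        best[path] = max(score, best.get(path, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

merged = merge_results(
    pure=[("jwt_handler.py", 0.72), ("user_endpoints.py", 0.65)],
    enhanced=[("jwt_handler.py", 0.87), ("password_hasher.py", 0.80)],
)
print(merged[0])  # ('jwt_handler.py', 0.87)
```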
## Performance Considerations

### Speed Comparison
| Approach | Indexing Time (100 files) | Query Time | Cost |
|---|---|---|---|
| Pure Vector | ~30s | ~50ms | Free |
| LLM-Enhanced | ~5-10 min | ~50ms | LLM API costs |
LLM indexing is slower because it:
- Calls an external LLM API (gemini/qwen)
- Processes files in batches (default: 5 files/batch; see the batching sketch below)
- Waits for the LLM response (~2-5s per batch)
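The batching itself is straightforward; a sketch of a helper equivalent to the default 5-files-per-batch behavior:

```python
from typing import Iterator

def batched(items: list, batch_size: int = 5) -> Iterator[list]:
    """Yield consecutive slices of `items`, `batch_size` at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Each yielded batch becomes one LLM call, so fewer, larger batches
# trade latency per call against total call count.
```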
Query speed is identical because:
- Both approaches embed the query with fastembed
- Vector lookup runs at the same speed
- The only difference is what was embedded at index time
### Cost Estimation

**Gemini Flash (via CCW):**
- ~$0.10 per 1M input tokens
- Average: ~500 tokens per file
- 100 files ≈ 50K tokens ≈ $0.005 (half a cent)
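The arithmetic behind that estimate:

```python
files = 100
tokens_per_file = 500       # average input tokens per file
price_per_million = 0.10    # USD per 1M input tokens (Gemini Flash)

cost = files * tokens_per_file / 1_000_000 * price_per_million
print(f"${cost:.4f}")       # $0.0050 (half a cent)
```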
**Qwen (local):**
- Free if running locally
- Slower than Gemini Flash
## When to Use Each Approach
| Use Case | Recommendation |
|---|---|
| Code pattern search | Pure vector (e.g., "find all REST endpoints") |
| Natural language queries | LLM-enhanced (e.g., "how to authenticate users") |
| Large codebase | Pure vector first, LLM for important modules |
| Personal projects | LLM-enhanced (cost is minimal) |
| Enterprise | Hybrid approach |
## Configuration Options

### LLM Config

```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer

config = LLMConfig(
    tool="gemini",            # Primary LLM tool
    fallback_tool="qwen",     # Fallback if primary fails
    timeout_ms=300000,        # 5 minute timeout
    batch_size=5,             # Files per batch
    max_content_chars=8000,   # Max chars per file in prompt
    enabled=True,             # Enable/disable LLM
)

enhancer = LLMEnhancer(config)
```
### Environment Variables

```bash
# Override default LLM tool
export CCW_CLI_SECONDARY_TOOL=gemini

# Override fallback tool
export CCW_CLI_FALLBACK_TOOL=qwen

# Disable LLM enhancement (fall back to pure vector)
export CODEXLENS_LLM_ENABLED=false
```
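If you construct `LLMConfig` yourself, you can honor the same kill switch; a sketch (the exact lookup CodexLens performs internally may differ):

```python
import os
from codexlens.semantic.llm_enhancer import LLMConfig

# Treat any value other than "false" as enabled
llm_enabled = os.environ.get("CODEXLENS_LLM_ENABLED", "true").lower() != "false"
config = LLMConfig(tool="gemini", enabled=llm_enabled)
```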
## Troubleshooting

### Issue 1: CCW CLI Not Found

**Error:** `CCW CLI not found in PATH, LLM enhancement disabled`

**Solution:**
```bash
# Install CCW globally
npm install -g ccw

# Verify installation
ccw --version

# Check PATH
which ccw   # Unix
where ccw   # Windows
```
### Issue 2: LLM API Errors

**Error:** `LLM call failed: HTTP 429 Too Many Requests`

**Solution:**
- Reduce `batch_size` in `LLMConfig`
- Add a delay between batches (see the backoff sketch below)
- Check API quota/limits
- Try the fallback tool (qwen)
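For the delay, a generic retry-with-backoff wrapper works; this sketch is not part of the CodexLens API, and `call_llm` and the exception type are placeholders for whatever issues the request:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(call_llm: Callable[[], T],
                      retries: int = 3, base_delay: float = 2.0) -> T:
    """Retry `call_llm` with exponential backoff on transient failures."""
    for attempt in range(retries):
        try:
            return call_llm()
        except Exception:  # narrow this to the real 429/transient error type
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```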
### Issue 3: Poor LLM Summaries

**Symptom:** LLM summaries are too generic or inaccurate.

**Solution:**
- Try a different LLM tool (gemini vs qwen)
- Increase `max_content_chars` (default 8000; see the sketch below)
- Manually review and refine summaries
- Fall back to pure vector for code-heavy files
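For example, doubling the per-file context (a sketch; other fields keep their defaults):

```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer

config = LLMConfig(tool="gemini", max_content_chars=16000)  # default is 8000
enhancer = LLMEnhancer(config)
```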
### Issue 4: Slow Indexing

**Symptom:** Indexing takes too long with LLM enhancement.

**Solution:**
```python
# Reduce batch size for faster feedback
config = LLMConfig(batch_size=2)  # Default is 5

# Or use pure vector for large files
# (use_pure_vector / use_llm_enhanced are illustrative placeholders)
if file_size > 10000:
    use_pure_vector()
else:
    use_llm_enhanced()
```
## Example Test Queries

### Good for LLM-Enhanced Search

```text
# Natural language, conceptual queries
"How do I authenticate users with JWT?"
"Validate email addresses before saving to database"
"Secure password storage with hashing"
"Create REST API endpoint for user registration"
"Connect to PostgreSQL with connection pooling"
```
### Good for Pure Vector Search

```text
# Code-specific, pattern-matching queries
"bcrypt.hashpw"
"jwt.encode"
"@app.route POST"
"re.match email"
"psycopg2.pool.SimpleConnectionPool"
```
### Best: Combine Both
Use LLM-enhanced for high-level search, then pure vector for refinement:
```python
# (search_llm_enhanced / search_pure_vector are illustrative placeholders)

# Step 1: LLM-enhanced for semantic search
results = search_llm_enhanced("user authentication with tokens")
# Returns: jwt_handler.py, password_hasher.py, user_endpoints.py

# Step 2: Pure vector for exact code pattern
results = search_pure_vector("jwt.encode")
# Returns: jwt_handler.py (exact match)
```
## Future Improvements

- CLI integration for the `--llm-enhanced` flag
- Incremental LLM summary updates
- Caching LLM summaries to reduce API calls
- Hybrid search combining both approaches
- Custom prompt templates for specific domains
- Local LLM support (ollama, llama.cpp)
## Related Documentation

- `PURE_VECTOR_SEARCH_GUIDE.md` - Pure vector search usage
- `IMPLEMENTATION_SUMMARY.md` - Technical implementation details
- `scripts/compare_search_methods.py` - Comparison test script
- `tests/test_llm_enhanced_search.py` - Test suite
## References

- LLM Enhancer Implementation: `src/codexlens/semantic/llm_enhancer.py`
- CCW CLI Documentation: https://github.com/anthropics/ccw
- Fastembed: https://github.com/qdrant/fastembed
Questions? Run the comparison script to see LLM enhancement in action:

```bash
python scripts/compare_search_methods.py
```