# LLM-Enhanced Semantic Search Guide

**Last Updated:** 2025-12-16 | **Status:** Experimental Feature

## Overview
CodexLens supports two approaches for semantic vector search:
| Approach | Pipeline | Best For |
|---|---|---|
| Pure Vector | Code → fastembed → search | Code pattern matching, exact functionality |
| LLM-Enhanced | Code → LLM summary → fastembed → search | Natural language queries, conceptual search |
## Why LLM Enhancement?

**Problem:** Raw code embeddings don't match natural language well.

```text
Query:    "How do I hash passwords securely?"
Raw code: def hash_password(password: str) -> str: ...
Mismatch: Low semantic similarity ✗
```

**Solution:** The LLM generates a natural language summary, and the summary is embedded instead.

```text
Query:       "How do I hash passwords securely?"
LLM Summary: "Hash a password using bcrypt with specified salt rounds for secure storage"
Match:       High semantic similarity ✓
```
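The gap is easy to measure. A minimal sketch, assuming `fastembed` and `numpy` are installed (exact scores depend on the embedding model):

```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding()  # default fastembed model

query = "How do I hash passwords securely?"
raw_code = "def hash_password(password: str) -> str: ..."
summary = "Hash a password using bcrypt with specified salt rounds for secure storage"

# Embed all three texts and compare each candidate against the query
q_vec, code_vec, summary_vec = list(model.embed([query, raw_code, summary]))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"query vs raw code:    {cosine(q_vec, code_vec):.3f}")
print(f"query vs LLM summary: {cosine(q_vec, summary_vec):.3f}")  # usually higher
```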
## Architecture

### Pure Vector Search Flow

```text
1. Code File
   └→ "def hash_password(password: str): ..."

2. Chunking
   └→ Split into semantic chunks (500-2000 chars)

3. Embedding (fastembed)
   └→ Generate 768-dim vector from raw code

4. Storage
   └→ Store vector in semantic_chunks table

5. Query
   └→ "How to hash passwords"
   └→ Generate query vector
   └→ Find similar vectors (cosine similarity)
```
**Pros:** Fast, no external dependencies, good for code patterns.
**Cons:** Poor semantic match for natural language queries.
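For intuition, the whole flow fits in a few lines. A minimal in-memory sketch (the real pipeline persists vectors in the `semantic_chunks` table; here we just rank two hard-coded chunks):

```python
import numpy as np
from fastembed import TextEmbedding

model = TextEmbedding()

# Steps 1-3: treat each snippet as one chunk and embed the raw code
chunks = [
    "def hash_password(password: str) -> str: ...",
    "def send_email(to: str, body: str) -> None: ...",
]
chunk_vecs = np.array(list(model.embed(chunks)))

# Step 5: embed the query and rank chunks by cosine similarity
query_vec = list(model.embed(["How to hash passwords"]))[0]
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = int(np.argmax(scores))
print(f"top result: {chunks[best]!r} (score {scores[best]:.3f})")
```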
### LLM-Enhanced Search Flow

```text
1. Code File
   └→ "def hash_password(password: str): ..."

2. LLM Analysis (Gemini/Qwen via CCW)
   └→ Generate summary: "Hash a password using bcrypt..."
   └→ Extract keywords: ["password", "hash", "bcrypt", "security"]
   └→ Identify purpose: "auth"

3. Embeddable Text Creation
   └→ Combine: summary + keywords + purpose + filename

4. Embedding (fastembed)
   └→ Generate 768-dim vector from LLM text

5. Storage
   └→ Store vector with metadata

6. Query
   └→ "How to hash passwords"
   └→ Generate query vector
   └→ Find similar vectors → Better match! ✓
```
**Pros:** Excellent semantic match for natural language.
**Cons:** Slower, requires CCW CLI and LLM access.
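Step 3 is the key difference: the LLM outputs are concatenated into one string, and that string is what gets embedded. A sketch of what such a combiner could look like (the exact format the enhancer uses may differ):

```python
def build_embeddable_text(summary: str, keywords: list[str],
                          purpose: str, filename: str) -> str:
    """Combine summary + keywords + purpose + filename (step 3 above)."""
    return "\n".join([
        summary,
        f"Keywords: {', '.join(keywords)}",
        f"Purpose: {purpose}",
        f"File: {filename}",
    ])

text = build_embeddable_text(
    summary="Hash a password using bcrypt with specified salt rounds",
    keywords=["password", "hash", "bcrypt", "security"],
    purpose="auth",
    filename="password_hasher.py",
)
# `text` is passed to fastembed in step 4 instead of the raw code.
```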
## Setup Requirements

### 1. Install Dependencies

```bash
# Install semantic search dependencies
pip install codexlens[semantic]

# Install CCW CLI for LLM enhancement
npm install -g ccw
```
### 2. Configure LLM Tools

```bash
# Set primary LLM tool (default: gemini)
export CCW_CLI_SECONDARY_TOOL=gemini

# Set fallback tool (default: qwen)
export CCW_CLI_FALLBACK_TOOL=qwen

# Configure API keys (see CCW documentation)
ccw config set gemini.apiKey YOUR_API_KEY
```
### 3. Verify Setup

```bash
# Check CCW availability
ccw --version

# Check semantic dependencies
python -c "from codexlens.semantic import SEMANTIC_AVAILABLE; print(SEMANTIC_AVAILABLE)"
```
## Running Comparison Tests

### Method 1: Standalone Script (Recommended)

```bash
# Run full comparison (pure vector + LLM-enhanced)
python scripts/compare_search_methods.py

# Use specific LLM tool
python scripts/compare_search_methods.py --tool gemini
python scripts/compare_search_methods.py --tool qwen

# Skip LLM test (only pure vector)
python scripts/compare_search_methods.py --skip-llm
```
**Output Example:**

```text
======================================================================
SEMANTIC SEARCH COMPARISON TEST
Pure Vector vs LLM-Enhanced Vector Search
======================================================================
Test dataset: 5 Python files
Test queries: 5 natural language questions

======================================================================
PURE VECTOR SEARCH (Code → fastembed)
======================================================================
Setup: 5 files, 23 chunks in 2.3s

Query                                    Top Result           Score
----------------------------------------------------------------------
✓ How do I securely hash passwords?      password_hasher.py   0.723
✗ Generate JWT token for authentication  user_endpoints.py    0.645
✓ Create new user account via API        user_endpoints.py    0.812
✓ Validate email address format          validation.py        0.756
~ Connect to PostgreSQL database         connection.py        0.689

======================================================================
LLM-ENHANCED SEARCH (Code → GEMINI → fastembed)
======================================================================
Generating LLM summaries for 5 files...
Setup: 5/5 files indexed in 8.7s

Query                                    Top Result           Score
----------------------------------------------------------------------
✓ How do I securely hash passwords?      password_hasher.py   0.891
✓ Generate JWT token for authentication  jwt_handler.py       0.867
✓ Create new user account via API        user_endpoints.py    0.923
✓ Validate email address format          validation.py        0.845
✓ Connect to PostgreSQL database         connection.py        0.801

======================================================================
COMPARISON SUMMARY
======================================================================
Query                                    Pure       LLM
----------------------------------------------------------------------
How do I securely hash passwords?        ✓ Rank 1   ✓ Rank 1
Generate JWT token for authentication    ✗ Miss     ✓ Rank 1
Create new user account via API          ✓ Rank 1   ✓ Rank 1
Validate email address format            ✓ Rank 1   ✓ Rank 1
Connect to PostgreSQL database           ~ Rank 2   ✓ Rank 1
----------------------------------------------------------------------
TOTAL SCORE                              11         15
======================================================================
ANALYSIS:
✓ LLM enhancement improves results by 36.4%
  Natural language summaries match queries better than raw code
```
### Method 2: Pytest Test Suite

```bash
# Run full test suite
pytest tests/test_llm_enhanced_search.py -v -s

# Run specific test
pytest tests/test_llm_enhanced_search.py::TestSearchComparison::test_comparison -v -s

# Skip LLM tests if CCW not available
pytest tests/test_llm_enhanced_search.py -v -s -k "not llm_enhanced"
```
## Using LLM Enhancement in Production

### Option 1: Enhanced Embeddings Generation (Recommended)
Create embeddings with LLM enhancement during indexing:
```python
from pathlib import Path

from codexlens.semantic.llm_enhancer import create_enhanced_indexer, FileData

# Create enhanced indexer (expanduser() resolves the leading "~")
indexer = create_enhanced_indexer(
    vector_store_path=Path("~/.codexlens/indexes/project/_index.db").expanduser(),
    llm_tool="gemini",
    llm_enabled=True,
)

# Prepare file data
files = [
    FileData(
        path="auth/password_hasher.py",
        content=Path("auth/password_hasher.py").read_text(),
        language="python",
    ),
    # ... more files
]

# Index with LLM enhancement
indexed_count = indexer.index_files(files)
print(f"Indexed {indexed_count} files with LLM enhancement")
```
### Option 2: CLI Integration (Coming Soon)

```bash
# Generate embeddings with LLM enhancement
codexlens embeddings-generate ~/projects/my-app --llm-enhanced --tool gemini

# Check which strategy was used
codexlens embeddings-status ~/projects/my-app --show-strategies
```

**Note:** CLI integration is planned but not yet implemented. Currently, use Option 1 (the Python API).
### Option 3: Hybrid Approach
Combine both strategies for best results:
```python
# Generate both pure and LLM-enhanced embeddings
# (generate_pure_embeddings / generate_llm_embeddings are illustrative
# placeholders for the two indexing paths described above)

# 1. Pure vector for exact code matching
generate_pure_embeddings(files)

# 2. LLM-enhanced for semantic matching
generate_llm_embeddings(files)

# Search then consults both sets and ranks by best match
```
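CodexLens does not ship a merge step for this today; a hedged sketch of one way to rank across both result sets, assuming each search returns `(path, score)` pairs:

```python
def merge_results(pure: list[tuple[str, float]],
                  enhanced: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep the best score per file across both strategies."""
    best: dict[str, float] = {}
    for path, score in [*pure, *enhanced]:
        best[path] = max(score, best.get(path, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

merged = merge_results(
    pure=[("jwt_handler.py", 0.72), ("user_endpoints.py", 0.65)],
    enhanced=[("jwt_handler.py", 0.87), ("password_hasher.py", 0.80)],
)
print(merged[0])  # ('jwt_handler.py', 0.87)
```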
## Performance Considerations

### Speed Comparison
| Approach | Indexing Time (100 files) | Query Time | Cost |
|---|---|---|---|
| Pure Vector | ~30s | ~50ms | Free |
| LLM-Enhanced | ~5-10 min | ~50ms | LLM API costs |
LLM indexing is slower because it:
- Calls an external LLM API (gemini/qwen)
- Processes files in batches (default: 5 files/batch; see the batching sketch below)
- Waits for the LLM response (~2-5s per batch)
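The batching itself is straightforward; a sketch of a helper equivalent to the default 5-files-per-batch behavior:

```python
from typing import Iterator

def batched(items: list, batch_size: int = 5) -> Iterator[list]:
    """Yield consecutive slices of `items`, `batch_size` at a time."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Each yielded batch becomes one LLM call, so fewer, larger batches
# trade latency per call against total call count.
```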
Query speed is identical because:
- Both approaches embed the query with fastembed
- Vector lookup runs at the same speed
- The only difference is what was embedded at index time
### Cost Estimation

**Gemini Flash (via CCW):**
- ~$0.10 per 1M input tokens
- Average: ~500 tokens per file
- 100 files ≈ 50K tokens ≈ $0.005 (half a cent)
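The arithmetic behind that estimate:

```python
files = 100
tokens_per_file = 500       # average input tokens per file
price_per_million = 0.10    # USD per 1M input tokens (Gemini Flash)

cost = files * tokens_per_file / 1_000_000 * price_per_million
print(f"${cost:.4f}")       # $0.0050 (half a cent)
```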
**Qwen (local):**
- Free if running locally
- Slower than Gemini Flash
## When to Use Each Approach
| Use Case | Recommendation |
|---|---|
| Code pattern search | Pure vector (e.g., "find all REST endpoints") |
| Natural language queries | LLM-enhanced (e.g., "how to authenticate users") |
| Large codebase | Pure vector first, LLM for important modules |
| Personal projects | LLM-enhanced (cost is minimal) |
| Enterprise | Hybrid approach |
## Configuration Options

### LLM Config

```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer

config = LLMConfig(
    tool="gemini",            # Primary LLM tool
    fallback_tool="qwen",     # Fallback if primary fails
    timeout_ms=300000,        # 5 minute timeout
    batch_size=5,             # Files per batch
    max_content_chars=8000,   # Max chars per file in prompt
    enabled=True,             # Enable/disable LLM
)

enhancer = LLMEnhancer(config)
```
### Environment Variables

```bash
# Override default LLM tool
export CCW_CLI_SECONDARY_TOOL=gemini

# Override fallback tool
export CCW_CLI_FALLBACK_TOOL=qwen

# Disable LLM enhancement (fall back to pure vector)
export CODEXLENS_LLM_ENABLED=false
```
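If you construct `LLMConfig` yourself, you can honor the same kill switch; a sketch (the exact lookup CodexLens performs internally may differ):

```python
import os
from codexlens.semantic.llm_enhancer import LLMConfig

# Treat any value other than "false" as enabled
llm_enabled = os.environ.get("CODEXLENS_LLM_ENABLED", "true").lower() != "false"
config = LLMConfig(tool="gemini", enabled=llm_enabled)
```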
## Troubleshooting

### Issue 1: CCW CLI Not Found

**Error:** `CCW CLI not found in PATH, LLM enhancement disabled`

**Solution:**
```bash
# Install CCW globally
npm install -g ccw

# Verify installation
ccw --version

# Check PATH
which ccw   # Unix
where ccw   # Windows
```
### Issue 2: LLM API Errors

**Error:** `LLM call failed: HTTP 429 Too Many Requests`

**Solution:**
- Reduce `batch_size` in `LLMConfig`
- Add a delay between batches (see the backoff sketch below)
- Check API quota/limits
- Try the fallback tool (qwen)
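For the delay, a generic retry-with-backoff wrapper works; this sketch is not part of the CodexLens API, and `call_llm` and the exception type are placeholders for whatever issues the request:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_backoff(call_llm: Callable[[], T],
                      retries: int = 3, base_delay: float = 2.0) -> T:
    """Retry `call_llm` with exponential backoff on transient failures."""
    for attempt in range(retries):
        try:
            return call_llm()
        except Exception:  # narrow this to the real 429/transient error type
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```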
### Issue 3: Poor LLM Summaries

**Symptom:** LLM summaries are too generic or inaccurate.

**Solution:**
- Try a different LLM tool (gemini vs qwen)
- Increase `max_content_chars` (default 8000; see the sketch below)
- Manually review and refine summaries
- Fall back to pure vector for code-heavy files
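For example, doubling the per-file context (a sketch; other fields keep their defaults):

```python
from codexlens.semantic.llm_enhancer import LLMConfig, LLMEnhancer

config = LLMConfig(tool="gemini", max_content_chars=16000)  # default is 8000
enhancer = LLMEnhancer(config)
```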
### Issue 4: Slow Indexing

**Symptom:** Indexing takes too long with LLM enhancement.

**Solution:**
```python
# Reduce batch size for faster feedback
config = LLMConfig(batch_size=2)  # Default is 5

# Or use pure vector for large files
# (use_pure_vector / use_llm_enhanced are illustrative placeholders)
if file_size > 10000:
    use_pure_vector()
else:
    use_llm_enhanced()
```
## Example Test Queries

### Good for LLM-Enhanced Search

```text
# Natural language, conceptual queries
"How do I authenticate users with JWT?"
"Validate email addresses before saving to database"
"Secure password storage with hashing"
"Create REST API endpoint for user registration"
"Connect to PostgreSQL with connection pooling"
```
### Good for Pure Vector Search

```text
# Code-specific, pattern-matching queries
"bcrypt.hashpw"
"jwt.encode"
"@app.route POST"
"re.match email"
"psycopg2.pool.SimpleConnectionPool"
```
### Best: Combine Both
Use LLM-enhanced for high-level search, then pure vector for refinement:
```python
# (search_llm_enhanced / search_pure_vector are illustrative placeholders)

# Step 1: LLM-enhanced for semantic search
results = search_llm_enhanced("user authentication with tokens")
# Returns: jwt_handler.py, password_hasher.py, user_endpoints.py

# Step 2: Pure vector for exact code pattern
results = search_pure_vector("jwt.encode")
# Returns: jwt_handler.py (exact match)
```
## Future Improvements

- CLI integration for the `--llm-enhanced` flag
- Incremental LLM summary updates
- Caching LLM summaries to reduce API calls
- Hybrid search combining both approaches
- Custom prompt templates for specific domains
- Local LLM support (ollama, llama.cpp)
## Related Documentation

- `PURE_VECTOR_SEARCH_GUIDE.md` - Pure vector search usage
- `IMPLEMENTATION_SUMMARY.md` - Technical implementation details
- `scripts/compare_search_methods.py` - Comparison test script
- `tests/test_llm_enhanced_search.py` - Test suite
## References

- LLM Enhancer Implementation: `src/codexlens/semantic/llm_enhancer.py`
- CCW CLI Documentation: https://github.com/anthropics/ccw
- Fastembed: https://github.com/qdrant/fastembed
Questions? Run the comparison script to see LLM enhancement in action:

```bash
python scripts/compare_search_methods.py
```