Add scripts for inspecting LLM summaries and testing misleading comments

- Implement `inspect_llm_summaries.py` to display LLM-generated summaries from the semantic_chunks table in the database. - Create `show_llm_analysis.py` to demonstrate LLM analysis of misleading code examples, highlighting discrepancies between comments and actual functionality. - Develop `test_misleading_comments.py` to compare pure vector search with LLM-enhanced search, focusing on the impact of misleading or missing comments on search results. - Introduce `test_llm_enhanced_search.py` to provide a test suite for evaluating the effectiveness of LLM-enhanced vector search against pure vector search. - Ensure all new scripts are integrated with the existing codebase and follow the established coding standards.
2026-02-12 02:37:45 +08:00 · 2025-12-16 20:29:28 +08:00
parent df23975a0b
commit d21066c282
14 changed files with 3170 additions and 57 deletions
--- a/codex-lens/docs/MISLEADING_COMMENTS_TEST_RESULTS.md
+++ b/codex-lens/docs/MISLEADING_COMMENTS_TEST_RESULTS.md
@@ -0,0 +1,301 @@
+# 误导性注释测试结果
+
+**测试日期**: 2025-12-16
+**测试目的**: 验证LLM增强搜索是否能克服错误/缺失的代码注释
+
+---
+
+## 📊 测试结果总结
+
+### 性能对比
+
+| 方法 | 索引时间 | 准确率 | 得分 | 结论 |
+|------|---------|--------|------|------|
+| **纯向量搜索** | 2.1秒 | 5/5 (100%) | 15/15 | ✅ 未被误导性注释影响 |
+| **LLM增强搜索** | 103.7秒 | 5/5 (100%) | 15/15 | ✅ 正确识别实际功能 |
+
+**结论**: 平局 - 两种方法都能正确处理误导性注释
+
+---
+
+## 🧪 测试数据集设计
+
+### 误导性代码样本 (5个文件)
+
+| 文件 | 错误注释 | 实际功能 | 误导程度 |
+|------|---------|---------|---------|
+| `crypto/hasher.py` | "Simple string utilities" | bcrypt密码哈希 | 高 |
+| `auth/token.py` | 无注释，模糊函数名 | JWT令牌生成 | 中 |
+| `api/handlers.py` | "Database utilities", 反向docstrings | REST API用户管理 | 极高 |
+| `utils/checker.py` | "Math calculation functions" | 邮箱地址验证 | 高 |
+| `db/pool.py` | "Email sending service" | PostgreSQL连接池 | 极高 |
+
+### 具体误导示例
+
+#### 示例 1: 完全错误的模块描述
+
+```python
+"""Email sending service."""  # 错误！
+import psycopg2  # 实际是数据库库
+from psycopg2 import pool
+
+class EmailSender:  # 错误的类名
+    """SMTP email sender with retry logic."""  # 错误！
+
+    def __init__(self, min_conn: int = 1, max_conn: int = 10):
+        """Initialize email sender."""  # 错误！
+        self.pool = psycopg2.pool.SimpleConnectionPool(...)  # 实际是DB连接池
+```
+
+**实际功能**: PostgreSQL数据库连接池管理器
+**注释声称**: SMTP邮件发送服务
+
+#### 示例 2: 反向的函数文档
+
+```python
+@app.route('/api/items', methods=['POST'])
+def create_item():
+    """Delete an existing item."""  # 完全相反！
+    data = request.get_json()
+    # 实际是创建新项目
+    return jsonify({'item_id': item_id}), 201
+```
+
+### 测试查询 (基于实际功能)
+
+| 查询 | 预期文件 | 查询难度 |
+|------|---------|---------|
+| "Hash passwords securely with bcrypt" | `crypto/hasher.py` | 高 - 注释说string utils |
+| "Generate JWT authentication token" | `auth/token.py` | 中 - 无注释 |
+| "Create user account REST API endpoint" | `api/handlers.py` | 高 - 注释说database |
+| "Validate email address format" | `utils/checker.py` | 高 - 注释说math |
+| "PostgreSQL database connection pool" | `db/pool.py` | 极高 - 注释说email |
+
+---
+
+## 🔍 LLM分析能力验证
+
+### 直接测试: LLM如何理解误导性代码
+
+**测试代码**: `db/pool.py` (声称是"Email sending service")
+
+**Gemini分析结果**:
+
+```
+Summary: This Python module defines an `EmailSender` class that manages
+a PostgreSQL connection pool for an email sending service, using
+`psycopg2` for database interactions. It provides a context manager
+`send_email` to handle connection acquisition, transaction commitment,
+and release back to the pool.
+
+Purpose: data
+
+Keywords: psycopg2, connection pool, PostgreSQL, database, email sender,
+context manager, python, database connection, transaction
+```
+
+**分析得分**:
+- ✅ **正确识别的术语** (5/5): PostgreSQL, connection pool, database, psycopg2, database connection
+- ⚠️ **误导性术语** (2/3): email sender, email sending service (但上下文正确)
+
+**结论**: LLM正确识别了实际功能（PostgreSQL connection pool），虽然摘要开头提到了错误的module docstring，但核心描述准确。
+
+---
+
+## 💡 关键发现
+
+### 1. 为什么纯向量搜索也能工作？
+
+**原因**: 代码中的技术关键词权重高于注释
+
+```python
+# 这些强信号即使有错误注释也能正确匹配
+import bcrypt          # 强信号: 密码哈希
+import jwt             # 强信号: JWT令牌
+import psycopg2        # 强信号: PostgreSQL
+from flask import Flask, request  # 强信号: REST API
+pattern = r'^[a-zA-Z0-9._%+-]+@'  # 强信号: 邮箱验证
+```
+
+**嵌入模型的优势**:
+- 代码标识符（bcrypt, jwt, psycopg2）具有高度特异性
+- import语句权重高
+- 正则表达式模式具有语义信息
+- 框架API调用（Flask路由）提供明确上下文
+
+### 2. LLM增强的价值
+
+**LLM分析过程**:
+1. ✅ 读取代码逻辑（不仅仅是注释）
+2. ✅ 识别import语句和实际使用
+3. ✅ 理解代码流程和数据流
+4. ✅ 生成基于行为的摘要
+5. ⚠️ 部分参考错误注释（但不完全依赖）
+
+**示例对比**:
+
+| 方面 | 纯向量 | LLM增强 |
+|------|--------|---------|
+| **处理内容** | 代码 + 注释 (整体嵌入) | 代码分析 → 生成摘要 |
+| **误导性注释影响** | 低 (代码关键词权重高) | 极低 (理解代码逻辑) |
+| **自然语言查询** | 依赖代码词汇匹配 | 理解语义意图 |
+| **处理速度** | 快 (2秒) | 慢 (104秒, 52倍差) |
+
+### 3. 测试数据集的局限性
+
+**为什么两种方法都表现完美**:
+
+1. **文件数量太少** (5个文件)
+   - 没有相似功能的文件竞争
+   - 每个查询有唯一的目标文件
+
+2. **代码关键词太强**
+   - bcrypt → 唯一用于密码
+   - jwt → 唯一用于令牌
+   - Flask+@app.route → 唯一的API
+   - psycopg2 → 唯一的数据库
+
+3. **查询过于具体**
+   - "bcrypt password hashing" 直接匹配代码关键词
+   - 不是概念性或模糊查询
+
+**理想的测试场景**:
+- ❌ 5个唯一功能文件
+- ✅ 100+文件，多个相似功能模块
+- ✅ 模糊概念查询: "用户认证"而不是"bcrypt hash"
+- ✅ 没有明显关键词的业务逻辑代码
+
+---
+
+## 🎯 实际应用建议
+
+### 何时使用纯向量搜索
+
+✅ **推荐场景**:
+- 代码库有良好文档
+- 搜索代码模式和API使用
+- 已知技术栈关键词
+- 需要快速索引
+
+**示例查询**:
+- "bcrypt.hashpw usage"
+- "Flask @app.route GET method"
+- "jwt.encode algorithm"
+
+### 何时使用LLM增强搜索
+
+✅ **推荐场景**:
+- 代码库文档缺失或过时
+- 自然语言概念性查询
+- 业务逻辑搜索
+- 重视搜索准确性 > 索引速度
+
+**示例查询**:
+- "How to authenticate users?" (概念性)
+- "Payment processing workflow" (业务逻辑)
+- "Error handling for API requests" (模式搜索)
+
+### 混合策略 (推荐)
+
+| 模块类型 | 索引方式 | 原因 |
+|---------|---------|------|
+| **核心业务逻辑** | LLM增强 | 复杂逻辑，文档可能不完整 |
+| **工具函数** | 纯向量 | 代码清晰，关键词明确 |
+| **第三方集成** | 纯向量 | API调用已是最好描述 |
+| **遗留代码** | LLM增强 | 文档陈旧或缺失 |
+
+---
+
+## 📈 性能与成本
+
+### 时间成本
+
+| 操作 | 纯向量 | LLM增强 | 差异 |
+|------|--------|---------|------|
+| **索引5文件** | 2.1秒 | 103.7秒 | 49倍慢 |
+| **索引100文件** | ~42秒 | ~35分钟 | ~50倍慢 |
+| **查询速度** | ~50ms | ~50ms | 相同 |
+
+### 金钱成本 (Gemini Flash)
+
+- **价格**: $0.10 / 1M input tokens
+- **平均**: ~500 tokens / 文件
+- **100文件**: $0.005 (半分钱)
+- **1000文件**: $0.05 (5分钱)
+
+**结论**: 金钱成本可忽略，时间成本是主要考虑因素
+
+---
+
+## 🧪 测试工具
+
+### 创建的脚本
+
+1. **`scripts/test_misleading_comments.py`**
+   - 完整对比测试
+   - 支持 `--tool gemini|qwen`
+   - 支持 `--keep-db` 保存结果数据库
+
+2. **`scripts/show_llm_analysis.py`**
+   - 直接显示LLM对单个文件的分析
+   - 评估LLM是否被误导
+   - 计算正确/误导术语比例
+
+3. **`scripts/inspect_llm_summaries.py`**
+   - 检查数据库中的LLM摘要
+   - 查看metadata和keywords
+
+### 运行测试
+
+```bash
+# 完整对比测试
+python scripts/test_misleading_comments.py --tool gemini
+
+# 保存数据库用于检查
+python scripts/test_misleading_comments.py --keep-db ./results.db
+
+# 查看LLM对单个文件的分析
+python scripts/show_llm_analysis.py
+
+# 检查数据库中的摘要
+python scripts/inspect_llm_summaries.py results.db
+```
+
+---
+
+## 📝 结论
+
+### 测试结论
+
+1. ✅ **LLM能够克服误导性注释**
+   - 正确识别实际代码功能
+   - 生成基于行为的准确摘要
+   - 不完全依赖文档字符串
+
+2. ✅ **纯向量搜索也具有抗干扰能力**
+   - 代码关键词提供强信号
+   - 技术栈名称具有高特异性
+   - import语句和API调用信息丰富
+
+3. ⚠️ **当前测试数据集太简单**
+   - 需要更大规模测试 (100+文件)
+   - 需要概念性查询测试
+   - 需要相似功能模块对比
+
+### 生产使用建议
+
+**最佳实践**: 根据代码库特征选择策略
+
+| 代码库特征 | 推荐方案 | 理由 |
+|-----------|---------|------|
+| 良好文档，清晰命名 | 纯向量 | 快速，成本低 |
+| 文档缺失/陈旧 | LLM增强 | 理解代码逻辑 |
+| 遗留系统 | LLM增强 | 克服历史包袱 |
+| 新项目 | 纯向量 | 现代代码通常更清晰 |
+| 大型企业代码库 | 混合 | 分模块策略 |
+
+---
+
+**测试完成时间**: 2025-12-16
+**测试工具**: Gemini Flash 2.5, fastembed (BAAI/bge-small-en-v1.5)
+**文档版本**: 1.0