Files
Claude-Code-Workflow/codex-lens/docs/MIGRATION_005_SUMMARY.md
catlog22 df23975a0b Add comprehensive tests for schema cleanup migration and search comparison
- Implement tests for migration 005 to verify removal of deprecated fields in the database schema.
- Ensure that new databases are created with a clean schema.
- Validate that keywords are correctly extracted from the normalized file_keywords table.
- Test symbol insertion without deprecated fields and subdir operations without direct_files.
- Create a detailed search comparison test to evaluate vector search vs hybrid search performance.
- Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality.
- Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2025-12-16 19:27:05 +08:00

221 lines
7.4 KiB
Markdown

# Migration 005: Database Schema Cleanup
## Overview
Migration 005 removes four unused and redundant database fields identified through Gemini analysis. This cleanup improves database efficiency, reduces schema complexity, and eliminates potential data consistency issues.
## Schema Version
- **Previous Version**: 4
- **New Version**: 5
## Changes Summary
### 1. Removed `semantic_metadata.keywords` Column
**Reason**: Deprecated - replaced by normalized `file_keywords` table in migration 001.
**Impact**:
- Keywords are now exclusively read from the normalized `file_keywords` table
- Prevents data sync issues between JSON column and normalized tables
- No data loss - migration 001 already populated `file_keywords` table
**Modified Code**:
- `get_semantic_metadata()`: Now reads keywords from `file_keywords` JOIN
- `list_semantic_metadata()`: Updated to query `file_keywords` for each result
- `add_semantic_metadata()`: Stopped writing to `keywords` column (only writes to `file_keywords`)
### 2. Removed `symbols.token_count` Column
**Reason**: Unused - always NULL, never populated.
**Impact**:
- No data loss (column was never used)
- Reduces symbols table size
- Simplifies symbol insertion logic
**Modified Code**:
- `add_file()`: Removed `token_count` from INSERT statements
- `update_file_symbols()`: Removed `token_count` from INSERT statements
- Schema creation: No longer creates `token_count` column
### 3. Removed `symbols.symbol_type` Column
**Reason**: Redundant - duplicates `symbols.kind` field.
**Impact**:
- No data loss (information preserved in `kind` column)
- Reduces symbols table size
- Eliminates redundant data storage
**Modified Code**:
- `add_file()`: Removed `symbol_type` from INSERT statements
- `update_file_symbols()`: Removed `symbol_type` from INSERT statements
- Schema creation: No longer creates `symbol_type` column
- Removed `idx_symbols_type` index
### 4. Removed `subdirs.direct_files` Column
**Reason**: Unused - never displayed or queried in application logic.
**Impact**:
- No data loss (column was never used)
- Reduces subdirs table size
- Simplifies subdirectory registration
**Modified Code**:
- `register_subdir()`: Parameter kept for backward compatibility but ignored
- `update_subdir_stats()`: Parameter kept for backward compatibility but ignored
- `get_subdirs()`: No longer retrieves `direct_files`
- `get_subdir()`: No longer retrieves `direct_files`
- `SubdirLink` dataclass: Removed `direct_files` field
## Migration Process
### Automatic Migration (v4 → v5)
When an existing database (version 4) is opened:
1. **Transaction begins**
2. **Step 1**: Recreate `semantic_metadata` table without `keywords` column
- Data copied from old table (excluding `keywords`)
- Old table dropped, new table renamed
3. **Step 2**: Recreate `symbols` table without `token_count` and `symbol_type`
- Data copied from old table (excluding removed columns)
- Old table dropped, new table renamed
- Indexes recreated (excluding `idx_symbols_type`)
4. **Step 3**: Recreate `subdirs` table without `direct_files`
- Data copied from old table (excluding `direct_files`)
- Old table dropped, new table renamed
5. **Transaction committed**
6. **VACUUM** runs to reclaim space (non-critical, continues if fails)
### New Database Creation (v5)
New databases are created directly with the clean schema (no migration needed).
## Benefits
1. **Reduced Database Size**: Removed 4 unused columns across 3 tables
2. **Improved Data Consistency**: Single source of truth for keywords (normalized tables)
3. **Simpler Code**: Less maintenance burden for unused fields
4. **Better Performance**: Smaller table sizes, fewer indexes to maintain
5. **Cleaner Schema**: Easier to understand and maintain
## Backward Compatibility
### API Compatibility
All public APIs remain backward compatible:
- `register_subdir()` and `update_subdir_stats()` still accept `direct_files` parameter (ignored)
- `SubdirLink` dataclass no longer has `direct_files` attribute (breaking change for direct dataclass access)
### Database Compatibility
- **v4 databases**: Automatically migrated to v5 on first access
- **v5 databases**: No migration needed
- **Older databases (v0-v3)**: Migrate through chain (v0→v2→v4→v5)
## Testing
Comprehensive test suite added: `tests/test_schema_cleanup_migration.py`
**Test Coverage**:
- ✅ Migration from v4 to v5
- ✅ New database creation with clean schema
- ✅ Semantic metadata keywords read from normalized table
- ✅ Symbols insert without deprecated fields
- ✅ Subdir operations without `direct_files`
**Test Results**: All 5 tests passing
## Verification
To verify migration success:
```python
from codexlens.storage.dir_index import DirIndexStore
store = DirIndexStore("path/to/_index.db")
store.initialize()
# Check schema version
conn = store._get_connection()
version = conn.execute("PRAGMA user_version").fetchone()[0]
assert version == 5
# Check columns removed
cursor = conn.execute("PRAGMA table_info(semantic_metadata)")
columns = {row[1] for row in cursor.fetchall()}
assert "keywords" not in columns
cursor = conn.execute("PRAGMA table_info(symbols)")
columns = {row[1] for row in cursor.fetchall()}
assert "token_count" not in columns
assert "symbol_type" not in columns
cursor = conn.execute("PRAGMA table_info(subdirs)")
columns = {row[1] for row in cursor.fetchall()}
assert "direct_files" not in columns
store.close()
```
## Performance Impact
**Expected Improvements**:
- Database size reduction: ~10-15% (varies by data)
- VACUUM reclaims space immediately after migration
- Slightly faster queries (smaller tables, fewer indexes)
## Rollback
Migration 005 is **one-way** (no downgrade function). Removed fields contain:
- `keywords`: Already migrated to normalized tables (migration 001)
- `token_count`: Always NULL (no data)
- `symbol_type`: Duplicate of `kind` (no data loss)
- `direct_files`: Never used (no data)
If rollback is needed, restore from backup before running migration.
## Files Modified
1. **Migration File**:
- `src/codexlens/storage/migrations/migration_005_cleanup_unused_fields.py` (NEW)
2. **Core Storage**:
- `src/codexlens/storage/dir_index.py`:
- Updated `SCHEMA_VERSION` to 5
- Added migration 005 to `_apply_migrations()`
- Updated `get_semantic_metadata()` to read from `file_keywords`
- Updated `list_semantic_metadata()` to read from `file_keywords`
- Updated `add_semantic_metadata()` to not write `keywords` column
- Updated `add_file()` to not write `token_count`/`symbol_type`
- Updated `update_file_symbols()` to not write `token_count`/`symbol_type`
- Updated `register_subdir()` to not write `direct_files`
- Updated `update_subdir_stats()` to not write `direct_files`
- Updated `get_subdirs()` to not read `direct_files`
- Updated `get_subdir()` to not read `direct_files`
- Updated `SubdirLink` dataclass to remove `direct_files`
- Updated `_create_schema()` to create v5 schema directly
3. **Tests**:
- `tests/test_schema_cleanup_migration.py` (NEW)
## Deployment Checklist
- [x] Migration script created and tested
- [x] Schema version updated to 5
- [x] All code updated to use new schema
- [x] Comprehensive tests added
- [x] Existing tests pass
- [x] Documentation updated
- [x] Backward compatibility verified
## References
- Original Analysis: Gemini code review identified unused/redundant fields
- Migration Pattern: Follows SQLite best practices (table recreation)
- Previous Migrations: 001 (keywords normalization), 004 (dual FTS)