- Implement tests for migration 005 to verify removal of deprecated fields in the database schema. - Ensure that new databases are created with a clean schema. - Validate that keywords are correctly extracted from the normalized file_keywords table. - Test symbol insertion without deprecated fields and subdir operations without direct_files. - Create a detailed search comparison test to evaluate vector search vs hybrid search performance. - Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality. - Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
7.4 KiB
Migration 005: Database Schema Cleanup
Overview
Migration 005 removes four unused and redundant database fields identified through Gemini analysis. This cleanup improves database efficiency, reduces schema complexity, and eliminates potential data consistency issues.
Schema Version
- Previous Version: 4
- New Version: 5
Changes Summary
1. Removed semantic_metadata.keywords Column
Reason: Deprecated - replaced by normalized file_keywords table in migration 001.
Impact:
- Keywords are now exclusively read from the normalized
file_keywordstable - Prevents data sync issues between JSON column and normalized tables
- No data loss - migration 001 already populated
file_keywordstable
Modified Code:
get_semantic_metadata(): Now reads keywords fromfile_keywordsJOINlist_semantic_metadata(): Updated to queryfile_keywordsfor each resultadd_semantic_metadata(): Stopped writing tokeywordscolumn (only writes tofile_keywords)
2. Removed symbols.token_count Column
Reason: Unused - always NULL, never populated.
Impact:
- No data loss (column was never used)
- Reduces symbols table size
- Simplifies symbol insertion logic
Modified Code:
add_file(): Removedtoken_countfrom INSERT statementsupdate_file_symbols(): Removedtoken_countfrom INSERT statements- Schema creation: No longer creates
token_countcolumn
3. Removed symbols.symbol_type Column
Reason: Redundant - duplicates symbols.kind field.
Impact:
- No data loss (information preserved in
kindcolumn) - Reduces symbols table size
- Eliminates redundant data storage
Modified Code:
add_file(): Removedsymbol_typefrom INSERT statementsupdate_file_symbols(): Removedsymbol_typefrom INSERT statements- Schema creation: No longer creates
symbol_typecolumn - Removed
idx_symbols_typeindex
4. Removed subdirs.direct_files Column
Reason: Unused - never displayed or queried in application logic.
Impact:
- No data loss (column was never used)
- Reduces subdirs table size
- Simplifies subdirectory registration
Modified Code:
register_subdir(): Parameter kept for backward compatibility but ignoredupdate_subdir_stats(): Parameter kept for backward compatibility but ignoredget_subdirs(): No longer retrievesdirect_filesget_subdir(): No longer retrievesdirect_filesSubdirLinkdataclass: Removeddirect_filesfield
Migration Process
Automatic Migration (v4 → v5)
When an existing database (version 4) is opened:
- Transaction begins
- Step 1: Recreate
semantic_metadatatable withoutkeywordscolumn- Data copied from old table (excluding
keywords) - Old table dropped, new table renamed
- Data copied from old table (excluding
- Step 2: Recreate
symbolstable withouttoken_countandsymbol_type- Data copied from old table (excluding removed columns)
- Old table dropped, new table renamed
- Indexes recreated (excluding
idx_symbols_type)
- Step 3: Recreate
subdirstable withoutdirect_files- Data copied from old table (excluding
direct_files) - Old table dropped, new table renamed
- Data copied from old table (excluding
- Transaction committed
- VACUUM runs to reclaim space (non-critical, continues if fails)
New Database Creation (v5)
New databases are created directly with the clean schema (no migration needed).
Benefits
- Reduced Database Size: Removed 4 unused columns across 3 tables
- Improved Data Consistency: Single source of truth for keywords (normalized tables)
- Simpler Code: Less maintenance burden for unused fields
- Better Performance: Smaller table sizes, fewer indexes to maintain
- Cleaner Schema: Easier to understand and maintain
Backward Compatibility
API Compatibility
All public APIs remain backward compatible:
register_subdir()andupdate_subdir_stats()still acceptdirect_filesparameter (ignored)SubdirLinkdataclass no longer hasdirect_filesattribute (breaking change for direct dataclass access)
Database Compatibility
- v4 databases: Automatically migrated to v5 on first access
- v5 databases: No migration needed
- Older databases (v0-v3): Migrate through chain (v0→v2→v4→v5)
Testing
Comprehensive test suite added: tests/test_schema_cleanup_migration.py
Test Coverage:
- ✅ Migration from v4 to v5
- ✅ New database creation with clean schema
- ✅ Semantic metadata keywords read from normalized table
- ✅ Symbols insert without deprecated fields
- ✅ Subdir operations without
direct_files
Test Results: All 5 tests passing
Verification
To verify migration success:
from codexlens.storage.dir_index import DirIndexStore
store = DirIndexStore("path/to/_index.db")
store.initialize()
# Check schema version
conn = store._get_connection()
version = conn.execute("PRAGMA user_version").fetchone()[0]
assert version == 5
# Check columns removed
cursor = conn.execute("PRAGMA table_info(semantic_metadata)")
columns = {row[1] for row in cursor.fetchall()}
assert "keywords" not in columns
cursor = conn.execute("PRAGMA table_info(symbols)")
columns = {row[1] for row in cursor.fetchall()}
assert "token_count" not in columns
assert "symbol_type" not in columns
cursor = conn.execute("PRAGMA table_info(subdirs)")
columns = {row[1] for row in cursor.fetchall()}
assert "direct_files" not in columns
store.close()
Performance Impact
Expected Improvements:
- Database size reduction: ~10-15% (varies by data)
- VACUUM reclaims space immediately after migration
- Slightly faster queries (smaller tables, fewer indexes)
Rollback
Migration 005 is one-way (no downgrade function). Removed fields contain:
keywords: Already migrated to normalized tables (migration 001)token_count: Always NULL (no data)symbol_type: Duplicate ofkind(no data loss)direct_files: Never used (no data)
If rollback is needed, restore from backup before running migration.
Files Modified
-
Migration File:
src/codexlens/storage/migrations/migration_005_cleanup_unused_fields.py(NEW)
-
Core Storage:
src/codexlens/storage/dir_index.py:- Updated
SCHEMA_VERSIONto 5 - Added migration 005 to
_apply_migrations() - Updated
get_semantic_metadata()to read fromfile_keywords - Updated
list_semantic_metadata()to read fromfile_keywords - Updated
add_semantic_metadata()to not writekeywordscolumn - Updated
add_file()to not writetoken_count/symbol_type - Updated
update_file_symbols()to not writetoken_count/symbol_type - Updated
register_subdir()to not writedirect_files - Updated
update_subdir_stats()to not writedirect_files - Updated
get_subdirs()to not readdirect_files - Updated
get_subdir()to not readdirect_files - Updated
SubdirLinkdataclass to removedirect_files - Updated
_create_schema()to create v5 schema directly
- Updated
-
Tests:
tests/test_schema_cleanup_migration.py(NEW)
Deployment Checklist
- Migration script created and tested
- Schema version updated to 5
- All code updated to use new schema
- Comprehensive tests added
- Existing tests pass
- Documentation updated
- Backward compatibility verified
References
- Original Analysis: Gemini code review identified unused/redundant fields
- Migration Pattern: Follows SQLite best practices (table recreation)
- Previous Migrations: 001 (keywords normalization), 004 (dual FTS)