Files
Claude-Code-Workflow/codex-lens/docs/MIGRATION_005_SUMMARY.md
catlog22 df23975a0b Add comprehensive tests for schema cleanup migration and search comparison
- Implement tests for migration 005 to verify removal of deprecated fields in the database schema.
- Ensure that new databases are created with a clean schema.
- Validate that keywords are correctly extracted from the normalized file_keywords table.
- Test symbol insertion without deprecated fields and subdir operations without direct_files.
- Create a detailed search comparison test to evaluate vector search vs hybrid search performance.
- Add a script for reindexing projects to extract code relationships and verify GraphAnalyzer functionality.
- Include a test script to check TreeSitter parser availability and relationship extraction from sample files.
2025-12-16 19:27:05 +08:00

7.4 KiB

Migration 005: Database Schema Cleanup

Overview

Migration 005 removes four unused and redundant database fields identified through Gemini analysis. This cleanup improves database efficiency, reduces schema complexity, and eliminates potential data consistency issues.

Schema Version

  • Previous Version: 4
  • New Version: 5

Changes Summary

1. Removed semantic_metadata.keywords Column

Reason: Deprecated - replaced by normalized file_keywords table in migration 001.

Impact:

  • Keywords are now exclusively read from the normalized file_keywords table
  • Prevents data sync issues between JSON column and normalized tables
  • No data loss - migration 001 already populated file_keywords table

Modified Code:

  • get_semantic_metadata(): Now reads keywords from file_keywords JOIN
  • list_semantic_metadata(): Updated to query file_keywords for each result
  • add_semantic_metadata(): Stopped writing to keywords column (only writes to file_keywords)

2. Removed symbols.token_count Column

Reason: Unused - always NULL, never populated.

Impact:

  • No data loss (column was never used)
  • Reduces symbols table size
  • Simplifies symbol insertion logic

Modified Code:

  • add_file(): Removed token_count from INSERT statements
  • update_file_symbols(): Removed token_count from INSERT statements
  • Schema creation: No longer creates token_count column

3. Removed symbols.symbol_type Column

Reason: Redundant - duplicates symbols.kind field.

Impact:

  • No data loss (information preserved in kind column)
  • Reduces symbols table size
  • Eliminates redundant data storage

Modified Code:

  • add_file(): Removed symbol_type from INSERT statements
  • update_file_symbols(): Removed symbol_type from INSERT statements
  • Schema creation: No longer creates symbol_type column
  • Removed idx_symbols_type index

4. Removed subdirs.direct_files Column

Reason: Unused - never displayed or queried in application logic.

Impact:

  • No data loss (column was never used)
  • Reduces subdirs table size
  • Simplifies subdirectory registration

Modified Code:

  • register_subdir(): Parameter kept for backward compatibility but ignored
  • update_subdir_stats(): Parameter kept for backward compatibility but ignored
  • get_subdirs(): No longer retrieves direct_files
  • get_subdir(): No longer retrieves direct_files
  • SubdirLink dataclass: Removed direct_files field

Migration Process

Automatic Migration (v4 → v5)

When an existing database (version 4) is opened:

  1. Transaction begins
  2. Step 1: Recreate semantic_metadata table without keywords column
    • Data copied from old table (excluding keywords)
    • Old table dropped, new table renamed
  3. Step 2: Recreate symbols table without token_count and symbol_type
    • Data copied from old table (excluding removed columns)
    • Old table dropped, new table renamed
    • Indexes recreated (excluding idx_symbols_type)
  4. Step 3: Recreate subdirs table without direct_files
    • Data copied from old table (excluding direct_files)
    • Old table dropped, new table renamed
  5. Transaction committed
  6. VACUUM runs to reclaim space (non-critical, continues if fails)

New Database Creation (v5)

New databases are created directly with the clean schema (no migration needed).

Benefits

  1. Reduced Database Size: Removed 4 unused columns across 3 tables
  2. Improved Data Consistency: Single source of truth for keywords (normalized tables)
  3. Simpler Code: Less maintenance burden for unused fields
  4. Better Performance: Smaller table sizes, fewer indexes to maintain
  5. Cleaner Schema: Easier to understand and maintain

Backward Compatibility

API Compatibility

All public APIs remain backward compatible:

  • register_subdir() and update_subdir_stats() still accept direct_files parameter (ignored)
  • SubdirLink dataclass no longer has direct_files attribute (breaking change for direct dataclass access)

Database Compatibility

  • v4 databases: Automatically migrated to v5 on first access
  • v5 databases: No migration needed
  • Older databases (v0-v3): Migrate through chain (v0→v2→v4→v5)

Testing

Comprehensive test suite added: tests/test_schema_cleanup_migration.py

Test Coverage:

  • Migration from v4 to v5
  • New database creation with clean schema
  • Semantic metadata keywords read from normalized table
  • Symbols insert without deprecated fields
  • Subdir operations without direct_files

Test Results: All 5 tests passing

Verification

To verify migration success:

from codexlens.storage.dir_index import DirIndexStore

store = DirIndexStore("path/to/_index.db")
store.initialize()

# Check schema version
conn = store._get_connection()
version = conn.execute("PRAGMA user_version").fetchone()[0]
assert version == 5

# Check columns removed
cursor = conn.execute("PRAGMA table_info(semantic_metadata)")
columns = {row[1] for row in cursor.fetchall()}
assert "keywords" not in columns

cursor = conn.execute("PRAGMA table_info(symbols)")
columns = {row[1] for row in cursor.fetchall()}
assert "token_count" not in columns
assert "symbol_type" not in columns

cursor = conn.execute("PRAGMA table_info(subdirs)")
columns = {row[1] for row in cursor.fetchall()}
assert "direct_files" not in columns

store.close()

Performance Impact

Expected Improvements:

  • Database size reduction: ~10-15% (varies by data)
  • VACUUM reclaims space immediately after migration
  • Slightly faster queries (smaller tables, fewer indexes)

Rollback

Migration 005 is one-way (no downgrade function). Removed fields contain:

  • keywords: Already migrated to normalized tables (migration 001)
  • token_count: Always NULL (no data)
  • symbol_type: Duplicate of kind (no data loss)
  • direct_files: Never used (no data)

If rollback is needed, restore from backup before running migration.

Files Modified

  1. Migration File:

    • src/codexlens/storage/migrations/migration_005_cleanup_unused_fields.py (NEW)
  2. Core Storage:

    • src/codexlens/storage/dir_index.py:
      • Updated SCHEMA_VERSION to 5
      • Added migration 005 to _apply_migrations()
      • Updated get_semantic_metadata() to read from file_keywords
      • Updated list_semantic_metadata() to read from file_keywords
      • Updated add_semantic_metadata() to not write keywords column
      • Updated add_file() to not write token_count/symbol_type
      • Updated update_file_symbols() to not write token_count/symbol_type
      • Updated register_subdir() to not write direct_files
      • Updated update_subdir_stats() to not write direct_files
      • Updated get_subdirs() to not read direct_files
      • Updated get_subdir() to not read direct_files
      • Updated SubdirLink dataclass to remove direct_files
      • Updated _create_schema() to create v5 schema directly
  3. Tests:

    • tests/test_schema_cleanup_migration.py (NEW)

Deployment Checklist

  • Migration script created and tested
  • Schema version updated to 5
  • All code updated to use new schema
  • Comprehensive tests added
  • Existing tests pass
  • Documentation updated
  • Backward compatibility verified

References

  • Original Analysis: Gemini code review identified unused/redundant fields
  • Migration Pattern: Follows SQLite best practices (table recreation)
  • Previous Migrations: 001 (keywords normalization), 004 (dual FTS)