Refactor agent spawning and delegation check mechanisms

- Updated agent spawning from `Task()` to `Agent()` across various files to align with new standards.
- Enhanced the `code-developer` agent description to clarify its invocation context and responsibilities.
- Introduced a new `delegation-check` skill to validate command delegation prompts against agent role definitions, ensuring content separation and conflict detection.
- Established comprehensive separation rules for command delegation prompts and agent definitions, detailing ownership and conflict patterns.
- Improved documentation for command and agent design specifications to reflect the updated spawning patterns and validation processes.
This commit is contained in:
catlog22
2026-03-17 12:55:14 +08:00
parent e6255cf41a
commit bfe5426b7e
31 changed files with 3203 additions and 200 deletions

codex-lens-v2/LICENSE (new file, +21)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 codexlens-search contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

codex-lens-v2/README.md (new file, +146)

@@ -0,0 +1,146 @@
# codexlens-search
Lightweight semantic code search engine with 2-stage vector search, full-text search, and Reciprocal Rank Fusion.
## Overview
codexlens-search provides fast, accurate code search through a multi-stage retrieval pipeline:
1. **Binary coarse search** - Hamming-distance filtering narrows candidates quickly
2. **ANN fine search** - HNSW or FAISS refines the candidate set with float vectors
3. **Full-text search** - SQLite FTS5 handles exact and fuzzy keyword matching
4. **RRF fusion** - Reciprocal Rank Fusion merges vector and text results
5. **Reranking** - Optional cross-encoder or API-based reranker for final ordering
The core library has **zero required dependencies**. Install optional extras to enable semantic search, GPU acceleration, or FAISS backends.
## Installation
```bash
# Core only (FTS search, no vector search)
pip install codexlens-search

# With semantic search (recommended)
pip install "codexlens-search[semantic]"

# Semantic search + GPU acceleration
pip install "codexlens-search[semantic-gpu]"

# With FAISS backend (CPU)
pip install "codexlens-search[faiss-cpu]"

# With API-based reranker
pip install "codexlens-search[reranker-api]"

# Everything (semantic + GPU + FAISS + reranker)
pip install "codexlens-search[semantic-gpu,faiss-gpu,reranker-api]"
```
## Quick Start
```python
from codexlens_search import Config, IndexingPipeline, SearchPipeline
from codexlens_search.core import create_ann_index, create_binary_index
from codexlens_search.embed.local import FastEmbedEmbedder
from codexlens_search.rerank.local import LocalReranker
from codexlens_search.search.fts import FTSEngine
# 1. Configure
config = Config(embed_model="BAAI/bge-small-en-v1.5", embed_dim=384)
# 2. Create components
embedder = FastEmbedEmbedder(config)
binary_store = create_binary_index(config, db_path="index/binary.db")
ann_index = create_ann_index(config, index_path="index/ann.bin")
fts = FTSEngine("index/fts.db")
reranker = LocalReranker()
# 3. Index files
indexer = IndexingPipeline(embedder, binary_store, ann_index, fts, config)
stats = indexer.index_directory("./src")
print(f"Indexed {stats.files_processed} files, {stats.chunks_created} chunks")
# 4. Search
pipeline = SearchPipeline(embedder, binary_store, ann_index, reranker, fts, config)
results = pipeline.search("authentication handler", top_k=10)
for r in results:
    print(f"  {r.path} (score={r.score:.3f})")
```
## Extras
| Extra | Dependencies | Description |
|-------|-------------|-------------|
| `semantic` | hnswlib, numpy, fastembed | Vector search with local embeddings |
| `gpu` | onnxruntime-gpu | GPU-accelerated embedding inference |
| `semantic-gpu` | semantic + gpu combined | Vector search with GPU acceleration |
| `faiss-cpu` | faiss-cpu | FAISS ANN backend (CPU) |
| `faiss-gpu` | faiss-gpu | FAISS ANN backend (GPU) |
| `reranker-api` | httpx | Remote reranker API client |
| `dev` | pytest, pytest-cov | Development and testing |
## Architecture
```
Query
|
v
[Embedder] --> query vector
|
+---> [BinaryStore.coarse_search] --> candidate IDs (Hamming distance)
| |
| v
+---> [ANNIndex.fine_search] ------> ranked IDs (cosine/L2)
| |
| v (intersect)
| vector_results
|
+---> [FTSEngine.exact_search] ----> exact text matches
+---> [FTSEngine.fuzzy_search] ----> fuzzy text matches
|
v
[RRF Fusion] --> merged ranking (adaptive weights by query intent)
|
v
[Reranker] --> final top-k results
```
### Key Design Decisions
- **2-stage vector search**: Binary coarse search (fast Hamming distance on binarized vectors) filters candidates before the more expensive ANN search. This keeps memory usage low and search fast even on large corpora.
- **Parallel retrieval**: Vector search and FTS run concurrently via ThreadPoolExecutor.
- **Adaptive fusion weights**: Query intent detection adjusts RRF weights between vector and text signals.
- **Backend abstraction**: ANN index supports both hnswlib and FAISS backends via a factory function.
- **Zero core dependencies**: The base package requires only Python 3.10+. All heavy dependencies are optional.
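The binary coarse stage above can be pictured as sign-quantizing float vectors into packed bitmaps and ranking by Hamming distance. A hypothetical sketch of the idea (`binarize` and `hamming_top_k` are illustrative names, not the package's `BinaryStore` API):

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """Sign-quantize float vectors to packed uint8 bitmaps (1 bit per dim)."""
    return np.packbits(vecs > 0, axis=1)

def hamming_top_k(query_bits: np.ndarray, corpus_bits: np.ndarray,
                  k: int) -> np.ndarray:
    """Return indices of the k corpus rows nearest the query in Hamming distance."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    # popcount per row: unpack the XOR bitmap and sum the set bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 384)).astype(np.float32)
# A query close to corpus row 42 (small additive noise)
query = corpus[42] + 0.1 * rng.standard_normal(384).astype(np.float32)

cand = hamming_top_k(binarize(query[None, :]), binarize(corpus), k=200)
# The true neighbour (row 42) survives the coarse filter into the candidate set
```

Each 384-dim float vector shrinks to 48 bytes, so the coarse pass scans the whole corpus cheaply before the ANN stage touches full float vectors.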
## Configuration
The `Config` dataclass controls all pipeline parameters:
```python
from codexlens_search import Config
config = Config(
embed_model="BAAI/bge-small-en-v1.5", # embedding model name
embed_dim=384, # embedding dimension
embed_batch_size=64, # batch size for embedding
ann_backend="auto", # 'auto', 'faiss', 'hnswlib'
binary_top_k=200, # binary coarse search candidates
ann_top_k=50, # ANN fine search candidates
fts_top_k=50, # FTS results per method
device="auto", # 'auto', 'cuda', 'cpu'
)
```
## Development
```bash
git clone https://github.com/nicepkg/codexlens-search.git
cd codexlens-search
pip install -e ".[dev,semantic]"
pytest
```
## License
MIT

Binary file not shown.

Binary file not shown.


@@ -8,6 +8,26 @@ version = "0.2.0"
description = "Lightweight semantic code search engine — 2-stage vector + FTS + RRF fusion"
requires-python = ">=3.10"
dependencies = []
license = {text = "MIT"}
readme = "README.md"
authors = [
{name = "codexlens-search contributors"},
]
classifiers = [
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"License :: OSI Approved :: MIT License",
"Topic :: Software Development :: Libraries",
"Topic :: Text Processing :: Indexing",
"Operating System :: OS Independent",
]
[project.urls]
Homepage = "https://github.com/nicepkg/codexlens-search"
Repository = "https://github.com/nicepkg/codexlens-search"
[project.optional-dependencies]
semantic = [
@@ -27,10 +47,22 @@ faiss-gpu = [
reranker-api = [
"httpx>=0.25",
]
watcher = [
"watchdog>=3.0",
]
semantic-gpu = [
"hnswlib>=0.8.0",
"numpy>=1.26",
"fastembed>=0.4.0,<2.0",
"onnxruntime-gpu>=1.16",
]
dev = [
"pytest>=7.0",
"pytest-cov",
]
[project.scripts]
codexlens-search = "codexlens_search.bridge:main"
[tool.hatch.build.targets.wheel]
packages = ["src/codexlens_search"]


@@ -0,0 +1,407 @@
"""CLI bridge for ccw integration.
Argparse-based CLI with JSON output protocol.
Each subcommand outputs a single JSON object to stdout.
Watch command outputs JSONL (one JSON per line).
All errors are JSON {"error": string} to stdout with non-zero exit code.
"""
from __future__ import annotations
import argparse
import glob
import json
import logging
import os
import sys
import time
from pathlib import Path
log = logging.getLogger("codexlens_search.bridge")
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _json_output(data: dict | list) -> None:
"""Print JSON to stdout with flush."""
print(json.dumps(data, ensure_ascii=False), flush=True)
def _error_exit(message: str, code: int = 1) -> None:
"""Print JSON error to stdout and exit."""
_json_output({"error": message})
sys.exit(code)
def _resolve_db_path(args: argparse.Namespace) -> Path:
"""Return the --db-path as a resolved Path, creating parent dirs."""
db_path = Path(args.db_path).resolve()
db_path.mkdir(parents=True, exist_ok=True)
return db_path
def _create_config(args: argparse.Namespace) -> "Config":
"""Build Config from CLI args."""
from codexlens_search.config import Config
kwargs: dict = {}
if hasattr(args, "embed_model") and args.embed_model:
kwargs["embed_model"] = args.embed_model
db_path = Path(args.db_path).resolve()
kwargs["metadata_db_path"] = str(db_path / "metadata.db")
return Config(**kwargs)
def _create_pipeline(
args: argparse.Namespace,
) -> tuple:
"""Lazily construct pipeline components from CLI args.
Returns (indexing_pipeline, search_pipeline, config).
Only loads embedder/reranker models when needed.
"""
from codexlens_search.config import Config
from codexlens_search.core.factory import create_ann_index, create_binary_index
from codexlens_search.embed.local import FastEmbedEmbedder
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.indexing.pipeline import IndexingPipeline
from codexlens_search.rerank.local import FastEmbedReranker
from codexlens_search.search.fts import FTSEngine
from codexlens_search.search.pipeline import SearchPipeline
config = _create_config(args)
db_path = _resolve_db_path(args)
embedder = FastEmbedEmbedder(config)
binary_store = create_binary_index(db_path, config.embed_dim, config)
ann_index = create_ann_index(db_path, config.embed_dim, config)
fts = FTSEngine(db_path / "fts.db")
metadata = MetadataStore(db_path / "metadata.db")
reranker = FastEmbedReranker(config)
indexing = IndexingPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
fts=fts,
config=config,
metadata=metadata,
)
search = SearchPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
reranker=reranker,
fts=fts,
config=config,
metadata_store=metadata,
)
return indexing, search, config
# ---------------------------------------------------------------------------
# Subcommand handlers
# ---------------------------------------------------------------------------
def cmd_init(args: argparse.Namespace) -> None:
"""Initialize an empty index at --db-path."""
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.search.fts import FTSEngine
db_path = _resolve_db_path(args)
# Create empty stores - just touch the metadata and FTS databases
MetadataStore(db_path / "metadata.db")
FTSEngine(db_path / "fts.db")
_json_output({
"status": "initialized",
"db_path": str(db_path),
})
def cmd_search(args: argparse.Namespace) -> None:
"""Run search query, output JSON array of results."""
_, search, _ = _create_pipeline(args)
results = search.search(args.query, top_k=args.top_k)
_json_output([
{"path": r.path, "score": r.score, "snippet": r.snippet}
for r in results
])
def cmd_index_file(args: argparse.Namespace) -> None:
"""Index a single file."""
indexing, _, _ = _create_pipeline(args)
file_path = Path(args.file).resolve()
if not file_path.is_file():
_error_exit(f"File not found: {file_path}")
root = Path(args.root).resolve() if args.root else None
stats = indexing.index_file(file_path, root=root)
_json_output({
"status": "indexed",
"file": str(file_path),
"files_processed": stats.files_processed,
"chunks_created": stats.chunks_created,
"duration_seconds": stats.duration_seconds,
})
def cmd_remove_file(args: argparse.Namespace) -> None:
"""Remove a file from the index."""
indexing, _, _ = _create_pipeline(args)
indexing.remove_file(args.file)
_json_output({
"status": "removed",
"file": args.file,
})
def cmd_sync(args: argparse.Namespace) -> None:
"""Sync index with files under --root matching --glob pattern."""
indexing, _, _ = _create_pipeline(args)
root = Path(args.root).resolve()
if not root.is_dir():
_error_exit(f"Root directory not found: {root}")
pattern = args.glob or "**/*"
file_paths = [
p for p in root.glob(pattern)
if p.is_file()
]
stats = indexing.sync(file_paths, root=root)
_json_output({
"status": "synced",
"root": str(root),
"files_processed": stats.files_processed,
"chunks_created": stats.chunks_created,
"duration_seconds": stats.duration_seconds,
})
def cmd_watch(args: argparse.Namespace) -> None:
"""Watch --root for changes, output JSONL events."""
root = Path(args.root).resolve()
if not root.is_dir():
_error_exit(f"Root directory not found: {root}")
debounce_ms = args.debounce_ms
try:
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler, FileSystemEvent
except ImportError:
_error_exit(
"watchdog is required for watch mode. "
"Install with: pip install watchdog"
)
class _JsonEventHandler(FileSystemEventHandler):
"""Emit JSONL for file events."""
def _emit(self, event_type: str, path: str) -> None:
_json_output({
"event": event_type,
"path": path,
"timestamp": time.time(),
})
def on_created(self, event: FileSystemEvent) -> None:
if not event.is_directory:
self._emit("created", event.src_path)
def on_modified(self, event: FileSystemEvent) -> None:
if not event.is_directory:
self._emit("modified", event.src_path)
def on_deleted(self, event: FileSystemEvent) -> None:
if not event.is_directory:
self._emit("deleted", event.src_path)
def on_moved(self, event: FileSystemEvent) -> None:
if not event.is_directory:
self._emit("moved", event.dest_path)
observer = Observer()
observer.schedule(_JsonEventHandler(), str(root), recursive=True)
observer.start()
_json_output({
"status": "watching",
"root": str(root),
"debounce_ms": debounce_ms,
})
try:
while True:
time.sleep(debounce_ms / 1000.0)
except KeyboardInterrupt:
observer.stop()
observer.join()
def cmd_download_models(args: argparse.Namespace) -> None:
"""Download embed + reranker models."""
from codexlens_search import model_manager
config = _create_config(args)
model_manager.ensure_model(config.embed_model, config)
model_manager.ensure_model(config.reranker_model, config)
_json_output({
"status": "downloaded",
"embed_model": config.embed_model,
"reranker_model": config.reranker_model,
})
def cmd_status(args: argparse.Namespace) -> None:
"""Report index statistics."""
from codexlens_search.indexing.metadata import MetadataStore
db_path = _resolve_db_path(args)
meta_path = db_path / "metadata.db"
if not meta_path.exists():
_json_output({
"status": "not_initialized",
"db_path": str(db_path),
})
return
metadata = MetadataStore(meta_path)
all_files = metadata.get_all_files()
deleted_ids = metadata.get_deleted_ids()
max_chunk = metadata.max_chunk_id()
_json_output({
"status": "ok",
"db_path": str(db_path),
"files_tracked": len(all_files),
"max_chunk_id": max_chunk,
"total_chunks_approx": max_chunk + 1 if max_chunk >= 0 else 0,
"deleted_chunks": len(deleted_ids),
})
# ---------------------------------------------------------------------------
# CLI parser
# ---------------------------------------------------------------------------
def _build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
prog="codexlens-search",
description="Lightweight semantic code search - CLI bridge",
)
parser.add_argument(
"--db-path",
default=os.environ.get("CODEXLENS_DB_PATH", ".codexlens"),
help="Path to index database directory (default: .codexlens or $CODEXLENS_DB_PATH)",
)
parser.add_argument(
"--verbose", "-v",
action="store_true",
help="Enable debug logging to stderr",
)
sub = parser.add_subparsers(dest="command")
# init
sub.add_parser("init", help="Initialize empty index")
# search
p_search = sub.add_parser("search", help="Search the index")
p_search.add_argument("--query", "-q", required=True, help="Search query")
p_search.add_argument("--top-k", "-k", type=int, default=10, help="Number of results")
# index-file
p_index = sub.add_parser("index-file", help="Index a single file")
p_index.add_argument("--file", "-f", required=True, help="File path to index")
p_index.add_argument("--root", "-r", help="Root directory for relative paths")
# remove-file
p_remove = sub.add_parser("remove-file", help="Remove a file from index")
p_remove.add_argument("--file", "-f", required=True, help="Relative file path to remove")
# sync
p_sync = sub.add_parser("sync", help="Sync index with directory")
p_sync.add_argument("--root", "-r", required=True, help="Root directory to sync")
p_sync.add_argument("--glob", "-g", default="**/*", help="Glob pattern (default: **/*)")
# watch
p_watch = sub.add_parser("watch", help="Watch directory for changes (JSONL output)")
p_watch.add_argument("--root", "-r", required=True, help="Root directory to watch")
p_watch.add_argument("--debounce-ms", type=int, default=500, help="Debounce interval in ms")
# download-models
p_dl = sub.add_parser("download-models", help="Download embed + reranker models")
p_dl.add_argument("--embed-model", help="Override embed model name")
# status
sub.add_parser("status", help="Report index statistics")
return parser
def main() -> None:
"""CLI entry point."""
parser = _build_parser()
args = parser.parse_args()
# Configure logging
if args.verbose:
logging.basicConfig(
level=logging.DEBUG,
format="%(levelname)s %(name)s: %(message)s",
stream=sys.stderr,
)
else:
logging.basicConfig(
level=logging.WARNING,
format="%(levelname)s: %(message)s",
stream=sys.stderr,
)
if not args.command:
parser.print_help(sys.stderr)
sys.exit(1)
dispatch = {
"init": cmd_init,
"search": cmd_search,
"index-file": cmd_index_file,
"remove-file": cmd_remove_file,
"sync": cmd_sync,
"watch": cmd_watch,
"download-models": cmd_download_models,
"status": cmd_status,
}
handler = dispatch.get(args.command)
if handler is None:
_error_exit(f"Unknown command: {args.command}")
try:
handler(args)
except KeyboardInterrupt:
sys.exit(130)
except SystemExit:
raise
except Exception as exc:
log.debug("Command failed", exc_info=True)
_error_exit(str(exc))
if __name__ == "__main__":
main()


@@ -49,6 +49,9 @@ class Config:
reranker_api_model: str = ""
reranker_api_max_tokens_per_batch: int = 2048
# Metadata store
metadata_db_path: str = "" # empty = no metadata tracking
# FTS
fts_top_k: int = 50


@@ -1,5 +1,6 @@
from __future__ import annotations
from .metadata import MetadataStore
from .pipeline import IndexingPipeline, IndexStats
__all__ = ["IndexingPipeline", "IndexStats"]
__all__ = ["IndexingPipeline", "IndexStats", "MetadataStore"]


@@ -0,0 +1,165 @@
"""SQLite-backed metadata store for file-to-chunk mapping and tombstone tracking."""
from __future__ import annotations
import sqlite3
from pathlib import Path
class MetadataStore:
"""Tracks file-to-chunk mappings and deleted chunk IDs (tombstones).
Tables:
files - file_path (PK), content_hash, last_modified
chunks - chunk_id (PK), file_path (FK CASCADE), chunk_hash
deleted_chunks - chunk_id (PK) for tombstone tracking
"""
def __init__(self, db_path: str | Path) -> None:
self._conn = sqlite3.connect(str(db_path), check_same_thread=False)
self._conn.execute("PRAGMA foreign_keys = ON")
self._conn.execute("PRAGMA journal_mode = WAL")
self._create_tables()
def _create_tables(self) -> None:
self._conn.executescript("""
CREATE TABLE IF NOT EXISTS files (
file_path TEXT PRIMARY KEY,
content_hash TEXT NOT NULL,
last_modified REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS chunks (
chunk_id INTEGER PRIMARY KEY,
file_path TEXT NOT NULL,
chunk_hash TEXT NOT NULL DEFAULT '',
FOREIGN KEY (file_path) REFERENCES files(file_path) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS deleted_chunks (
chunk_id INTEGER PRIMARY KEY
);
""")
self._conn.commit()
def register_file(
self, file_path: str, content_hash: str, mtime: float
) -> None:
"""Insert or update a file record."""
self._conn.execute(
"INSERT OR REPLACE INTO files (file_path, content_hash, last_modified) "
"VALUES (?, ?, ?)",
(file_path, content_hash, mtime),
)
self._conn.commit()
def register_chunks(
self, file_path: str, chunk_ids_and_hashes: list[tuple[int, str]]
) -> None:
"""Register chunk IDs belonging to a file.
Args:
file_path: The owning file path (must already exist in files table).
chunk_ids_and_hashes: List of (chunk_id, chunk_hash) tuples.
"""
if not chunk_ids_and_hashes:
return
self._conn.executemany(
"INSERT OR REPLACE INTO chunks (chunk_id, file_path, chunk_hash) "
"VALUES (?, ?, ?)",
[(cid, file_path, chash) for cid, chash in chunk_ids_and_hashes],
)
self._conn.commit()
def mark_file_deleted(self, file_path: str) -> int:
"""Move all chunk IDs for a file to deleted_chunks, then remove the file.
Returns the number of chunks tombstoned.
"""
# Collect chunk IDs before CASCADE deletes them
rows = self._conn.execute(
"SELECT chunk_id FROM chunks WHERE file_path = ?", (file_path,)
).fetchall()
if not rows:
# Still remove the file record if it exists
self._conn.execute(
"DELETE FROM files WHERE file_path = ?", (file_path,)
)
self._conn.commit()
return 0
chunk_ids = [(r[0],) for r in rows]
self._conn.executemany(
"INSERT OR IGNORE INTO deleted_chunks (chunk_id) VALUES (?)",
chunk_ids,
)
# CASCADE deletes chunks rows automatically
self._conn.execute(
"DELETE FROM files WHERE file_path = ?", (file_path,)
)
self._conn.commit()
return len(chunk_ids)
def get_deleted_ids(self) -> set[int]:
"""Return all tombstoned chunk IDs for search-time filtering."""
rows = self._conn.execute(
"SELECT chunk_id FROM deleted_chunks"
).fetchall()
return {r[0] for r in rows}
def get_file_hash(self, file_path: str) -> str | None:
"""Return the stored content hash for a file, or None if not tracked."""
row = self._conn.execute(
"SELECT content_hash FROM files WHERE file_path = ?", (file_path,)
).fetchone()
return row[0] if row else None
def file_needs_update(self, file_path: str, content_hash: str) -> bool:
"""Check if a file needs re-indexing based on its content hash."""
stored = self.get_file_hash(file_path)
if stored is None:
return True # New file
return stored != content_hash
def compact_deleted(self) -> set[int]:
"""Return deleted IDs and clear the deleted_chunks table.
Call this after rebuilding the vector index to reclaim space.
"""
deleted = self.get_deleted_ids()
if deleted:
self._conn.execute("DELETE FROM deleted_chunks")
self._conn.commit()
return deleted
def get_chunk_ids_for_file(self, file_path: str) -> list[int]:
"""Return all chunk IDs belonging to a file."""
rows = self._conn.execute(
"SELECT chunk_id FROM chunks WHERE file_path = ?", (file_path,)
).fetchall()
return [r[0] for r in rows]
def get_all_files(self) -> dict[str, str]:
"""Return all tracked files as {file_path: content_hash}."""
rows = self._conn.execute(
"SELECT file_path, content_hash FROM files"
).fetchall()
return {r[0]: r[1] for r in rows}
def max_chunk_id(self) -> int:
"""Return the maximum chunk_id across chunks and deleted_chunks.
Returns -1 if no chunks exist, so that next_id = max_chunk_id() + 1
starts at 0 for an empty store.
"""
row = self._conn.execute(
"SELECT MAX(m) FROM ("
" SELECT MAX(chunk_id) AS m FROM chunks"
" UNION ALL"
" SELECT MAX(chunk_id) AS m FROM deleted_chunks"
")"
).fetchone()
return row[0] if row[0] is not None else -1
def close(self) -> None:
self._conn.close()


@@ -5,6 +5,7 @@ The GIL is acceptable because embedding (onnxruntime) releases it in C extension
"""
from __future__ import annotations
import hashlib
import logging
import queue
import threading
@@ -18,6 +19,7 @@ from codexlens_search.config import Config
from codexlens_search.core.binary import BinaryStore
from codexlens_search.core.index import ANNIndex
from codexlens_search.embed.base import BaseEmbedder
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.search.fts import FTSEngine
logger = logging.getLogger(__name__)
@@ -55,12 +57,14 @@ class IndexingPipeline:
ann_index: ANNIndex,
fts: FTSEngine,
config: Config,
metadata: MetadataStore | None = None,
) -> None:
self._embedder = embedder
self._binary_store = binary_store
self._ann_index = ann_index
self._fts = fts
self._config = config
self._metadata = metadata
def index_files(
self,
@@ -275,3 +279,271 @@ class IndexingPipeline:
chunks.append(("".join(current), path))
return chunks
# ------------------------------------------------------------------
# Incremental API
# ------------------------------------------------------------------
@staticmethod
def _content_hash(text: str) -> str:
"""Compute SHA-256 hex digest of file content."""
return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()
def _require_metadata(self) -> MetadataStore:
"""Return metadata store or raise if not configured."""
if self._metadata is None:
raise RuntimeError(
"MetadataStore is required for incremental indexing. "
"Pass metadata= to IndexingPipeline.__init__."
)
return self._metadata
def _next_chunk_id(self) -> int:
"""Return the next available chunk ID from MetadataStore."""
meta = self._require_metadata()
return meta.max_chunk_id() + 1
def index_file(
self,
file_path: Path,
*,
root: Path | None = None,
force: bool = False,
max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS,
chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP,
max_file_size: int = 50_000,
) -> IndexStats:
"""Index a single file incrementally.
Skips files that have not changed (same content_hash) unless
*force* is True.
Args:
file_path: Path to the file to index.
root: Optional root for computing relative path identifiers.
force: Re-index even if content hash has not changed.
max_chunk_chars: Maximum characters per chunk.
chunk_overlap: Character overlap between consecutive chunks.
max_file_size: Skip files larger than this (bytes).
Returns:
IndexStats with counts and timing.
"""
meta = self._require_metadata()
t0 = time.monotonic()
# Read file
try:
if file_path.stat().st_size > max_file_size:
logger.debug("Skipping %s: exceeds max_file_size", file_path)
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
text = file_path.read_text(encoding="utf-8", errors="replace")
except Exception as exc:
logger.debug("Skipping %s: %s", file_path, exc)
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
content_hash = self._content_hash(text)
rel_path = str(file_path.relative_to(root)) if root else str(file_path)
# Check if update is needed
if not force and not meta.file_needs_update(rel_path, content_hash):
logger.debug("Skipping %s: unchanged", rel_path)
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
# If file was previously indexed, remove old data first
if meta.get_file_hash(rel_path) is not None:
meta.mark_file_deleted(rel_path)
self._fts.delete_by_path(rel_path)
# Chunk
file_chunks = self._chunk_text(text, rel_path, max_chunk_chars, chunk_overlap)
if not file_chunks:
# Register file with no chunks
meta.register_file(rel_path, content_hash, file_path.stat().st_mtime)
return IndexStats(
files_processed=1,
duration_seconds=round(time.monotonic() - t0, 2),
)
# Assign chunk IDs
start_id = self._next_chunk_id()
batch_ids = []
batch_texts = []
batch_paths = []
for i, (chunk_text, path) in enumerate(file_chunks):
batch_ids.append(start_id + i)
batch_texts.append(chunk_text)
batch_paths.append(path)
# Embed synchronously
vecs = self._embedder.embed_batch(batch_texts)
vec_array = np.array(vecs, dtype=np.float32)
id_array = np.array(batch_ids, dtype=np.int64)
# Index: write to stores
self._binary_store.add(id_array, vec_array)
self._ann_index.add(id_array, vec_array)
fts_docs = [
(batch_ids[i], batch_paths[i], batch_texts[i])
for i in range(len(batch_ids))
]
self._fts.add_documents(fts_docs)
# Register in metadata
meta.register_file(rel_path, content_hash, file_path.stat().st_mtime)
chunk_id_hashes = [
(batch_ids[i], self._content_hash(batch_texts[i]))
for i in range(len(batch_ids))
]
meta.register_chunks(rel_path, chunk_id_hashes)
# Flush stores
self._binary_store.save()
self._ann_index.save()
duration = time.monotonic() - t0
stats = IndexStats(
files_processed=1,
chunks_created=len(batch_ids),
duration_seconds=round(duration, 2),
)
logger.info(
"Indexed file %s: %d chunks in %.2fs",
rel_path, stats.chunks_created, stats.duration_seconds,
)
return stats
def remove_file(self, file_path: str) -> None:
"""Mark a file as deleted via tombstone strategy.
Marks all chunk IDs for the file in MetadataStore.deleted_chunks
and removes the file's FTS entries.
Args:
file_path: The relative path identifier of the file to remove.
"""
meta = self._require_metadata()
count = meta.mark_file_deleted(file_path)
fts_count = self._fts.delete_by_path(file_path)
logger.info(
"Removed file %s: %d chunks tombstoned, %d FTS entries deleted",
file_path, count, fts_count,
)
def sync(
self,
file_paths: list[Path],
*,
root: Path | None = None,
max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS,
chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP,
max_file_size: int = 50_000,
) -> IndexStats:
"""Reconcile index state against a current file list.
Identifies files that are new, changed, or removed and processes
each accordingly.
Args:
file_paths: Current list of files that should be indexed.
root: Optional root for computing relative path identifiers.
max_chunk_chars: Maximum characters per chunk.
chunk_overlap: Character overlap between consecutive chunks.
max_file_size: Skip files larger than this (bytes).
Returns:
Aggregated IndexStats for all operations.
"""
meta = self._require_metadata()
t0 = time.monotonic()
# Build set of current relative paths
current_rel_paths: dict[str, Path] = {}
for fpath in file_paths:
rel = str(fpath.relative_to(root)) if root else str(fpath)
current_rel_paths[rel] = fpath
# Get known files from metadata
known_files = meta.get_all_files() # {rel_path: content_hash}
# Detect removed files
removed = set(known_files.keys()) - set(current_rel_paths.keys())
for rel in removed:
self.remove_file(rel)
# Index new and changed files
total_files = 0
total_chunks = 0
for rel, fpath in current_rel_paths.items():
stats = self.index_file(
fpath,
root=root,
max_chunk_chars=max_chunk_chars,
chunk_overlap=chunk_overlap,
max_file_size=max_file_size,
)
total_files += stats.files_processed
total_chunks += stats.chunks_created
duration = time.monotonic() - t0
result = IndexStats(
files_processed=total_files,
chunks_created=total_chunks,
duration_seconds=round(duration, 2),
)
logger.info(
"Sync complete: %d files indexed, %d chunks created, "
"%d files removed in %.1fs",
result.files_processed, result.chunks_created,
len(removed), result.duration_seconds,
)
return result
def compact(self) -> None:
"""Rebuild indexes excluding tombstoned chunk IDs.
Reads all deleted IDs from MetadataStore, rebuilds BinaryStore
and ANNIndex without those entries, then clears the
deleted_chunks table.
"""
meta = self._require_metadata()
deleted_ids = meta.compact_deleted()
if not deleted_ids:
logger.debug("Compact: no deleted IDs, nothing to do")
return
logger.info("Compact: rebuilding indexes, excluding %d deleted IDs", len(deleted_ids))
# Rebuild BinaryStore: read current data, filter, replace
if self._binary_store._count > 0:
active_ids = self._binary_store._ids[: self._binary_store._count]
active_matrix = self._binary_store._matrix[: self._binary_store._count]
mask = ~np.isin(active_ids, list(deleted_ids))
kept_ids = active_ids[mask]
kept_matrix = active_matrix[mask]
# Reset store
self._binary_store._count = 0
self._binary_store._matrix = None
self._binary_store._ids = None
if len(kept_ids) > 0:
self._binary_store._ensure_capacity(len(kept_ids))
self._binary_store._matrix[: len(kept_ids)] = kept_matrix
self._binary_store._ids[: len(kept_ids)] = kept_ids
self._binary_store._count = len(kept_ids)
self._binary_store.save()
# Rebuild ANNIndex: HNSW does not support true deletion, and the original
# float32 vectors are not cached (BinaryStore holds only the quantized
# form), so the ANN index cannot be rebuilt in place here. Instead we
# leave it untouched and rely on query-time filtering via deleted_ids;
# the BinaryStore is already compacted above. For a fully clean ANN
# state, the caller should re-run index_files() on all files after
# compact().
logger.info(
"Compact: BinaryStore rebuilt (%d entries kept). "
"Note: ANNIndex retains stale entries; run full re-index for clean ANN state.",
self._binary_store._count,
)
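The sync() above reduces to a set comparison between the metadata store's known files and the paths currently on disk; a minimal standalone sketch of that planning step (function and dict names are hypothetical — the real pipeline delegates per-file change detection to index_file()'s hash check):

```python
def plan_sync(known: dict[str, str], current: dict[str, str]) -> tuple[set[str], set[str]]:
    """Return (removed, to_index): paths gone from disk, and new/changed paths."""
    removed = set(known) - set(current)
    to_index = {path for path, digest in current.items() if known.get(path) != digest}
    return removed, to_index

removed, to_index = plan_sync(
    {"a.py": "h1", "b.py": "h2"},  # metadata view: rel_path -> content hash
    {"a.py": "h1", "c.py": "h3"},  # on disk now: b.py deleted, c.py added
)
```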
@@ -67,3 +67,28 @@ class FTSEngine:
"SELECT content FROM docs WHERE rowid = ?", (doc_id,)
).fetchone()
return row[0] if row else ""
def get_chunk_ids_by_path(self, path: str) -> list[int]:
"""Return all doc IDs associated with a given file path."""
rows = self._conn.execute(
"SELECT id FROM docs_meta WHERE path = ?", (path,)
).fetchall()
return [r[0] for r in rows]
def delete_by_path(self, path: str) -> int:
"""Delete all docs and docs_meta rows for a given file path.
Returns the number of deleted documents.
"""
ids = self.get_chunk_ids_by_path(path)
if not ids:
return 0
placeholders = ",".join("?" for _ in ids)
self._conn.execute(
f"DELETE FROM docs WHERE rowid IN ({placeholders})", ids
)
self._conn.execute(
f"DELETE FROM docs_meta WHERE id IN ({placeholders})", ids
)
self._conn.commit()
return len(ids)
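delete_by_path builds a dynamic `IN (...)` clause from bound placeholders rather than interpolating values into the SQL string; the same pattern in isolation, against a simplified in-memory table (the schema here is illustrative, not the real FTS layout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs_meta (id INTEGER PRIMARY KEY, path TEXT)")
conn.executemany(
    "INSERT INTO docs_meta VALUES (?, ?)",
    [(1, "a.py"), (2, "a.py"), (3, "b.py")],
)

# Look up all row IDs for one path, then delete them with bound parameters.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM docs_meta WHERE path = ?", ("a.py",)
).fetchall()]
if ids:  # an empty IN () would be a SQL syntax error
    placeholders = ",".join("?" for _ in ids)
    conn.execute(f"DELETE FROM docs_meta WHERE id IN ({placeholders})", ids)
remaining = [r[0] for r in conn.execute("SELECT id FROM docs_meta").fetchall()]
```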
@@ -9,6 +9,7 @@ import numpy as np
from ..config import Config
from ..core import ANNIndex, BinaryStore
from ..embed import BaseEmbedder
from ..indexing.metadata import MetadataStore
from ..rerank import BaseReranker
from .fts import FTSEngine
from .fusion import (
@@ -38,6 +39,7 @@ class SearchPipeline:
reranker: BaseReranker,
fts: FTSEngine,
config: Config,
metadata_store: MetadataStore | None = None,
) -> None:
self._embedder = embedder
self._binary_store = binary_store
@@ -45,6 +47,7 @@ class SearchPipeline:
self._reranker = reranker
self._fts = fts
self._config = config
self._metadata_store = metadata_store
# -- Helper: vector search (binary coarse + ANN fine) -----------------
@@ -137,6 +140,16 @@ class SearchPipeline:
fused = reciprocal_rank_fusion(fusion_input, weights=weights, k=cfg.fusion_k)
# 4b. Filter out deleted IDs (tombstone filtering)
if self._metadata_store is not None:
deleted_ids = self._metadata_store.get_deleted_ids()
if deleted_ids:
fused = [
(doc_id, score)
for doc_id, score in fused
if doc_id not in deleted_ids
]
# 5. Rerank top candidates
rerank_ids = [doc_id for doc_id, _ in fused[:50]]
contents = [self._fts.get_content(doc_id) for doc_id in rerank_ids]
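Step 4b's tombstone filter is a plain comprehension over the fused (doc_id, score) pairs; extracted as a standalone helper (the function name is hypothetical):

```python
def filter_tombstones(
    fused: list[tuple[int, float]], deleted_ids: set[int]
) -> list[tuple[int, float]]:
    """Drop fused results whose doc_id has been tombstoned."""
    if not deleted_ids:
        return fused  # fast path: nothing deleted, keep the list as-is
    return [(doc_id, score) for doc_id, score in fused if doc_id not in deleted_ids]
```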
@@ -0,0 +1,17 @@
"""File watcher and incremental indexer for codexlens-search.
Requires the ``watcher`` extra::
pip install codexlens-search[watcher]
"""
from codexlens_search.watcher.events import ChangeType, FileEvent, WatcherConfig
from codexlens_search.watcher.file_watcher import FileWatcher
from codexlens_search.watcher.incremental_indexer import IncrementalIndexer
__all__ = [
"ChangeType",
"FileEvent",
"FileWatcher",
"IncrementalIndexer",
"WatcherConfig",
]
@@ -0,0 +1,57 @@
"""Event types for file watcher."""
from __future__ import annotations
import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional, Set
class ChangeType(Enum):
"""Type of file system change."""
CREATED = "created"
MODIFIED = "modified"
DELETED = "deleted"
@dataclass
class FileEvent:
"""A file system change event."""
path: Path
change_type: ChangeType
timestamp: float = field(default_factory=time.time)
@dataclass
class WatcherConfig:
"""Configuration for file watcher.
Attributes:
debounce_ms: Milliseconds to wait after the last event before
flushing the batch. Default 500ms for low-latency indexing.
ignored_patterns: Directory/file name patterns to skip. Any
path component matching one of these strings is ignored.
"""
debounce_ms: int = 500
ignored_patterns: Set[str] = field(default_factory=lambda: {
# Version control
".git", ".svn", ".hg",
# Python
".venv", "venv", "env", "__pycache__", ".pytest_cache",
".mypy_cache", ".ruff_cache",
# Node.js
"node_modules", "bower_components",
# Build artifacts
"dist", "build", "out", "target", "bin", "obj",
"coverage", "htmlcov",
# IDE / Editor
".idea", ".vscode", ".vs",
# Package / cache
".cache", ".parcel-cache", ".turbo", ".next", ".nuxt",
# Logs / temp
"logs", "tmp", "temp",
})
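Each ignored pattern is matched against whole path components (via `Path.parts`), not substrings — so `dist` skips a `dist/` directory but not a file named `distutils.py`. A self-contained sketch of that check with a trimmed pattern set:

```python
from pathlib import Path

IGNORED = {".git", "node_modules", "__pycache__", "dist"}  # trimmed for brevity

def should_watch(path: Path, ignored: frozenset = frozenset(IGNORED)) -> bool:
    """True unless any whole path component matches an ignored pattern."""
    return not any(part in ignored for part in path.parts)
```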
@@ -0,0 +1,263 @@
"""File system watcher using watchdog library.
Ported from codex-lens v1 with simplifications:
- Removed v1-specific Config dependency (uses WatcherConfig directly)
- Removed MAX_QUEUE_SIZE (v2 processes immediately via debounce)
- Removed flush.signal file mechanism
- Added optional JSONL output mode for bridge CLI integration
"""
from __future__ import annotations
import json
import logging
import sys
import threading
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer
from .events import ChangeType, FileEvent, WatcherConfig
logger = logging.getLogger(__name__)
# Event priority for deduplication: higher wins when same file appears
# multiple times within one debounce window.
_EVENT_PRIORITY: Dict[ChangeType, int] = {
ChangeType.CREATED: 1,
ChangeType.MODIFIED: 2,
ChangeType.DELETED: 3,
}
class _Handler(FileSystemEventHandler):
"""Internal watchdog handler that converts events to FileEvent."""
def __init__(self, watcher: FileWatcher) -> None:
super().__init__()
self._watcher = watcher
def on_created(self, event) -> None:
if not event.is_directory:
self._watcher._on_raw_event(event.src_path, ChangeType.CREATED)
def on_modified(self, event) -> None:
if not event.is_directory:
self._watcher._on_raw_event(event.src_path, ChangeType.MODIFIED)
def on_deleted(self, event) -> None:
if not event.is_directory:
self._watcher._on_raw_event(event.src_path, ChangeType.DELETED)
def on_moved(self, event) -> None:
if event.is_directory:
return
# Treat move as delete old + create new
self._watcher._on_raw_event(event.src_path, ChangeType.DELETED)
self._watcher._on_raw_event(event.dest_path, ChangeType.CREATED)
class FileWatcher:
"""File system watcher with debounce and event deduplication.
Monitors a directory recursively using watchdog. Raw events are
collected into a queue. After *debounce_ms* of silence the queue
is flushed: events are deduplicated per-path (keeping the highest
priority change type) and delivered via *on_changes*.
Example::
def handle(events: list[FileEvent]) -> None:
for e in events:
print(e.change_type.value, e.path)
watcher = FileWatcher(Path("."), WatcherConfig(), handle)
watcher.start()
watcher.wait()
"""
def __init__(
self,
root_path: Path,
config: WatcherConfig,
on_changes: Callable[[List[FileEvent]], None],
) -> None:
self.root_path = Path(root_path).resolve()
self.config = config
self.on_changes = on_changes
self._observer: Optional[Observer] = None
self._running = False
self._stop_event = threading.Event()
self._lock = threading.RLock()
# Pending events keyed by resolved path
self._pending: Dict[Path, FileEvent] = {}
self._pending_lock = threading.Lock()
# True-debounce timer: resets on every new event
self._flush_timer: Optional[threading.Timer] = None
# ------------------------------------------------------------------
# Filtering
# ------------------------------------------------------------------
def _should_watch(self, path: Path) -> bool:
"""Return True if *path* should not be ignored."""
parts = path.parts
for pattern in self.config.ignored_patterns:
if pattern in parts:
return False
return True
# ------------------------------------------------------------------
# Event intake (called from watchdog thread)
# ------------------------------------------------------------------
def _on_raw_event(self, raw_path: str, change_type: ChangeType) -> None:
"""Accept a raw watchdog event, filter, and queue with debounce."""
path = Path(raw_path).resolve()
if not self._should_watch(path):
return
event = FileEvent(path=path, change_type=change_type)
with self._pending_lock:
existing = self._pending.get(path)
if existing is None or _EVENT_PRIORITY[change_type] >= _EVENT_PRIORITY[existing.change_type]:
self._pending[path] = event
# Cancel previous timer and start a new one (true debounce)
if self._flush_timer is not None:
self._flush_timer.cancel()
self._flush_timer = threading.Timer(
self.config.debounce_ms / 1000.0,
self._flush,
)
self._flush_timer.daemon = True
self._flush_timer.start()
# ------------------------------------------------------------------
# Flush
# ------------------------------------------------------------------
def _flush(self) -> None:
"""Deduplicate and deliver pending events."""
with self._pending_lock:
if not self._pending:
return
events = list(self._pending.values())
self._pending.clear()
self._flush_timer = None
try:
self.on_changes(events)
except Exception:
logger.exception("Error in on_changes callback")
def flush_now(self) -> None:
"""Immediately flush pending events (manual trigger)."""
with self._pending_lock:
if self._flush_timer is not None:
self._flush_timer.cancel()
self._flush_timer = None
self._flush()
# ------------------------------------------------------------------
# Lifecycle
# ------------------------------------------------------------------
def start(self) -> None:
"""Start watching the directory (non-blocking)."""
with self._lock:
if self._running:
logger.warning("Watcher already running")
return
if not self.root_path.exists():
raise ValueError(f"Root path does not exist: {self.root_path}")
self._observer = Observer()
handler = _Handler(self)
self._observer.schedule(handler, str(self.root_path), recursive=True)
self._running = True
self._stop_event.clear()
self._observer.start()
logger.info("Started watching: %s", self.root_path)
def stop(self) -> None:
"""Stop watching and flush remaining events."""
with self._lock:
if not self._running:
return
self._running = False
self._stop_event.set()
with self._pending_lock:
if self._flush_timer is not None:
self._flush_timer.cancel()
self._flush_timer = None
if self._observer is not None:
self._observer.stop()
self._observer.join(timeout=5.0)
self._observer = None
# Deliver any remaining events
self._flush()
logger.info("Stopped watching: %s", self.root_path)
def wait(self) -> None:
"""Block until stopped (Ctrl+C or stop() from another thread)."""
try:
while self._running:
self._stop_event.wait(timeout=1.0)
except KeyboardInterrupt:
logger.info("Received interrupt, stopping watcher...")
self.stop()
@property
def is_running(self) -> bool:
"""True if the watcher is currently running."""
return self._running
# ------------------------------------------------------------------
# JSONL output helper
# ------------------------------------------------------------------
@staticmethod
def events_to_jsonl(events: List[FileEvent]) -> str:
"""Serialize a batch of events as newline-delimited JSON.
Each line is a JSON object with keys: ``path``, ``change_type``,
``timestamp``. Useful for bridge CLI integration.
"""
lines: list[str] = []
for evt in events:
obj = {
"path": str(evt.path),
"change_type": evt.change_type.value,
"timestamp": evt.timestamp,
}
lines.append(json.dumps(obj, ensure_ascii=False))
return "\n".join(lines)
@staticmethod
def jsonl_callback(events: List[FileEvent]) -> None:
"""Callback that writes JSONL to stdout.
Suitable as *on_changes* when running in bridge/CLI mode::
watcher = FileWatcher(root, config, FileWatcher.jsonl_callback)
"""
output = FileWatcher.events_to_jsonl(events)
if output:
sys.stdout.write(output + "\n")
sys.stdout.flush()
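The reset-on-event timer that gives FileWatcher its true debounce can be isolated into a few lines; a minimal sketch using threading.Timer (the class name is hypothetical, and the real watcher additionally deduplicates events per path under a lock):

```python
import threading
import time

class Debouncer:
    """Collect items and flush them once after `delay` seconds of silence."""

    def __init__(self, delay: float, on_flush) -> None:
        self._delay = delay
        self._on_flush = on_flush
        self._pending: list = []
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()

    def add(self, item) -> None:
        with self._lock:
            self._pending.append(item)
            if self._timer is not None:
                self._timer.cancel()  # reset the window on every new event
            self._timer = threading.Timer(self._delay, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            items, self._pending = self._pending, []
            self._timer = None
        if items:
            self._on_flush(items)

flushes: list[list[int]] = []
deb = Debouncer(0.05, flushes.append)
for item in (1, 2, 3):  # three rapid events -> a single flush
    deb.add(item)
time.sleep(0.3)         # wait out the debounce window
```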
@@ -0,0 +1,129 @@
"""Incremental indexer that processes FileEvents via IndexingPipeline.
Ported from codex-lens v1 with simplifications:
- Uses IndexingPipeline.index_file() / remove_file() directly
- No v1-specific Config, ParserFactory, DirIndexStore dependencies
- Per-file error isolation: one failure does not stop batch processing
"""
from __future__ import annotations
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional
from codexlens_search.indexing.pipeline import IndexingPipeline
from .events import ChangeType, FileEvent
logger = logging.getLogger(__name__)
@dataclass
class BatchResult:
"""Result of processing a batch of file events."""
files_indexed: int = 0
files_removed: int = 0
chunks_created: int = 0
errors: List[str] = field(default_factory=list)
@property
def total_processed(self) -> int:
return self.files_indexed + self.files_removed
@property
def has_errors(self) -> bool:
return len(self.errors) > 0
class IncrementalIndexer:
"""Routes file change events to IndexingPipeline operations.
CREATED / MODIFIED events call ``pipeline.index_file()``.
DELETED events call ``pipeline.remove_file()``.
Each file is processed in isolation so that a single failure
does not prevent the rest of the batch from being indexed.
Example::
indexer = IncrementalIndexer(pipeline, root=Path("/project"))
result = indexer.process_events([
FileEvent(Path("src/main.py"), ChangeType.MODIFIED),
])
print(f"Indexed {result.files_indexed}, removed {result.files_removed}")
"""
def __init__(
self,
pipeline: IndexingPipeline,
*,
root: Optional[Path] = None,
) -> None:
"""Initialize the incremental indexer.
Args:
pipeline: The indexing pipeline with metadata store configured.
root: Optional project root for computing relative paths.
If None, absolute paths are used as identifiers.
"""
self._pipeline = pipeline
self._root = root
def process_events(self, events: List[FileEvent]) -> BatchResult:
"""Process a batch of file events with per-file error isolation.
Args:
events: List of file events to process.
Returns:
BatchResult with per-batch statistics.
"""
result = BatchResult()
for event in events:
try:
if event.change_type in (ChangeType.CREATED, ChangeType.MODIFIED):
self._handle_index(event, result)
elif event.change_type == ChangeType.DELETED:
self._handle_remove(event, result)
except Exception as exc:
error_msg = (
f"Error processing {event.path} "
f"({event.change_type.value}): "
f"{type(exc).__name__}: {exc}"
)
logger.error(error_msg)
result.errors.append(error_msg)
if result.total_processed > 0:
logger.info(
"Batch complete: %d indexed, %d removed, %d errors",
result.files_indexed,
result.files_removed,
len(result.errors),
)
return result
def _handle_index(self, event: FileEvent, result: BatchResult) -> None:
"""Index a created or modified file."""
stats = self._pipeline.index_file(
event.path,
root=self._root,
force=(event.change_type == ChangeType.MODIFIED),
)
if stats.files_processed > 0:
result.files_indexed += 1
result.chunks_created += stats.chunks_created
def _handle_remove(self, event: FileEvent, result: BatchResult) -> None:
"""Remove a deleted file from the index."""
rel_path = (
str(event.path.relative_to(self._root))
if self._root
else str(event.path)
)
self._pipeline.remove_file(rel_path)
result.files_removed += 1
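The per-file error isolation that process_events relies on is just a try/except around each item, accumulating failures instead of raising; the same pattern as a standalone helper (names hypothetical):

```python
def process_batch(items, handler) -> tuple[int, list[str]]:
    """Apply handler to every item; a single failure never aborts the batch."""
    processed, errors = 0, []
    for item in items:
        try:
            handler(item)
            processed += 1
        except Exception as exc:
            errors.append(f"{item}: {type(exc).__name__}: {exc}")
    return processed, errors

ok, errs = process_batch([1, 0, 2], lambda x: 1 / x)  # 1/0 raises, others succeed
```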
@@ -0,0 +1,388 @@
"""Unit tests for IndexingPipeline incremental API (index_file, remove_file, sync, compact)."""
from __future__ import annotations
from pathlib import Path
import numpy as np
import pytest
from codexlens_search.config import Config
from codexlens_search.core.binary import BinaryStore
from codexlens_search.core.index import ANNIndex
from codexlens_search.embed.base import BaseEmbedder
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.indexing.pipeline import IndexingPipeline
from codexlens_search.search.fts import FTSEngine
DIM = 32
class FakeEmbedder(BaseEmbedder):
"""Deterministic embedder for testing."""
def __init__(self) -> None:
pass
def embed_single(self, text: str) -> np.ndarray:
rng = np.random.default_rng(hash(text) % (2**31))
return rng.standard_normal(DIM).astype(np.float32)
def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
return [self.embed_single(t) for t in texts]
@pytest.fixture
def workspace(tmp_path: Path):
"""Create workspace with stores, metadata, and pipeline."""
cfg = Config.small()
# Override embed_dim to match our test dim
cfg.embed_dim = DIM
store_dir = tmp_path / "stores"
store_dir.mkdir()
binary_store = BinaryStore(store_dir, DIM, cfg)
ann_index = ANNIndex(store_dir, DIM, cfg)
fts = FTSEngine(str(store_dir / "fts.db"))
metadata = MetadataStore(str(store_dir / "metadata.db"))
embedder = FakeEmbedder()
pipeline = IndexingPipeline(
embedder=embedder,
binary_store=binary_store,
ann_index=ann_index,
fts=fts,
config=cfg,
metadata=metadata,
)
# Create sample source files
src_dir = tmp_path / "src"
src_dir.mkdir()
return {
"pipeline": pipeline,
"metadata": metadata,
"binary_store": binary_store,
"ann_index": ann_index,
"fts": fts,
"src_dir": src_dir,
"store_dir": store_dir,
"config": cfg,
}
def _write_file(src_dir: Path, name: str, content: str) -> Path:
"""Write a file and return its path."""
p = src_dir / name
p.write_text(content, encoding="utf-8")
return p
# ---------------------------------------------------------------------------
# MetadataStore helper method tests
# ---------------------------------------------------------------------------
class TestMetadataHelpers:
def test_get_all_files_empty(self, workspace):
meta = workspace["metadata"]
assert meta.get_all_files() == {}
def test_get_all_files_after_register(self, workspace):
meta = workspace["metadata"]
meta.register_file("a.py", "hash_a", 1000.0)
meta.register_file("b.py", "hash_b", 2000.0)
result = meta.get_all_files()
assert result == {"a.py": "hash_a", "b.py": "hash_b"}
def test_max_chunk_id_empty(self, workspace):
meta = workspace["metadata"]
assert meta.max_chunk_id() == -1
def test_max_chunk_id_with_chunks(self, workspace):
meta = workspace["metadata"]
meta.register_file("a.py", "hash_a", 1000.0)
meta.register_chunks("a.py", [(0, "h0"), (1, "h1"), (5, "h5")])
assert meta.max_chunk_id() == 5
def test_max_chunk_id_includes_deleted(self, workspace):
meta = workspace["metadata"]
meta.register_file("a.py", "hash_a", 1000.0)
meta.register_chunks("a.py", [(0, "h0"), (3, "h3")])
meta.mark_file_deleted("a.py")
# Chunks moved to deleted_chunks, max should still be 3
assert meta.max_chunk_id() == 3
# ---------------------------------------------------------------------------
# index_file tests
# ---------------------------------------------------------------------------
class TestIndexFile:
def test_index_file_basic(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "hello.py", "print('hello world')\n")
stats = pipeline.index_file(f, root=src_dir)
assert stats.files_processed == 1
assert stats.chunks_created >= 1
assert meta.get_file_hash("hello.py") is not None
assert len(meta.get_chunk_ids_for_file("hello.py")) >= 1
def test_index_file_skips_unchanged(self, workspace):
pipeline = workspace["pipeline"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "same.py", "x = 1\n")
stats1 = pipeline.index_file(f, root=src_dir)
assert stats1.files_processed == 1
stats2 = pipeline.index_file(f, root=src_dir)
assert stats2.files_processed == 0
assert stats2.chunks_created == 0
def test_index_file_force_reindex(self, workspace):
pipeline = workspace["pipeline"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "force.py", "x = 1\n")
pipeline.index_file(f, root=src_dir)
stats = pipeline.index_file(f, root=src_dir, force=True)
assert stats.files_processed == 1
assert stats.chunks_created >= 1
def test_index_file_updates_changed_file(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "changing.py", "version = 1\n")
pipeline.index_file(f, root=src_dir)
old_chunks = meta.get_chunk_ids_for_file("changing.py")
# Modify file
f.write_text("version = 2\nmore code\n", encoding="utf-8")
stats = pipeline.index_file(f, root=src_dir)
assert stats.files_processed == 1
new_chunks = meta.get_chunk_ids_for_file("changing.py")
# Old chunks should have been tombstoned, new ones assigned
assert set(old_chunks) != set(new_chunks)
def test_index_file_registers_in_metadata(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
fts = workspace["fts"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "meta_test.py", "def foo(): pass\n")
pipeline.index_file(f, root=src_dir)
# MetadataStore has file registered
assert meta.get_file_hash("meta_test.py") is not None
chunk_ids = meta.get_chunk_ids_for_file("meta_test.py")
assert len(chunk_ids) >= 1
# FTS has the content
fts_ids = fts.get_chunk_ids_by_path("meta_test.py")
assert len(fts_ids) >= 1
def test_index_file_no_metadata_raises(self, workspace):
cfg = workspace["config"]
pipeline_no_meta = IndexingPipeline(
embedder=FakeEmbedder(),
binary_store=workspace["binary_store"],
ann_index=workspace["ann_index"],
fts=workspace["fts"],
config=cfg,
)
f = _write_file(workspace["src_dir"], "no_meta.py", "x = 1\n")
with pytest.raises(RuntimeError, match="MetadataStore is required"):
pipeline_no_meta.index_file(f)
# ---------------------------------------------------------------------------
# remove_file tests
# ---------------------------------------------------------------------------
class TestRemoveFile:
def test_remove_file_tombstones_and_fts(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
fts = workspace["fts"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "to_remove.py", "data = [1, 2, 3]\n")
pipeline.index_file(f, root=src_dir)
chunk_ids = meta.get_chunk_ids_for_file("to_remove.py")
assert len(chunk_ids) >= 1
pipeline.remove_file("to_remove.py")
# File should be gone from metadata
assert meta.get_file_hash("to_remove.py") is None
assert meta.get_chunk_ids_for_file("to_remove.py") == []
# Chunks should be in deleted_chunks
deleted = meta.get_deleted_ids()
for cid in chunk_ids:
assert cid in deleted
# FTS should be cleared
assert fts.get_chunk_ids_by_path("to_remove.py") == []
def test_remove_nonexistent_file(self, workspace):
pipeline = workspace["pipeline"]
# Should not raise
pipeline.remove_file("nonexistent.py")
# ---------------------------------------------------------------------------
# sync tests
# ---------------------------------------------------------------------------
class TestSync:
def test_sync_indexes_new_files(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
src_dir = workspace["src_dir"]
f1 = _write_file(src_dir, "a.py", "a = 1\n")
f2 = _write_file(src_dir, "b.py", "b = 2\n")
stats = pipeline.sync([f1, f2], root=src_dir)
assert stats.files_processed == 2
assert meta.get_file_hash("a.py") is not None
assert meta.get_file_hash("b.py") is not None
def test_sync_removes_missing_files(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
src_dir = workspace["src_dir"]
f1 = _write_file(src_dir, "keep.py", "keep = True\n")
f2 = _write_file(src_dir, "remove.py", "remove = True\n")
pipeline.sync([f1, f2], root=src_dir)
assert meta.get_file_hash("remove.py") is not None
# Sync with only f1 -- f2 should be removed
pipeline.sync([f1], root=src_dir)
assert meta.get_file_hash("remove.py") is None
deleted = meta.get_deleted_ids()
assert len(deleted) > 0
def test_sync_detects_changed_files(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "mutable.py", "v1\n")
pipeline.sync([f], root=src_dir)
old_hash = meta.get_file_hash("mutable.py")
f.write_text("v2\n", encoding="utf-8")
stats = pipeline.sync([f], root=src_dir)
assert stats.files_processed == 1
new_hash = meta.get_file_hash("mutable.py")
assert old_hash != new_hash
def test_sync_skips_unchanged(self, workspace):
pipeline = workspace["pipeline"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "stable.py", "stable = True\n")
pipeline.sync([f], root=src_dir)
# Second sync with same file, unchanged
stats = pipeline.sync([f], root=src_dir)
assert stats.files_processed == 0
assert stats.chunks_created == 0
# ---------------------------------------------------------------------------
# compact tests
# ---------------------------------------------------------------------------
class TestCompact:
def test_compact_removes_tombstoned_from_binary_store(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
binary_store = workspace["binary_store"]
src_dir = workspace["src_dir"]
f1 = _write_file(src_dir, "alive.py", "alive = True\n")
f2 = _write_file(src_dir, "dead.py", "dead = True\n")
pipeline.index_file(f1, root=src_dir)
pipeline.index_file(f2, root=src_dir)
count_before = binary_store._count
assert count_before >= 2
pipeline.remove_file("dead.py")
pipeline.compact()
# BinaryStore should have fewer entries
assert binary_store._count < count_before
# deleted_chunks should be cleared
assert meta.get_deleted_ids() == set()
def test_compact_noop_when_no_deletions(self, workspace):
pipeline = workspace["pipeline"]
meta = workspace["metadata"]
binary_store = workspace["binary_store"]
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "solo.py", "solo = True\n")
pipeline.index_file(f, root=src_dir)
count_before = binary_store._count
pipeline.compact()
assert binary_store._count == count_before
# ---------------------------------------------------------------------------
# Backward compatibility: existing batch API still works
# ---------------------------------------------------------------------------
class TestBatchAPIUnchanged:
def test_index_files_still_works(self, workspace):
pipeline = workspace["pipeline"]
src_dir = workspace["src_dir"]
f1 = _write_file(src_dir, "batch1.py", "batch1 = 1\n")
f2 = _write_file(src_dir, "batch2.py", "batch2 = 2\n")
stats = pipeline.index_files([f1, f2], root=src_dir)
assert stats.files_processed == 2
assert stats.chunks_created >= 2
def test_index_files_works_without_metadata(self, workspace):
"""Batch API should work even without MetadataStore."""
cfg = workspace["config"]
pipeline_no_meta = IndexingPipeline(
embedder=FakeEmbedder(),
binary_store=BinaryStore(workspace["store_dir"] / "no_meta", DIM, cfg),
ann_index=ANNIndex(workspace["store_dir"] / "no_meta", DIM, cfg),
fts=FTSEngine(str(workspace["store_dir"] / "no_meta_fts.db")),
config=cfg,
)
src_dir = workspace["src_dir"]
f = _write_file(src_dir, "no_meta_batch.py", "x = 1\n")
stats = pipeline_no_meta.index_files([f], root=src_dir)
assert stats.files_processed == 1