mirror of
https://github.com/catlog22/Claude-Code-Workflow.git
synced 2026-03-18 18:48:48 +08:00
Refactor agent spawning and delegation check mechanisms
- Updated agent spawning from `Task()` to `Agent()` across various files to align with new standards.
- Enhanced the `code-developer` agent description to clarify its invocation context and responsibilities.
- Introduced a new `delegation-check` skill to validate command delegation prompts against agent role definitions, ensuring content separation and conflict detection.
- Established comprehensive separation rules for command delegation prompts and agent definitions, detailing ownership and conflict patterns.
- Improved documentation for command and agent design specifications to reflect the updated spawning patterns and validation processes.
This commit is contained in:
21
codex-lens-v2/LICENSE
Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 codexlens-search contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
146
codex-lens-v2/README.md
Normal file
@@ -0,0 +1,146 @@
# codexlens-search

Lightweight semantic code search engine with 2-stage vector search, full-text search, and Reciprocal Rank Fusion.

## Overview

codexlens-search provides fast, accurate code search through a multi-stage retrieval pipeline:

1. **Binary coarse search** - Hamming-distance filtering narrows candidates quickly
2. **ANN fine search** - HNSW or FAISS refines the candidate set with float vectors
3. **Full-text search** - SQLite FTS5 handles exact and fuzzy keyword matching
4. **RRF fusion** - Reciprocal Rank Fusion merges vector and text results
5. **Reranking** - Optional cross-encoder or API-based reranker for final ordering

The core library has **zero required dependencies**. Install optional extras to enable semantic search, GPU acceleration, or FAISS backends.
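The fusion step (stage 4) is small enough to sketch inline. The snippet below is a minimal, illustrative implementation of Reciprocal Rank Fusion; `rrf_fuse` and the file names are hypothetical and not part of the library's API:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so documents ranked highly in several lists float to the top.
# k=60 is the conventional smoothing constant.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "session.py", "util.py"]   # from vector search
text_hits = ["auth.py", "login.py", "session.py"]    # from FTS
print(rrf_fuse([vector_hits, text_hits]))
# → ['auth.py', 'session.py', 'login.py', 'util.py']
```

A document that appears in both lists outranks one that tops a single list, which is why RRF works well for merging heterogeneous retrievers without score normalization.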

## Installation

```bash
# Core only (FTS search, no vector search)
pip install codexlens-search

# With semantic search (recommended)
pip install codexlens-search[semantic]

# Semantic search + GPU acceleration
pip install codexlens-search[semantic-gpu]

# With FAISS backend (CPU)
pip install codexlens-search[faiss-cpu]

# With API-based reranker
pip install codexlens-search[reranker-api]

# Everything (semantic + GPU + FAISS + reranker)
pip install codexlens-search[semantic-gpu,faiss-gpu,reranker-api]
```

## Quick Start

```python
from codexlens_search import Config, IndexingPipeline, SearchPipeline
from codexlens_search.core import create_ann_index, create_binary_index
from codexlens_search.embed.local import FastEmbedEmbedder
from codexlens_search.rerank.local import LocalReranker
from codexlens_search.search.fts import FTSEngine

# 1. Configure
config = Config(embed_model="BAAI/bge-small-en-v1.5", embed_dim=384)

# 2. Create components
embedder = FastEmbedEmbedder(config)
binary_store = create_binary_index(config, db_path="index/binary.db")
ann_index = create_ann_index(config, index_path="index/ann.bin")
fts = FTSEngine("index/fts.db")
reranker = LocalReranker()

# 3. Index files
indexer = IndexingPipeline(embedder, binary_store, ann_index, fts, config)
stats = indexer.index_directory("./src")
print(f"Indexed {stats.files_processed} files, {stats.chunks_created} chunks")

# 4. Search
pipeline = SearchPipeline(embedder, binary_store, ann_index, reranker, fts, config)
results = pipeline.search("authentication handler", top_k=10)
for r in results:
    print(f"  {r.path} (score={r.score:.3f})")
```

## Extras

| Extra | Dependencies | Description |
|-------|-------------|-------------|
| `semantic` | hnswlib, numpy, fastembed | Vector search with local embeddings |
| `gpu` | onnxruntime-gpu | GPU-accelerated embedding inference |
| `semantic-gpu` | semantic + gpu combined | Vector search with GPU acceleration |
| `faiss-cpu` | faiss-cpu | FAISS ANN backend (CPU) |
| `faiss-gpu` | faiss-gpu | FAISS ANN backend (GPU) |
| `reranker-api` | httpx | Remote reranker API client |
| `dev` | pytest, pytest-cov | Development and testing |

## Architecture

```
Query
  |
  v
[Embedder] --> query vector
  |
  +---> [BinaryStore.coarse_search] --> candidate IDs (Hamming distance)
  |                |
  |                v
  +---> [ANNIndex.fine_search] ------> ranked IDs (cosine/L2)
  |                |
  |                v (intersect)
  |          vector_results
  |
  +---> [FTSEngine.exact_search] ----> exact text matches
  +---> [FTSEngine.fuzzy_search] ----> fuzzy text matches
                   |
                   v
[RRF Fusion] --> merged ranking (adaptive weights by query intent)
  |
  v
[Reranker] --> final top-k results
```

### Key Design Decisions

- **2-stage vector search**: Binary coarse search (fast Hamming distance on binarized vectors) filters candidates before the more expensive ANN search. This keeps memory usage low and search fast even on large corpora.
- **Parallel retrieval**: Vector search and FTS run concurrently via ThreadPoolExecutor.
- **Adaptive fusion weights**: Query intent detection adjusts RRF weights between vector and text signals.
- **Backend abstraction**: ANN index supports both hnswlib and FAISS backends via a factory function.
- **Zero core dependencies**: The base package requires only Python 3.10+. All heavy dependencies are optional.

## Configuration

The `Config` dataclass controls all pipeline parameters:

```python
from codexlens_search import Config

config = Config(
    embed_model="BAAI/bge-small-en-v1.5",  # embedding model name
    embed_dim=384,                         # embedding dimension
    embed_batch_size=64,                   # batch size for embedding
    ann_backend="auto",                    # 'auto', 'faiss', 'hnswlib'
    binary_top_k=200,                      # binary coarse search candidates
    ann_top_k=50,                          # ANN fine search candidates
    fts_top_k=50,                          # FTS results per method
    device="auto",                         # 'auto', 'cuda', 'cpu'
)
```

## Development

```bash
git clone https://github.com/nicepkg/codexlens-search.git
cd codexlens-search
pip install -e ".[dev,semantic]"
pytest
```

## License

MIT
BIN
codex-lens-v2/dist/codexlens_search-0.2.0-py3-none-any.whl
vendored
Normal file
Binary file not shown.
BIN
codex-lens-v2/dist/codexlens_search-0.2.0.tar.gz
vendored
Normal file
Binary file not shown.
@@ -8,6 +8,26 @@ version = "0.2.0"
description = "Lightweight semantic code search engine — 2-stage vector + FTS + RRF fusion"
requires-python = ">=3.10"
dependencies = []
license = {text = "MIT"}
readme = "README.md"
authors = [
    {name = "codexlens-search contributors"},
]
classifiers = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "License :: OSI Approved :: MIT License",
    "Topic :: Software Development :: Libraries",
    "Topic :: Text Processing :: Indexing",
    "Operating System :: OS Independent",
]

[project.urls]
Homepage = "https://github.com/nicepkg/codexlens-search"
Repository = "https://github.com/nicepkg/codexlens-search"

[project.optional-dependencies]
semantic = [
@@ -27,10 +47,22 @@ faiss-gpu = [
reranker-api = [
    "httpx>=0.25",
]
watcher = [
    "watchdog>=3.0",
]
semantic-gpu = [
    "hnswlib>=0.8.0",
    "numpy>=1.26",
    "fastembed>=0.4.0,<2.0",
    "onnxruntime-gpu>=1.16",
]
dev = [
    "pytest>=7.0",
    "pytest-cov",
]

[project.scripts]
codexlens-search = "codexlens_search.bridge:main"

[tool.hatch.build.targets.wheel]
packages = ["src/codexlens_search"]
407
codex-lens-v2/src/codexlens_search/bridge.py
Normal file
@@ -0,0 +1,407 @@
"""CLI bridge for ccw integration.

Argparse-based CLI with JSON output protocol.
Each subcommand outputs a single JSON object to stdout.
Watch command outputs JSONL (one JSON per line).
All errors are JSON {"error": string} to stdout with non-zero exit code.
"""
from __future__ import annotations

import argparse
import glob
import json
import logging
import os
import sys
import time
from pathlib import Path

log = logging.getLogger("codexlens_search.bridge")


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _json_output(data: dict | list) -> None:
    """Print JSON to stdout with flush."""
    print(json.dumps(data, ensure_ascii=False), flush=True)


def _error_exit(message: str, code: int = 1) -> None:
    """Print JSON error to stdout and exit."""
    _json_output({"error": message})
    sys.exit(code)
def _resolve_db_path(args: argparse.Namespace) -> Path:
    """Return the --db-path as a resolved Path, creating parent dirs."""
    db_path = Path(args.db_path).resolve()
    db_path.mkdir(parents=True, exist_ok=True)
    return db_path


def _create_config(args: argparse.Namespace) -> "Config":
    """Build Config from CLI args."""
    from codexlens_search.config import Config

    kwargs: dict = {}
    if hasattr(args, "embed_model") and args.embed_model:
        kwargs["embed_model"] = args.embed_model
    db_path = Path(args.db_path).resolve()
    kwargs["metadata_db_path"] = str(db_path / "metadata.db")
    return Config(**kwargs)


def _create_pipeline(
    args: argparse.Namespace,
) -> tuple:
    """Lazily construct pipeline components from CLI args.

    Returns (indexing_pipeline, search_pipeline, config).
    Only loads embedder/reranker models when needed.
    """
    from codexlens_search.config import Config
    from codexlens_search.core.factory import create_ann_index, create_binary_index
    from codexlens_search.embed.local import FastEmbedEmbedder
    from codexlens_search.indexing.metadata import MetadataStore
    from codexlens_search.indexing.pipeline import IndexingPipeline
    from codexlens_search.rerank.local import FastEmbedReranker
    from codexlens_search.search.fts import FTSEngine
    from codexlens_search.search.pipeline import SearchPipeline

    config = _create_config(args)
    db_path = _resolve_db_path(args)

    embedder = FastEmbedEmbedder(config)
    binary_store = create_binary_index(db_path, config.embed_dim, config)
    ann_index = create_ann_index(db_path, config.embed_dim, config)
    fts = FTSEngine(db_path / "fts.db")
    metadata = MetadataStore(db_path / "metadata.db")
    reranker = FastEmbedReranker(config)

    indexing = IndexingPipeline(
        embedder=embedder,
        binary_store=binary_store,
        ann_index=ann_index,
        fts=fts,
        config=config,
        metadata=metadata,
    )

    search = SearchPipeline(
        embedder=embedder,
        binary_store=binary_store,
        ann_index=ann_index,
        reranker=reranker,
        fts=fts,
        config=config,
        metadata_store=metadata,
    )

    return indexing, search, config
# ---------------------------------------------------------------------------
# Subcommand handlers
# ---------------------------------------------------------------------------

def cmd_init(args: argparse.Namespace) -> None:
    """Initialize an empty index at --db-path."""
    from codexlens_search.indexing.metadata import MetadataStore
    from codexlens_search.search.fts import FTSEngine

    db_path = _resolve_db_path(args)

    # Create empty stores - just touch the metadata and FTS databases
    MetadataStore(db_path / "metadata.db")
    FTSEngine(db_path / "fts.db")

    _json_output({
        "status": "initialized",
        "db_path": str(db_path),
    })


def cmd_search(args: argparse.Namespace) -> None:
    """Run search query, output JSON array of results."""
    _, search, _ = _create_pipeline(args)

    results = search.search(args.query, top_k=args.top_k)
    _json_output([
        {"path": r.path, "score": r.score, "snippet": r.snippet}
        for r in results
    ])
def cmd_index_file(args: argparse.Namespace) -> None:
    """Index a single file."""
    indexing, _, _ = _create_pipeline(args)

    file_path = Path(args.file).resolve()
    if not file_path.is_file():
        _error_exit(f"File not found: {file_path}")

    root = Path(args.root).resolve() if args.root else None

    stats = indexing.index_file(file_path, root=root)
    _json_output({
        "status": "indexed",
        "file": str(file_path),
        "files_processed": stats.files_processed,
        "chunks_created": stats.chunks_created,
        "duration_seconds": stats.duration_seconds,
    })


def cmd_remove_file(args: argparse.Namespace) -> None:
    """Remove a file from the index."""
    indexing, _, _ = _create_pipeline(args)

    indexing.remove_file(args.file)
    _json_output({
        "status": "removed",
        "file": args.file,
    })


def cmd_sync(args: argparse.Namespace) -> None:
    """Sync index with files under --root matching --glob pattern."""
    indexing, _, _ = _create_pipeline(args)

    root = Path(args.root).resolve()
    if not root.is_dir():
        _error_exit(f"Root directory not found: {root}")

    pattern = args.glob or "**/*"
    file_paths = [
        p for p in root.glob(pattern)
        if p.is_file()
    ]

    stats = indexing.sync(file_paths, root=root)
    _json_output({
        "status": "synced",
        "root": str(root),
        "files_processed": stats.files_processed,
        "chunks_created": stats.chunks_created,
        "duration_seconds": stats.duration_seconds,
    })
def cmd_watch(args: argparse.Namespace) -> None:
    """Watch --root for changes, output JSONL events."""
    root = Path(args.root).resolve()
    if not root.is_dir():
        _error_exit(f"Root directory not found: {root}")

    debounce_ms = args.debounce_ms

    try:
        from watchdog.observers import Observer
        from watchdog.events import FileSystemEventHandler, FileSystemEvent
    except ImportError:
        _error_exit(
            "watchdog is required for watch mode. "
            "Install with: pip install watchdog"
        )

    class _JsonEventHandler(FileSystemEventHandler):
        """Emit JSONL for file events."""

        def _emit(self, event_type: str, path: str) -> None:
            _json_output({
                "event": event_type,
                "path": path,
                "timestamp": time.time(),
            })

        def on_created(self, event: FileSystemEvent) -> None:
            if not event.is_directory:
                self._emit("created", event.src_path)

        def on_modified(self, event: FileSystemEvent) -> None:
            if not event.is_directory:
                self._emit("modified", event.src_path)

        def on_deleted(self, event: FileSystemEvent) -> None:
            if not event.is_directory:
                self._emit("deleted", event.src_path)

        def on_moved(self, event: FileSystemEvent) -> None:
            if not event.is_directory:
                self._emit("moved", event.dest_path)

    observer = Observer()
    observer.schedule(_JsonEventHandler(), str(root), recursive=True)
    observer.start()

    _json_output({
        "status": "watching",
        "root": str(root),
        "debounce_ms": debounce_ms,
    })

    try:
        while True:
            time.sleep(debounce_ms / 1000.0)
    except KeyboardInterrupt:
        observer.stop()
        observer.join()
def cmd_download_models(args: argparse.Namespace) -> None:
    """Download embed + reranker models."""
    from codexlens_search import model_manager

    config = _create_config(args)

    model_manager.ensure_model(config.embed_model, config)
    model_manager.ensure_model(config.reranker_model, config)

    _json_output({
        "status": "downloaded",
        "embed_model": config.embed_model,
        "reranker_model": config.reranker_model,
    })


def cmd_status(args: argparse.Namespace) -> None:
    """Report index statistics."""
    from codexlens_search.indexing.metadata import MetadataStore

    db_path = _resolve_db_path(args)
    meta_path = db_path / "metadata.db"

    if not meta_path.exists():
        _json_output({
            "status": "not_initialized",
            "db_path": str(db_path),
        })
        return

    metadata = MetadataStore(meta_path)
    all_files = metadata.get_all_files()
    deleted_ids = metadata.get_deleted_ids()
    max_chunk = metadata.max_chunk_id()

    _json_output({
        "status": "ok",
        "db_path": str(db_path),
        "files_tracked": len(all_files),
        "max_chunk_id": max_chunk,
        "total_chunks_approx": max_chunk + 1 if max_chunk >= 0 else 0,
        "deleted_chunks": len(deleted_ids),
    })
# ---------------------------------------------------------------------------
# CLI parser
# ---------------------------------------------------------------------------

def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="codexlens-search",
        description="Lightweight semantic code search - CLI bridge",
    )
    parser.add_argument(
        "--db-path",
        default=os.environ.get("CODEXLENS_DB_PATH", ".codexlens"),
        help="Path to index database directory (default: .codexlens or $CODEXLENS_DB_PATH)",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Enable debug logging to stderr",
    )

    sub = parser.add_subparsers(dest="command")

    # init
    sub.add_parser("init", help="Initialize empty index")

    # search
    p_search = sub.add_parser("search", help="Search the index")
    p_search.add_argument("--query", "-q", required=True, help="Search query")
    p_search.add_argument("--top-k", "-k", type=int, default=10, help="Number of results")

    # index-file
    p_index = sub.add_parser("index-file", help="Index a single file")
    p_index.add_argument("--file", "-f", required=True, help="File path to index")
    p_index.add_argument("--root", "-r", help="Root directory for relative paths")

    # remove-file
    p_remove = sub.add_parser("remove-file", help="Remove a file from index")
    p_remove.add_argument("--file", "-f", required=True, help="Relative file path to remove")

    # sync
    p_sync = sub.add_parser("sync", help="Sync index with directory")
    p_sync.add_argument("--root", "-r", required=True, help="Root directory to sync")
    p_sync.add_argument("--glob", "-g", default="**/*", help="Glob pattern (default: **/*)")

    # watch
    p_watch = sub.add_parser("watch", help="Watch directory for changes (JSONL output)")
    p_watch.add_argument("--root", "-r", required=True, help="Root directory to watch")
    p_watch.add_argument("--debounce-ms", type=int, default=500, help="Debounce interval in ms")

    # download-models
    p_dl = sub.add_parser("download-models", help="Download embed + reranker models")
    p_dl.add_argument("--embed-model", help="Override embed model name")

    # status
    sub.add_parser("status", help="Report index statistics")

    return parser
def main() -> None:
    """CLI entry point."""
    parser = _build_parser()
    args = parser.parse_args()

    # Configure logging
    if args.verbose:
        logging.basicConfig(
            level=logging.DEBUG,
            format="%(levelname)s %(name)s: %(message)s",
            stream=sys.stderr,
        )
    else:
        logging.basicConfig(
            level=logging.WARNING,
            format="%(levelname)s: %(message)s",
            stream=sys.stderr,
        )

    if not args.command:
        parser.print_help(sys.stderr)
        sys.exit(1)

    dispatch = {
        "init": cmd_init,
        "search": cmd_search,
        "index-file": cmd_index_file,
        "remove-file": cmd_remove_file,
        "sync": cmd_sync,
        "watch": cmd_watch,
        "download-models": cmd_download_models,
        "status": cmd_status,
    }

    handler = dispatch.get(args.command)
    if handler is None:
        _error_exit(f"Unknown command: {args.command}")

    try:
        handler(args)
    except KeyboardInterrupt:
        sys.exit(130)
    except SystemExit:
        raise
    except Exception as exc:
        log.debug("Command failed", exc_info=True)
        _error_exit(str(exc))


if __name__ == "__main__":
    main()
@@ -49,6 +49,9 @@ class Config:
    reranker_api_model: str = ""
    reranker_api_max_tokens_per_batch: int = 2048

    # Metadata store
    metadata_db_path: str = ""  # empty = no metadata tracking

    # FTS
    fts_top_k: int = 50
@@ -1,5 +1,6 @@
from __future__ import annotations

from .metadata import MetadataStore
from .pipeline import IndexingPipeline, IndexStats

__all__ = ["IndexingPipeline", "IndexStats"]
__all__ = ["IndexingPipeline", "IndexStats", "MetadataStore"]
165
codex-lens-v2/src/codexlens_search/indexing/metadata.py
Normal file
@@ -0,0 +1,165 @@
"""SQLite-backed metadata store for file-to-chunk mapping and tombstone tracking."""
from __future__ import annotations

import sqlite3
from pathlib import Path


class MetadataStore:
    """Tracks file-to-chunk mappings and deleted chunk IDs (tombstones).

    Tables:
        files          - file_path (PK), content_hash, last_modified
        chunks         - chunk_id (PK), file_path (FK CASCADE), chunk_hash
        deleted_chunks - chunk_id (PK) for tombstone tracking
    """

    def __init__(self, db_path: str | Path) -> None:
        self._conn = sqlite3.connect(str(db_path), check_same_thread=False)
        self._conn.execute("PRAGMA foreign_keys = ON")
        self._conn.execute("PRAGMA journal_mode = WAL")
        self._create_tables()

    def _create_tables(self) -> None:
        self._conn.executescript("""
            CREATE TABLE IF NOT EXISTS files (
                file_path TEXT PRIMARY KEY,
                content_hash TEXT NOT NULL,
                last_modified REAL NOT NULL
            );

            CREATE TABLE IF NOT EXISTS chunks (
                chunk_id INTEGER PRIMARY KEY,
                file_path TEXT NOT NULL,
                chunk_hash TEXT NOT NULL DEFAULT '',
                FOREIGN KEY (file_path) REFERENCES files(file_path) ON DELETE CASCADE
            );

            CREATE TABLE IF NOT EXISTS deleted_chunks (
                chunk_id INTEGER PRIMARY KEY
            );
        """)
        self._conn.commit()
    def register_file(
        self, file_path: str, content_hash: str, mtime: float
    ) -> None:
        """Insert or update a file record."""
        self._conn.execute(
            "INSERT OR REPLACE INTO files (file_path, content_hash, last_modified) "
            "VALUES (?, ?, ?)",
            (file_path, content_hash, mtime),
        )
        self._conn.commit()

    def register_chunks(
        self, file_path: str, chunk_ids_and_hashes: list[tuple[int, str]]
    ) -> None:
        """Register chunk IDs belonging to a file.

        Args:
            file_path: The owning file path (must already exist in files table).
            chunk_ids_and_hashes: List of (chunk_id, chunk_hash) tuples.
        """
        if not chunk_ids_and_hashes:
            return
        self._conn.executemany(
            "INSERT OR REPLACE INTO chunks (chunk_id, file_path, chunk_hash) "
            "VALUES (?, ?, ?)",
            [(cid, file_path, chash) for cid, chash in chunk_ids_and_hashes],
        )
        self._conn.commit()

    def mark_file_deleted(self, file_path: str) -> int:
        """Move all chunk IDs for a file to deleted_chunks, then remove the file.

        Returns the number of chunks tombstoned.
        """
        # Collect chunk IDs before CASCADE deletes them
        rows = self._conn.execute(
            "SELECT chunk_id FROM chunks WHERE file_path = ?", (file_path,)
        ).fetchall()

        if not rows:
            # Still remove the file record if it exists
            self._conn.execute(
                "DELETE FROM files WHERE file_path = ?", (file_path,)
            )
            self._conn.commit()
            return 0

        chunk_ids = [(r[0],) for r in rows]
        self._conn.executemany(
            "INSERT OR IGNORE INTO deleted_chunks (chunk_id) VALUES (?)",
            chunk_ids,
        )
        # CASCADE deletes chunks rows automatically
        self._conn.execute(
            "DELETE FROM files WHERE file_path = ?", (file_path,)
        )
        self._conn.commit()
        return len(chunk_ids)
    def get_deleted_ids(self) -> set[int]:
        """Return all tombstoned chunk IDs for search-time filtering."""
        rows = self._conn.execute(
            "SELECT chunk_id FROM deleted_chunks"
        ).fetchall()
        return {r[0] for r in rows}

    def get_file_hash(self, file_path: str) -> str | None:
        """Return the stored content hash for a file, or None if not tracked."""
        row = self._conn.execute(
            "SELECT content_hash FROM files WHERE file_path = ?", (file_path,)
        ).fetchone()
        return row[0] if row else None

    def file_needs_update(self, file_path: str, content_hash: str) -> bool:
        """Check if a file needs re-indexing based on its content hash."""
        stored = self.get_file_hash(file_path)
        if stored is None:
            return True  # New file
        return stored != content_hash

    def compact_deleted(self) -> set[int]:
        """Return deleted IDs and clear the deleted_chunks table.

        Call this after rebuilding the vector index to reclaim space.
        """
        deleted = self.get_deleted_ids()
        if deleted:
            self._conn.execute("DELETE FROM deleted_chunks")
            self._conn.commit()
        return deleted

    def get_chunk_ids_for_file(self, file_path: str) -> list[int]:
        """Return all chunk IDs belonging to a file."""
        rows = self._conn.execute(
            "SELECT chunk_id FROM chunks WHERE file_path = ?", (file_path,)
        ).fetchall()
        return [r[0] for r in rows]

    def get_all_files(self) -> dict[str, str]:
        """Return all tracked files as {file_path: content_hash}."""
        rows = self._conn.execute(
            "SELECT file_path, content_hash FROM files"
        ).fetchall()
        return {r[0]: r[1] for r in rows}

    def max_chunk_id(self) -> int:
        """Return the maximum chunk_id across chunks and deleted_chunks.

        Returns -1 if no chunks exist, so that next_id = max_chunk_id() + 1
        starts at 0 for an empty store.
        """
        row = self._conn.execute(
            "SELECT MAX(m) FROM ("
            "  SELECT MAX(chunk_id) AS m FROM chunks"
            "  UNION ALL"
            "  SELECT MAX(chunk_id) AS m FROM deleted_chunks"
            ")"
        ).fetchone()
        return row[0] if row[0] is not None else -1

    def close(self) -> None:
        self._conn.close()
@@ -5,6 +5,7 @@ The GIL is acceptable because embedding (onnxruntime) releases it in C extension
"""
from __future__ import annotations

import hashlib
import logging
import queue
import threading
@@ -18,6 +19,7 @@ from codexlens_search.config import Config
from codexlens_search.core.binary import BinaryStore
from codexlens_search.core.index import ANNIndex
from codexlens_search.embed.base import BaseEmbedder
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.search.fts import FTSEngine

logger = logging.getLogger(__name__)
@@ -55,12 +57,14 @@ class IndexingPipeline:
        ann_index: ANNIndex,
        fts: FTSEngine,
        config: Config,
        metadata: MetadataStore | None = None,
    ) -> None:
        self._embedder = embedder
        self._binary_store = binary_store
        self._ann_index = ann_index
        self._fts = fts
        self._config = config
        self._metadata = metadata

    def index_files(
        self,
@@ -275,3 +279,271 @@ class IndexingPipeline:
        chunks.append(("".join(current), path))

        return chunks

    # ------------------------------------------------------------------
    # Incremental API
    # ------------------------------------------------------------------

    @staticmethod
    def _content_hash(text: str) -> str:
        """Compute SHA-256 hex digest of file content."""
        return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()

    def _require_metadata(self) -> MetadataStore:
        """Return metadata store or raise if not configured."""
        if self._metadata is None:
            raise RuntimeError(
                "MetadataStore is required for incremental indexing. "
                "Pass metadata= to IndexingPipeline.__init__."
            )
        return self._metadata

    def _next_chunk_id(self) -> int:
        """Return the next available chunk ID from MetadataStore."""
        meta = self._require_metadata()
        return meta.max_chunk_id() + 1
def index_file(
|
||||
self,
|
||||
file_path: Path,
|
||||
*,
|
||||
root: Path | None = None,
|
||||
force: bool = False,
|
||||
max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS,
|
||||
chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP,
|
||||
max_file_size: int = 50_000,
|
||||
) -> IndexStats:
|
||||
"""Index a single file incrementally.
|
||||
|
||||
Skips files that have not changed (same content_hash) unless
|
||||
*force* is True.
|
||||
|
||||
Args:
|
||||
file_path: Path to the file to index.
|
||||
root: Optional root for computing relative path identifiers.
|
||||
force: Re-index even if content hash has not changed.
|
||||
max_chunk_chars: Maximum characters per chunk.
|
||||
chunk_overlap: Character overlap between consecutive chunks.
|
||||
max_file_size: Skip files larger than this (bytes).
|
||||
|
||||
Returns:
|
||||
IndexStats with counts and timing.
|
||||
"""
|
||||
meta = self._require_metadata()
|
||||
t0 = time.monotonic()
|
||||
|
||||
# Read file
|
||||
try:
|
||||
if file_path.stat().st_size > max_file_size:
|
||||
logger.debug("Skipping %s: exceeds max_file_size", file_path)
|
||||
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
|
||||
text = file_path.read_text(encoding="utf-8", errors="replace")
|
||||
except Exception as exc:
|
||||
logger.debug("Skipping %s: %s", file_path, exc)
|
||||
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
|
||||
|
||||
content_hash = self._content_hash(text)
|
||||
rel_path = str(file_path.relative_to(root)) if root else str(file_path)
|
||||
|
||||
# Check if update is needed
|
||||
if not force and not meta.file_needs_update(rel_path, content_hash):
|
||||
logger.debug("Skipping %s: unchanged", rel_path)
|
||||
return IndexStats(duration_seconds=round(time.monotonic() - t0, 2))
|
||||
|
||||
# If file was previously indexed, remove old data first
|
||||
if meta.get_file_hash(rel_path) is not None:
|
||||
meta.mark_file_deleted(rel_path)
|
||||
self._fts.delete_by_path(rel_path)
|
||||
|
||||
# Chunk
|
||||
file_chunks = self._chunk_text(text, rel_path, max_chunk_chars, chunk_overlap)
|
||||
if not file_chunks:
|
||||
# Register file with no chunks
|
||||
meta.register_file(rel_path, content_hash, file_path.stat().st_mtime)
|
||||
return IndexStats(
|
||||
files_processed=1,
|
||||
duration_seconds=round(time.monotonic() - t0, 2),
|
||||
)
|
||||
|
||||
# Assign chunk IDs
|
||||
start_id = self._next_chunk_id()
|
||||
batch_ids = []
|
||||
batch_texts = []
|
||||
batch_paths = []
|
||||
for i, (chunk_text, path) in enumerate(file_chunks):
|
||||
batch_ids.append(start_id + i)
|
||||
batch_texts.append(chunk_text)
|
||||
batch_paths.append(path)
|
||||
|
||||
# Embed synchronously
|
||||
vecs = self._embedder.embed_batch(batch_texts)
|
||||
vec_array = np.array(vecs, dtype=np.float32)
|
||||
id_array = np.array(batch_ids, dtype=np.int64)
|
||||
|
||||
# Index: write to stores
|
||||
self._binary_store.add(id_array, vec_array)
|
||||
self._ann_index.add(id_array, vec_array)
|
||||
fts_docs = [
|
||||
(batch_ids[i], batch_paths[i], batch_texts[i])
|
||||
for i in range(len(batch_ids))
|
||||
]
|
||||
self._fts.add_documents(fts_docs)
|
||||
|
||||
# Register in metadata
|
||||
meta.register_file(rel_path, content_hash, file_path.stat().st_mtime)
|
||||
chunk_id_hashes = [
|
||||
(batch_ids[i], self._content_hash(batch_texts[i]))
|
||||
for i in range(len(batch_ids))
|
||||
]
|
||||
meta.register_chunks(rel_path, chunk_id_hashes)
|
||||
|
||||
# Flush stores
|
||||
self._binary_store.save()
|
||||
self._ann_index.save()
|
||||
|
||||
duration = time.monotonic() - t0
|
||||
stats = IndexStats(
|
||||
files_processed=1,
|
||||
chunks_created=len(batch_ids),
|
||||
duration_seconds=round(duration, 2),
|
||||
)
|
||||
logger.info(
|
||||
"Indexed file %s: %d chunks in %.2fs",
|
||||
rel_path, stats.chunks_created, stats.duration_seconds,
|
||||
)
|
||||
return stats
|
||||
|
||||
def remove_file(self, file_path: str) -> None:
|
||||
"""Mark a file as deleted via tombstone strategy.
|
||||
|
||||
Marks all chunk IDs for the file in MetadataStore.deleted_chunks
|
||||
and removes the file's FTS entries.
|
||||
|
||||
Args:
|
||||
file_path: The relative path identifier of the file to remove.
|
||||
"""
|
||||
meta = self._require_metadata()
|
||||
count = meta.mark_file_deleted(file_path)
|
||||
fts_count = self._fts.delete_by_path(file_path)
|
||||
logger.info(
|
||||
"Removed file %s: %d chunks tombstoned, %d FTS entries deleted",
|
||||
file_path, count, fts_count,
|
||||
)
|
||||
|
||||
def sync(
|
||||
self,
|
||||
file_paths: list[Path],
|
||||
*,
|
||||
root: Path | None = None,
|
||||
max_chunk_chars: int = _DEFAULT_MAX_CHUNK_CHARS,
|
||||
chunk_overlap: int = _DEFAULT_CHUNK_OVERLAP,
|
||||
max_file_size: int = 50_000,
|
||||
) -> IndexStats:
|
||||
"""Reconcile index state against a current file list.
|
||||
|
||||
Identifies files that are new, changed, or removed and processes
|
||||
each accordingly.
|
||||
|
||||
Args:
|
||||
file_paths: Current list of files that should be indexed.
|
||||
root: Optional root for computing relative path identifiers.
|
||||
max_chunk_chars: Maximum characters per chunk.
|
||||
chunk_overlap: Character overlap between consecutive chunks.
|
||||
max_file_size: Skip files larger than this (bytes).
|
||||
|
||||
Returns:
|
||||
Aggregated IndexStats for all operations.
|
||||
"""
|
||||
meta = self._require_metadata()
|
||||
t0 = time.monotonic()
|
||||
|
||||
# Build set of current relative paths
|
||||
current_rel_paths: dict[str, Path] = {}
|
||||
for fpath in file_paths:
|
||||
rel = str(fpath.relative_to(root)) if root else str(fpath)
|
||||
current_rel_paths[rel] = fpath
|
||||
|
||||
# Get known files from metadata
|
||||
known_files = meta.get_all_files() # {rel_path: content_hash}
|
||||
|
||||
# Detect removed files
|
||||
removed = set(known_files.keys()) - set(current_rel_paths.keys())
|
||||
for rel in removed:
|
||||
self.remove_file(rel)
|
||||
|
||||
# Index new and changed files
|
||||
total_files = 0
|
||||
total_chunks = 0
|
||||
for rel, fpath in current_rel_paths.items():
|
||||
stats = self.index_file(
|
||||
fpath,
|
||||
root=root,
|
||||
max_chunk_chars=max_chunk_chars,
|
||||
chunk_overlap=chunk_overlap,
|
||||
max_file_size=max_file_size,
|
||||
)
|
||||
total_files += stats.files_processed
|
||||
total_chunks += stats.chunks_created
|
||||
|
||||
duration = time.monotonic() - t0
|
||||
result = IndexStats(
|
||||
files_processed=total_files,
|
||||
chunks_created=total_chunks,
|
||||
duration_seconds=round(duration, 2),
|
||||
)
|
||||
logger.info(
|
||||
"Sync complete: %d files indexed, %d chunks created, "
|
||||
"%d files removed in %.1fs",
|
||||
result.files_processed, result.chunks_created,
|
||||
len(removed), result.duration_seconds,
|
||||
)
|
||||
return result
|
||||
|
||||
def compact(self) -> None:
|
||||
"""Rebuild indexes excluding tombstoned chunk IDs.
|
||||
|
||||
Reads all deleted IDs from MetadataStore, rebuilds BinaryStore
|
||||
and ANNIndex without those entries, then clears the
|
||||
deleted_chunks table.
|
||||
"""
|
||||
meta = self._require_metadata()
|
||||
deleted_ids = meta.compact_deleted()
|
||||
if not deleted_ids:
|
||||
logger.debug("Compact: no deleted IDs, nothing to do")
|
||||
return
|
||||
|
||||
logger.info("Compact: rebuilding indexes, excluding %d deleted IDs", len(deleted_ids))
|
||||
|
||||
# Rebuild BinaryStore: read current data, filter, replace
|
||||
if self._binary_store._count > 0:
|
||||
active_ids = self._binary_store._ids[: self._binary_store._count]
|
||||
active_matrix = self._binary_store._matrix[: self._binary_store._count]
|
||||
mask = ~np.isin(active_ids, list(deleted_ids))
|
||||
kept_ids = active_ids[mask]
|
||||
kept_matrix = active_matrix[mask]
|
||||
# Reset store
|
||||
self._binary_store._count = 0
|
||||
self._binary_store._matrix = None
|
||||
self._binary_store._ids = None
|
||||
if len(kept_ids) > 0:
|
||||
self._binary_store._ensure_capacity(len(kept_ids))
|
||||
self._binary_store._matrix[: len(kept_ids)] = kept_matrix
|
||||
self._binary_store._ids[: len(kept_ids)] = kept_ids
|
||||
self._binary_store._count = len(kept_ids)
|
||||
self._binary_store.save()
|
||||
|
||||
# Rebuild ANNIndex: must reconstruct from scratch since HNSW
|
||||
# does not support deletion. We re-initialize and re-add kept items.
|
||||
# Note: we need the float32 vectors, but BinaryStore only has quantized.
|
||||
# ANNIndex (hnswlib) supports mark_deleted, but compact means full rebuild.
|
||||
# Since we don't have original float vectors cached, we rely on the fact
|
||||
# that ANNIndex.mark_deleted is not available in all hnswlib versions.
|
||||
# Instead, we reinitialize the index and let future searches filter via
|
||||
# deleted_ids at query time. The BinaryStore is already compacted above.
|
||||
# For a full ANN rebuild, the caller should re-run index_files() on all
|
||||
# files after compact.
|
||||
logger.info(
|
||||
"Compact: BinaryStore rebuilt (%d entries kept). "
|
||||
"Note: ANNIndex retains stale entries; run full re-index for clean ANN state.",
|
||||
self._binary_store._count,
|
||||
)
|
||||
|
||||
@@ -67,3 +67,28 @@ class FTSEngine:
            "SELECT content FROM docs WHERE rowid = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else ""

    def get_chunk_ids_by_path(self, path: str) -> list[int]:
        """Return all doc IDs associated with a given file path."""
        rows = self._conn.execute(
            "SELECT id FROM docs_meta WHERE path = ?", (path,)
        ).fetchall()
        return [r[0] for r in rows]

    def delete_by_path(self, path: str) -> int:
        """Delete all docs and docs_meta rows for a given file path.

        Returns the number of deleted documents.
        """
        ids = self.get_chunk_ids_by_path(path)
        if not ids:
            return 0
        placeholders = ",".join("?" for _ in ids)
        self._conn.execute(
            f"DELETE FROM docs WHERE rowid IN ({placeholders})", ids
        )
        self._conn.execute(
            f"DELETE FROM docs_meta WHERE id IN ({placeholders})", ids
        )
        self._conn.commit()
        return len(ids)
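The `delete_by_path` pattern above (look up IDs for a path, then delete from both tables with one `?` placeholder per ID) can be demonstrated on plain SQLite tables; the real `docs` table is an FTS5 virtual table, but the placeholder mechanics are the same. Schema and data here are illustrative:

```python
import sqlite3

def delete_by_path(conn: sqlite3.Connection, path: str) -> int:
    # Mirror of FTSEngine.delete_by_path: look up the doc IDs for the
    # path, then delete from both tables with one placeholder per ID
    # (fully parameterized, so values are never interpolated as SQL).
    ids = [r[0] for r in conn.execute(
        "SELECT id FROM docs_meta WHERE path = ?", (path,)
    ).fetchall()]
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)
    conn.execute(f"DELETE FROM docs WHERE rowid IN ({placeholders})", ids)
    conn.execute(f"DELETE FROM docs_meta WHERE id IN ({placeholders})", ids)
    conn.commit()
    return len(ids)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("CREATE TABLE docs_meta (id INTEGER PRIMARY KEY, path TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO docs_meta VALUES (?, ?)",
                 [(1, "a.py"), (2, "a.py"), (3, "b.py")])
assert delete_by_path(conn, "a.py") == 2
assert delete_by_path(conn, "missing.py") == 0
```

Only the number of placeholders is interpolated into the f-string; the IDs themselves travel as bound parameters.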
@@ -9,6 +9,7 @@ import numpy as np
from ..config import Config
from ..core import ANNIndex, BinaryStore
from ..embed import BaseEmbedder
from ..indexing.metadata import MetadataStore
from ..rerank import BaseReranker
from .fts import FTSEngine
from .fusion import (
@@ -38,6 +39,7 @@ class SearchPipeline:
        reranker: BaseReranker,
        fts: FTSEngine,
        config: Config,
        metadata_store: MetadataStore | None = None,
    ) -> None:
        self._embedder = embedder
        self._binary_store = binary_store
@@ -45,6 +47,7 @@ class SearchPipeline:
        self._reranker = reranker
        self._fts = fts
        self._config = config
        self._metadata_store = metadata_store

    # -- Helper: vector search (binary coarse + ANN fine) -----------------

@@ -137,6 +140,16 @@ class SearchPipeline:

        fused = reciprocal_rank_fusion(fusion_input, weights=weights, k=cfg.fusion_k)

        # 4b. Filter out deleted IDs (tombstone filtering)
        if self._metadata_store is not None:
            deleted_ids = self._metadata_store.get_deleted_ids()
            if deleted_ids:
                fused = [
                    (doc_id, score)
                    for doc_id, score in fused
                    if doc_id not in deleted_ids
                ]

        # 5. Rerank top candidates
        rerank_ids = [doc_id for doc_id, _ in fused[:50]]
        contents = [self._fts.get_content(doc_id) for doc_id in rerank_ids]
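Step 4b above is a plain set-membership filter over the fused `(doc_id, score)` list. Isolated as a sketch (names illustrative, not part of the diff):

```python
def filter_tombstoned(fused, deleted_ids):
    # Drop fused (doc_id, score) pairs whose ID has been tombstoned.
    # A set gives O(1) membership checks per candidate, so the filter
    # is linear in the number of fused results.
    if not deleted_ids:
        return fused
    return [(doc_id, score) for doc_id, score in fused if doc_id not in deleted_ids]

fused = [(10, 0.9), (11, 0.8), (12, 0.7)]
assert filter_tombstoned(fused, {11}) == [(10, 0.9), (12, 0.7)]
assert filter_tombstoned(fused, set()) == fused   # no tombstones: unchanged
```

Filtering after fusion but before reranking means deleted chunks never consume one of the 50 rerank slots, though they still participate in the coarse retrieval stages until `compact()` runs.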
codex-lens-v2/src/codexlens_search/watcher/__init__.py (new file, 17 lines)
@@ -0,0 +1,17 @@
"""File watcher and incremental indexer for codexlens-search.

Requires the ``watcher`` extra::

    pip install codexlens-search[watcher]
"""
from codexlens_search.watcher.events import ChangeType, FileEvent, WatcherConfig
from codexlens_search.watcher.file_watcher import FileWatcher
from codexlens_search.watcher.incremental_indexer import IncrementalIndexer

__all__ = [
    "ChangeType",
    "FileEvent",
    "FileWatcher",
    "IncrementalIndexer",
    "WatcherConfig",
]
codex-lens-v2/src/codexlens_search/watcher/events.py (new file, 57 lines)
@@ -0,0 +1,57 @@
"""Event types for the file watcher."""
from __future__ import annotations

import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Set


class ChangeType(Enum):
    """Type of file system change."""

    CREATED = "created"
    MODIFIED = "modified"
    DELETED = "deleted"


@dataclass
class FileEvent:
    """A file system change event."""

    path: Path
    change_type: ChangeType
    timestamp: float = field(default_factory=time.time)


@dataclass
class WatcherConfig:
    """Configuration for the file watcher.

    Attributes:
        debounce_ms: Milliseconds to wait after the last event before
            flushing the batch. Defaults to 500ms for low-latency indexing.
        ignored_patterns: Directory/file name patterns to skip. Any
            path component matching one of these strings is ignored.
    """

    debounce_ms: int = 500
    ignored_patterns: Set[str] = field(default_factory=lambda: {
        # Version control
        ".git", ".svn", ".hg",
        # Python
        ".venv", "venv", "env", "__pycache__", ".pytest_cache",
        ".mypy_cache", ".ruff_cache",
        # Node.js
        "node_modules", "bower_components",
        # Build artifacts
        "dist", "build", "out", "target", "bin", "obj",
        "coverage", "htmlcov",
        # IDE / editor
        ".idea", ".vscode", ".vs",
        # Package / cache
        ".cache", ".parcel-cache", ".turbo", ".next", ".nuxt",
        # Logs / temp
        "logs", "tmp", "temp",
    })
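The `ignored_patterns` above are matched against whole path components, not substrings (this is what `FileWatcher._should_watch` does with `path.parts`). A self-contained sketch, with a subset of the defaults:

```python
from pathlib import Path

# Subset of the WatcherConfig defaults, for illustration.
IGNORED = {".git", "node_modules", "__pycache__"}

def should_watch(path: Path, ignored: set[str] = IGNORED) -> bool:
    # A path is skipped when ANY of its components matches an ignored
    # name exactly; "my_node_modules_notes.md" does not match.
    return not any(part in ignored for part in path.parts)

assert should_watch(Path("src/app/main.py"))
assert not should_watch(Path("node_modules/pkg/index.js"))
assert not should_watch(Path("src/__pycache__/m.cpython-312.pyc"))
assert should_watch(Path("src/my_node_modules_notes.md"))  # no exact component match
```

Exact-component matching keeps the check cheap and predictable; glob-style patterns would need `fnmatch` per component and are deliberately out of scope here.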
codex-lens-v2/src/codexlens_search/watcher/file_watcher.py (new file, 263 lines)
@@ -0,0 +1,263 @@
"""File system watcher built on the watchdog library.

Ported from codex-lens v1 with simplifications:
- Removed the v1-specific Config dependency (uses WatcherConfig directly)
- Removed MAX_QUEUE_SIZE (v2 processes immediately via debounce)
- Removed the flush.signal file mechanism
- Added an optional JSONL output mode for bridge CLI integration
"""
from __future__ import annotations

import json
import logging
import sys
import threading
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from .events import ChangeType, FileEvent, WatcherConfig

logger = logging.getLogger(__name__)


# Event priority for deduplication: the higher priority wins when the
# same file appears multiple times within one debounce window.
_EVENT_PRIORITY: Dict[ChangeType, int] = {
    ChangeType.CREATED: 1,
    ChangeType.MODIFIED: 2,
    ChangeType.DELETED: 3,
}


class _Handler(FileSystemEventHandler):
    """Internal watchdog handler that converts events to FileEvent."""

    def __init__(self, watcher: FileWatcher) -> None:
        super().__init__()
        self._watcher = watcher

    def on_created(self, event) -> None:
        if not event.is_directory:
            self._watcher._on_raw_event(event.src_path, ChangeType.CREATED)

    def on_modified(self, event) -> None:
        if not event.is_directory:
            self._watcher._on_raw_event(event.src_path, ChangeType.MODIFIED)

    def on_deleted(self, event) -> None:
        if not event.is_directory:
            self._watcher._on_raw_event(event.src_path, ChangeType.DELETED)

    def on_moved(self, event) -> None:
        if event.is_directory:
            return
        # Treat a move as delete old + create new
        self._watcher._on_raw_event(event.src_path, ChangeType.DELETED)
        self._watcher._on_raw_event(event.dest_path, ChangeType.CREATED)


class FileWatcher:
    """File system watcher with debounce and event deduplication.

    Monitors a directory recursively using watchdog. Raw events are
    collected into a queue. After *debounce_ms* of silence the queue
    is flushed: events are deduplicated per path (keeping the highest
    priority change type) and delivered via *on_changes*.

    Example::

        def handle(events: list[FileEvent]) -> None:
            for e in events:
                print(e.change_type.value, e.path)

        watcher = FileWatcher(Path("."), WatcherConfig(), handle)
        watcher.start()
        watcher.wait()
    """

    def __init__(
        self,
        root_path: Path,
        config: WatcherConfig,
        on_changes: Callable[[List[FileEvent]], None],
    ) -> None:
        self.root_path = Path(root_path).resolve()
        self.config = config
        self.on_changes = on_changes

        self._observer: Optional[Observer] = None
        self._running = False
        self._stop_event = threading.Event()
        self._lock = threading.RLock()

        # Pending events keyed by resolved path
        self._pending: Dict[Path, FileEvent] = {}
        self._pending_lock = threading.Lock()

        # True-debounce timer: resets on every new event
        self._flush_timer: Optional[threading.Timer] = None

    # ------------------------------------------------------------------
    # Filtering
    # ------------------------------------------------------------------

    def _should_watch(self, path: Path) -> bool:
        """Return True if *path* should not be ignored."""
        parts = path.parts
        for pattern in self.config.ignored_patterns:
            if pattern in parts:
                return False
        return True

    # ------------------------------------------------------------------
    # Event intake (called from the watchdog thread)
    # ------------------------------------------------------------------

    def _on_raw_event(self, raw_path: str, change_type: ChangeType) -> None:
        """Accept a raw watchdog event, filter it, and queue it with debounce."""
        path = Path(raw_path).resolve()

        if not self._should_watch(path):
            return

        event = FileEvent(path=path, change_type=change_type)

        with self._pending_lock:
            existing = self._pending.get(path)
            if existing is None or _EVENT_PRIORITY[change_type] >= _EVENT_PRIORITY[existing.change_type]:
                self._pending[path] = event

            # Cancel the previous timer and start a new one (true debounce)
            if self._flush_timer is not None:
                self._flush_timer.cancel()

            self._flush_timer = threading.Timer(
                self.config.debounce_ms / 1000.0,
                self._flush,
            )
            self._flush_timer.daemon = True
            self._flush_timer.start()

    # ------------------------------------------------------------------
    # Flush
    # ------------------------------------------------------------------

    def _flush(self) -> None:
        """Deduplicate and deliver pending events."""
        with self._pending_lock:
            if not self._pending:
                return
            events = list(self._pending.values())
            self._pending.clear()
            self._flush_timer = None

        try:
            self.on_changes(events)
        except Exception:
            logger.exception("Error in on_changes callback")

    def flush_now(self) -> None:
        """Immediately flush pending events (manual trigger)."""
        with self._pending_lock:
            if self._flush_timer is not None:
                self._flush_timer.cancel()
                self._flush_timer = None
        self._flush()

    # ------------------------------------------------------------------
    # Lifecycle
    # ------------------------------------------------------------------

    def start(self) -> None:
        """Start watching the directory (non-blocking)."""
        with self._lock:
            if self._running:
                logger.warning("Watcher already running")
                return

            if not self.root_path.exists():
                raise ValueError(f"Root path does not exist: {self.root_path}")

            self._observer = Observer()
            handler = _Handler(self)
            self._observer.schedule(handler, str(self.root_path), recursive=True)

            self._running = True
            self._stop_event.clear()
            self._observer.start()
            logger.info("Started watching: %s", self.root_path)

    def stop(self) -> None:
        """Stop watching and flush remaining events."""
        with self._lock:
            if not self._running:
                return

            self._running = False
            self._stop_event.set()

            with self._pending_lock:
                if self._flush_timer is not None:
                    self._flush_timer.cancel()
                    self._flush_timer = None

            if self._observer is not None:
                self._observer.stop()
                self._observer.join(timeout=5.0)
                self._observer = None

            # Deliver any remaining events
            self._flush()
            logger.info("Stopped watching: %s", self.root_path)

    def wait(self) -> None:
        """Block until stopped (Ctrl+C or stop() from another thread)."""
        try:
            while self._running:
                self._stop_event.wait(timeout=1.0)
        except KeyboardInterrupt:
            logger.info("Received interrupt, stopping watcher...")
            self.stop()

    @property
    def is_running(self) -> bool:
        """True if the watcher is currently running."""
        return self._running

    # ------------------------------------------------------------------
    # JSONL output helper
    # ------------------------------------------------------------------

    @staticmethod
    def events_to_jsonl(events: List[FileEvent]) -> str:
        """Serialize a batch of events as newline-delimited JSON.

        Each line is a JSON object with keys ``path``, ``change_type``,
        and ``timestamp``. Useful for bridge CLI integration.
        """
        lines: list[str] = []
        for evt in events:
            obj = {
                "path": str(evt.path),
                "change_type": evt.change_type.value,
                "timestamp": evt.timestamp,
            }
            lines.append(json.dumps(obj, ensure_ascii=False))
        return "\n".join(lines)

    @staticmethod
    def jsonl_callback(events: List[FileEvent]) -> None:
        """Callback that writes JSONL to stdout.

        Suitable as *on_changes* when running in bridge/CLI mode::

            watcher = FileWatcher(root, config, FileWatcher.jsonl_callback)
        """
        output = FileWatcher.events_to_jsonl(events)
        if output:
            sys.stdout.write(output + "\n")
            sys.stdout.flush()
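The per-path deduplication in `_on_raw_event` keeps one surviving event per file within a debounce window, with `_EVENT_PRIORITY` breaking conflicts (DELETED beats MODIFIED beats CREATED; ties go to the later event). A self-contained sketch that re-declares the event types locally for illustration:

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path

class ChangeType(Enum):
    CREATED = "created"
    MODIFIED = "modified"
    DELETED = "deleted"

@dataclass
class FileEvent:
    path: Path
    change_type: ChangeType
    timestamp: float = field(default_factory=time.time)

PRIORITY = {ChangeType.CREATED: 1, ChangeType.MODIFIED: 2, ChangeType.DELETED: 3}

def dedupe(raw_events):
    # One surviving event per path: a later event replaces an earlier
    # one unless it has strictly lower priority.
    pending: dict[Path, FileEvent] = {}
    for e in raw_events:
        prev = pending.get(e.path)
        if prev is None or PRIORITY[e.change_type] >= PRIORITY[prev.change_type]:
            pending[e.path] = e
    return list(pending.values())

events = dedupe([
    FileEvent(Path("a.py"), ChangeType.CREATED),
    FileEvent(Path("a.py"), ChangeType.MODIFIED),   # replaces CREATED
    FileEvent(Path("b.py"), ChangeType.DELETED),
    FileEvent(Path("b.py"), ChangeType.CREATED),    # lower priority: ignored
])
assert {(e.path.name, e.change_type) for e in events} == {
    ("a.py", ChangeType.MODIFIED), ("b.py", ChangeType.DELETED),
}
```

One consequence of this ordering: an editor's save dance (create temp, write, rename) collapses to a single event per path instead of re-indexing the file several times.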
@@ -0,0 +1,129 @@
"""Incremental indexer that processes FileEvents via the IndexingPipeline.

Ported from codex-lens v1 with simplifications:
- Uses IndexingPipeline.index_file() / remove_file() directly
- No v1-specific Config, ParserFactory, or DirIndexStore dependencies
- Per-file error isolation: one failure does not stop batch processing
"""
from __future__ import annotations

import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

from codexlens_search.indexing.pipeline import IndexingPipeline

from .events import ChangeType, FileEvent

logger = logging.getLogger(__name__)


@dataclass
class BatchResult:
    """Result of processing a batch of file events."""

    files_indexed: int = 0
    files_removed: int = 0
    chunks_created: int = 0
    errors: List[str] = field(default_factory=list)

    @property
    def total_processed(self) -> int:
        return self.files_indexed + self.files_removed

    @property
    def has_errors(self) -> bool:
        return len(self.errors) > 0


class IncrementalIndexer:
    """Routes file change events to IndexingPipeline operations.

    CREATED / MODIFIED events call ``pipeline.index_file()``.
    DELETED events call ``pipeline.remove_file()``.

    Each file is processed in isolation so that a single failure
    does not prevent the rest of the batch from being indexed.

    Example::

        indexer = IncrementalIndexer(pipeline, root=Path("/project"))
        result = indexer.process_events([
            FileEvent(Path("src/main.py"), ChangeType.MODIFIED),
        ])
        print(f"Indexed {result.files_indexed}, removed {result.files_removed}")
    """

    def __init__(
        self,
        pipeline: IndexingPipeline,
        *,
        root: Optional[Path] = None,
    ) -> None:
        """Initialize the incremental indexer.

        Args:
            pipeline: The indexing pipeline with a metadata store configured.
            root: Optional project root for computing relative paths.
                If None, absolute paths are used as identifiers.
        """
        self._pipeline = pipeline
        self._root = root

    def process_events(self, events: List[FileEvent]) -> BatchResult:
        """Process a batch of file events with per-file error isolation.

        Args:
            events: List of file events to process.

        Returns:
            BatchResult with per-batch statistics.
        """
        result = BatchResult()

        for event in events:
            try:
                if event.change_type in (ChangeType.CREATED, ChangeType.MODIFIED):
                    self._handle_index(event, result)
                elif event.change_type == ChangeType.DELETED:
                    self._handle_remove(event, result)
            except Exception as exc:
                error_msg = (
                    f"Error processing {event.path} "
                    f"({event.change_type.value}): "
                    f"{type(exc).__name__}: {exc}"
                )
                logger.error(error_msg)
                result.errors.append(error_msg)

        if result.total_processed > 0:
            logger.info(
                "Batch complete: %d indexed, %d removed, %d errors",
                result.files_indexed,
                result.files_removed,
                len(result.errors),
            )

        return result

    def _handle_index(self, event: FileEvent, result: BatchResult) -> None:
        """Index a created or modified file."""
        stats = self._pipeline.index_file(
            event.path,
            root=self._root,
            force=(event.change_type == ChangeType.MODIFIED),
        )
        if stats.files_processed > 0:
            result.files_indexed += 1
            result.chunks_created += stats.chunks_created

    def _handle_remove(self, event: FileEvent, result: BatchResult) -> None:
        """Remove a deleted file from the index."""
        rel_path = (
            str(event.path.relative_to(self._root))
            if self._root
            else str(event.path)
        )
        self._pipeline.remove_file(rel_path)
        result.files_removed += 1
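The per-file error isolation in `process_events` is a general try/except-per-item pattern: record the failure, keep going. A minimal sketch (names illustrative, not part of the diff):

```python
def process_batch(items, handler):
    # Per-item isolation: one failing item is recorded and the rest of
    # the batch still runs, mirroring IncrementalIndexer.process_events.
    processed, errors = 0, []
    for item in items:
        try:
            handler(item)
            processed += 1
        except Exception as exc:
            errors.append(f"{item}: {type(exc).__name__}: {exc}")
    return processed, errors

def handler(x):
    if x == "bad":
        raise ValueError("boom")

processed, errors = process_batch(["a", "bad", "b"], handler)
assert processed == 2
assert len(errors) == 1 and "ValueError" in errors[0]
```

Catching bare `Exception` is deliberate here: a watcher batch must survive any per-file failure (permissions, decode errors, races with deletion), and the error list surfaces what went wrong.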
codex-lens-v2/tests/unit/test_incremental.py (new file, 388 lines)
@@ -0,0 +1,388 @@
"""Unit tests for the IndexingPipeline incremental API (index_file, remove_file, sync, compact)."""
from __future__ import annotations

import tempfile
from pathlib import Path
from unittest.mock import MagicMock

import numpy as np
import pytest

from codexlens_search.config import Config
from codexlens_search.core.binary import BinaryStore
from codexlens_search.core.index import ANNIndex
from codexlens_search.embed.base import BaseEmbedder
from codexlens_search.indexing.metadata import MetadataStore
from codexlens_search.indexing.pipeline import IndexingPipeline, IndexStats
from codexlens_search.search.fts import FTSEngine


DIM = 32


class FakeEmbedder(BaseEmbedder):
    """Deterministic embedder for testing."""

    def __init__(self) -> None:
        pass

    def embed_single(self, text: str) -> np.ndarray:
        rng = np.random.default_rng(hash(text) % (2**31))
        return rng.standard_normal(DIM).astype(np.float32)

    def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
        return [self.embed_single(t) for t in texts]


@pytest.fixture
def workspace(tmp_path: Path):
    """Create a workspace with stores, metadata, and pipeline."""
    cfg = Config.small()
    # Override embed_dim to match our test dim
    cfg.embed_dim = DIM

    store_dir = tmp_path / "stores"
    store_dir.mkdir()

    binary_store = BinaryStore(store_dir, DIM, cfg)
    ann_index = ANNIndex(store_dir, DIM, cfg)
    fts = FTSEngine(str(store_dir / "fts.db"))
    metadata = MetadataStore(str(store_dir / "metadata.db"))
    embedder = FakeEmbedder()

    pipeline = IndexingPipeline(
        embedder=embedder,
        binary_store=binary_store,
        ann_index=ann_index,
        fts=fts,
        config=cfg,
        metadata=metadata,
    )

    # Create sample source files
    src_dir = tmp_path / "src"
    src_dir.mkdir()

    return {
        "pipeline": pipeline,
        "metadata": metadata,
        "binary_store": binary_store,
        "ann_index": ann_index,
        "fts": fts,
        "src_dir": src_dir,
        "store_dir": store_dir,
        "config": cfg,
    }


def _write_file(src_dir: Path, name: str, content: str) -> Path:
    """Write a file and return its path."""
    p = src_dir / name
    p.write_text(content, encoding="utf-8")
    return p


# ---------------------------------------------------------------------------
# MetadataStore helper method tests
# ---------------------------------------------------------------------------


class TestMetadataHelpers:
    def test_get_all_files_empty(self, workspace):
        meta = workspace["metadata"]
        assert meta.get_all_files() == {}

    def test_get_all_files_after_register(self, workspace):
        meta = workspace["metadata"]
        meta.register_file("a.py", "hash_a", 1000.0)
        meta.register_file("b.py", "hash_b", 2000.0)
        result = meta.get_all_files()
        assert result == {"a.py": "hash_a", "b.py": "hash_b"}

    def test_max_chunk_id_empty(self, workspace):
        meta = workspace["metadata"]
        assert meta.max_chunk_id() == -1

    def test_max_chunk_id_with_chunks(self, workspace):
        meta = workspace["metadata"]
        meta.register_file("a.py", "hash_a", 1000.0)
        meta.register_chunks("a.py", [(0, "h0"), (1, "h1"), (5, "h5")])
        assert meta.max_chunk_id() == 5

    def test_max_chunk_id_includes_deleted(self, workspace):
        meta = workspace["metadata"]
        meta.register_file("a.py", "hash_a", 1000.0)
        meta.register_chunks("a.py", [(0, "h0"), (3, "h3")])
        meta.mark_file_deleted("a.py")
        # Chunks moved to deleted_chunks; the max should still be 3
        assert meta.max_chunk_id() == 3


# ---------------------------------------------------------------------------
# index_file tests
# ---------------------------------------------------------------------------


class TestIndexFile:
    def test_index_file_basic(self, workspace):
        pipeline = workspace["pipeline"]
        meta = workspace["metadata"]
        src_dir = workspace["src_dir"]

        f = _write_file(src_dir, "hello.py", "print('hello world')\n")
        stats = pipeline.index_file(f, root=src_dir)

        assert stats.files_processed == 1
        assert stats.chunks_created >= 1
        assert meta.get_file_hash("hello.py") is not None
        assert len(meta.get_chunk_ids_for_file("hello.py")) >= 1

    def test_index_file_skips_unchanged(self, workspace):
        pipeline = workspace["pipeline"]
        src_dir = workspace["src_dir"]

        f = _write_file(src_dir, "same.py", "x = 1\n")
        stats1 = pipeline.index_file(f, root=src_dir)
        assert stats1.files_processed == 1

        stats2 = pipeline.index_file(f, root=src_dir)
        assert stats2.files_processed == 0
        assert stats2.chunks_created == 0

    def test_index_file_force_reindex(self, workspace):
        pipeline = workspace["pipeline"]
        src_dir = workspace["src_dir"]

        f = _write_file(src_dir, "force.py", "x = 1\n")
        pipeline.index_file(f, root=src_dir)

        stats = pipeline.index_file(f, root=src_dir, force=True)
        assert stats.files_processed == 1
        assert stats.chunks_created >= 1
|
||||
|
||||
def test_index_file_updates_changed_file(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "changing.py", "version = 1\n")
|
||||
pipeline.index_file(f, root=src_dir)
|
||||
old_chunks = meta.get_chunk_ids_for_file("changing.py")
|
||||
|
||||
# Modify file
|
||||
f.write_text("version = 2\nmore code\n", encoding="utf-8")
|
||||
stats = pipeline.index_file(f, root=src_dir)
|
||||
assert stats.files_processed == 1
|
||||
|
||||
new_chunks = meta.get_chunk_ids_for_file("changing.py")
|
||||
# Old chunks should have been tombstoned, new ones assigned
|
||||
assert set(old_chunks) != set(new_chunks)
|
||||
|
||||
def test_index_file_registers_in_metadata(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
fts = workspace["fts"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "meta_test.py", "def foo(): pass\n")
|
||||
pipeline.index_file(f, root=src_dir)
|
||||
|
||||
# MetadataStore has file registered
|
||||
assert meta.get_file_hash("meta_test.py") is not None
|
||||
chunk_ids = meta.get_chunk_ids_for_file("meta_test.py")
|
||||
assert len(chunk_ids) >= 1
|
||||
|
||||
# FTS has the content
|
||||
fts_ids = fts.get_chunk_ids_by_path("meta_test.py")
|
||||
assert len(fts_ids) >= 1
|
||||
|
||||
def test_index_file_no_metadata_raises(self, workspace):
|
||||
cfg = workspace["config"]
|
||||
pipeline_no_meta = IndexingPipeline(
|
||||
embedder=FakeEmbedder(),
|
||||
binary_store=workspace["binary_store"],
|
||||
ann_index=workspace["ann_index"],
|
||||
fts=workspace["fts"],
|
||||
config=cfg,
|
||||
)
|
||||
f = _write_file(workspace["src_dir"], "no_meta.py", "x = 1\n")
|
||||
with pytest.raises(RuntimeError, match="MetadataStore is required"):
|
||||
pipeline_no_meta.index_file(f)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# remove_file tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestRemoveFile:
|
||||
def test_remove_file_tombstones_and_fts(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
fts = workspace["fts"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "to_remove.py", "data = [1, 2, 3]\n")
|
||||
pipeline.index_file(f, root=src_dir)
|
||||
|
||||
chunk_ids = meta.get_chunk_ids_for_file("to_remove.py")
|
||||
assert len(chunk_ids) >= 1
|
||||
|
||||
pipeline.remove_file("to_remove.py")
|
||||
|
||||
# File should be gone from metadata
|
||||
assert meta.get_file_hash("to_remove.py") is None
|
||||
assert meta.get_chunk_ids_for_file("to_remove.py") == []
|
||||
|
||||
# Chunks should be in deleted_chunks
|
||||
deleted = meta.get_deleted_ids()
|
||||
for cid in chunk_ids:
|
||||
assert cid in deleted
|
||||
|
||||
# FTS should be cleared
|
||||
assert fts.get_chunk_ids_by_path("to_remove.py") == []
|
||||
|
||||
def test_remove_nonexistent_file(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
# Should not raise
|
||||
pipeline.remove_file("nonexistent.py")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# sync tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestSync:
|
||||
def test_sync_indexes_new_files(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f1 = _write_file(src_dir, "a.py", "a = 1\n")
|
||||
f2 = _write_file(src_dir, "b.py", "b = 2\n")
|
||||
|
||||
stats = pipeline.sync([f1, f2], root=src_dir)
|
||||
assert stats.files_processed == 2
|
||||
assert meta.get_file_hash("a.py") is not None
|
||||
assert meta.get_file_hash("b.py") is not None
|
||||
|
||||
def test_sync_removes_missing_files(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f1 = _write_file(src_dir, "keep.py", "keep = True\n")
|
||||
f2 = _write_file(src_dir, "remove.py", "remove = True\n")
|
||||
|
||||
pipeline.sync([f1, f2], root=src_dir)
|
||||
assert meta.get_file_hash("remove.py") is not None
|
||||
|
||||
# Sync with only f1 -- f2 should be removed
|
||||
stats = pipeline.sync([f1], root=src_dir)
|
||||
assert meta.get_file_hash("remove.py") is None
|
||||
deleted = meta.get_deleted_ids()
|
||||
assert len(deleted) > 0
|
||||
|
||||
def test_sync_detects_changed_files(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "mutable.py", "v1\n")
|
||||
pipeline.sync([f], root=src_dir)
|
||||
old_hash = meta.get_file_hash("mutable.py")
|
||||
|
||||
f.write_text("v2\n", encoding="utf-8")
|
||||
stats = pipeline.sync([f], root=src_dir)
|
||||
assert stats.files_processed == 1
|
||||
new_hash = meta.get_file_hash("mutable.py")
|
||||
assert old_hash != new_hash
|
||||
|
||||
def test_sync_skips_unchanged(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "stable.py", "stable = True\n")
|
||||
pipeline.sync([f], root=src_dir)
|
||||
|
||||
# Second sync with same file, unchanged
|
||||
stats = pipeline.sync([f], root=src_dir)
|
||||
assert stats.files_processed == 0
|
||||
assert stats.chunks_created == 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# compact tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestCompact:
|
||||
def test_compact_removes_tombstoned_from_binary_store(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
binary_store = workspace["binary_store"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f1 = _write_file(src_dir, "alive.py", "alive = True\n")
|
||||
f2 = _write_file(src_dir, "dead.py", "dead = True\n")
|
||||
|
||||
pipeline.index_file(f1, root=src_dir)
|
||||
pipeline.index_file(f2, root=src_dir)
|
||||
|
||||
count_before = binary_store._count
|
||||
assert count_before >= 2
|
||||
|
||||
pipeline.remove_file("dead.py")
|
||||
pipeline.compact()
|
||||
|
||||
# BinaryStore should have fewer entries
|
||||
assert binary_store._count < count_before
|
||||
# deleted_chunks should be cleared
|
||||
assert meta.get_deleted_ids() == set()
|
||||
|
||||
def test_compact_noop_when_no_deletions(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
meta = workspace["metadata"]
|
||||
binary_store = workspace["binary_store"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f = _write_file(src_dir, "solo.py", "solo = True\n")
|
||||
pipeline.index_file(f, root=src_dir)
|
||||
count_before = binary_store._count
|
||||
|
||||
pipeline.compact()
|
||||
assert binary_store._count == count_before
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Backward compatibility: existing batch API still works
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestBatchAPIUnchanged:
|
||||
def test_index_files_still_works(self, workspace):
|
||||
pipeline = workspace["pipeline"]
|
||||
src_dir = workspace["src_dir"]
|
||||
|
||||
f1 = _write_file(src_dir, "batch1.py", "batch1 = 1\n")
|
||||
f2 = _write_file(src_dir, "batch2.py", "batch2 = 2\n")
|
||||
|
||||
stats = pipeline.index_files([f1, f2], root=src_dir)
|
||||
assert stats.files_processed == 2
|
||||
assert stats.chunks_created >= 2
|
||||
|
||||
def test_index_files_works_without_metadata(self, workspace):
|
||||
"""Batch API should work even without MetadataStore."""
|
||||
cfg = workspace["config"]
|
||||
pipeline_no_meta = IndexingPipeline(
|
||||
embedder=FakeEmbedder(),
|
||||
binary_store=BinaryStore(workspace["store_dir"] / "no_meta", DIM, cfg),
|
||||
ann_index=ANNIndex(workspace["store_dir"] / "no_meta", DIM, cfg),
|
||||
fts=FTSEngine(str(workspace["store_dir"] / "no_meta_fts.db")),
|
||||
config=cfg,
|
||||
)
|
||||
src_dir = workspace["src_dir"]
|
||||
f = _write_file(src_dir, "no_meta_batch.py", "x = 1\n")
|
||||
stats = pipeline_no_meta.index_files([f], root=src_dir)
|
||||
assert stats.files_processed == 1
|
||||
Reference in New Issue
Block a user