# ContextWeaver

🧵 Codebase Context Engine for AI Agents

Semantic Code Retrieval for AI Agents: Hybrid Search • Graph Expansion • Token-Aware Packing

---

**ContextWeaver** is a semantic search engine designed for AI code assistants. It combines hybrid search (vector + lexical), intelligent context expansion, and token-aware packing to give LLMs accurate, relevant, and context-complete code snippets.

<img src="docs/architecture.png" alt="ContextWeaver Architecture Overview" width="800" />

## ✨ Core Features

### 🔍 Hybrid Search Engine

- **Vector Retrieval**: Semantic-similarity matching that captures the meaning of a query
- **Lexical Retrieval (FTS)**: Exact matching of function names, class names, and other technical terms
- **RRF Fusion (Reciprocal Rank Fusion)**: Merges the results of both recall paths into a single ranking

### 🧠 AST Semantic Chunking

- **Tree-sitter Parsing**: Supports TypeScript, JavaScript, Python, Go, Java, and Rust
- **Dual-Text Strategy**: `displayCode` for display, `vectorText` for embedding
- **Gap-Aware Merging**: Handles gaps between code fragments while preserving semantic integrity
- **Breadcrumb Injection**: The vector text includes the hierarchical path, which improves retrieval recall

### 📊 Three-Stage Context Expansion

- **E1 Neighbor Expansion**: Adjacent chunks in the same file, keeping code blocks complete
- **E2 Breadcrumb Completion**: Other methods under the same class/function, exposing the overall structure
- **E3 Import Parsing**: Cross-file dependency tracking (configurable switch)

### 🎯 Smart Truncation Strategy (Smart TopK)

- **Anchor & Floor**: A dynamic threshold combined with an absolute lower limit
- **Delta Guard**: Prevents misjudgment when the Top1 result is an outlier
- **Safe Harbor**: The first N results are checked only against the lower limit, guaranteeing basic recall

### 🔌 MCP Native Support

- **MCP Server Mode**: One-command startup of a Model Context Protocol server
- **Intent and Terminology Separation**: LLM-friendly API design
- **Automatic Indexing**: The first query triggers indexing automatically, and incremental updates happen transparently

## 📦 Quick Start

### Environment Requirements

- Node.js >= 20
- pnpm (recommended) or npm

### Installation

```bash
# Global installation
npm install -g @hsingjui/contextweaver

# or use pnpm
pnpm add -g @hsingjui/contextweaver
```

### Initialization Configuration

```bash
# Initialize configuration file (creates ~/.contextweaver/.env)
contextweaver init
# or the short form
cw init
```

Edit `~/.contextweaver/.env` and fill in your API key:

```bash
# Embedding API configuration (required)
EMBEDDINGS_API_KEY=your-api-key-here
EMBEDDINGS_BASE_URL=https://api.siliconflow.cn/v1/embeddings
EMBEDDINGS_MODEL=BAAI/bge-m3
EMBEDDINGS_MAX_CONCURRENCY=10
EMBEDDINGS_DIMENSIONS=1024

# Reranker configuration (required)
RERANK_API_KEY=your-api-key-here
RERANK_BASE_URL=https://api.siliconflow.cn/v1/rerank
RERANK_MODEL=BAAI/bge-reranker-v2-m3
RERANK_TOP_N=20

# Ignore patterns (optional, comma-separated)
# IGNORE_PATTERNS=.venv,node_modules
```

### Index Codebase

```bash
# Run in the codebase root directory
contextweaver index

# Specify a path
contextweaver index /path/to/your/project

# Force re-indexing
contextweaver index --force
```

### Local Search

```bash
# Semantic search
cw search --information-request "How is the user authentication process implemented?"

# With precise technical terms
cw search --information-request "Database connection logic" --technical-terms "DatabasePool,Connection"
```

### Start MCP Server

```bash
# Start MCP server (for AI assistants like Claude)
contextweaver mcp
```

## 🔧 MCP Integration Configuration

### Claude Desktop Configuration

Add to Claude Desktop's configuration file:

```json
{
  "mcpServers": {
    "contextweaver": {
      "command": "contextweaver",
      "args": ["mcp"]
    }
  }
}
```

### MCP Tool Description

ContextWeaver provides one core MCP tool: `codebase-retrieval`

#### Parameter Description

| Parameter | Type | Required | Description |
|------|------|------|------|
| `repo_path` | string | ✅ | Absolute path of the codebase root directory |
| `information_request` | string | ✅ | Semantic intent description in natural language |
| `technical_terms` | string[] | ❌ | Precise technical terms (class names, function names, etc.) |

#### Design Philosophy

- **Intent and Terminology Separation**: `information_request` describes "what to do", `technical_terms` filters by "what it's called"
- **Same-file Context Priority**: Same-file context is provided by default; cross-file exploration is initiated by the Agent
- **Return to Agent Instinct**: The tool is only responsible for locating code; the Agent triggers cross-file exploration as needed

## 🏗️ Architecture Design

```mermaid
flowchart TB
    subgraph Interface["CLI / MCP Interface"]
        CLI[contextweaver CLI]
        MCP[MCP Server]
    end

    subgraph Search["SearchService"]
        VR[Vector Retrieval]
        LR[Lexical Retrieval]
        RRF[RRF Fusion + Rerank]
        VR --> RRF
        LR --> RRF
    end

    subgraph Expand["Context Expansion"]
        GE[GraphExpander]
        CP[ContextPacker]
        GE --> CP
    end

    subgraph Storage["Storage Layer"]
        VS[(VectorStore LanceDB)]
        DB[(SQLite FTS5)]
    end

    subgraph Index["Indexing Pipeline"]
        CR[Crawler fdir] --> SS[SemanticSplitter Tree-sitter] --> IX[Indexer Batch Embedding]
    end

    Interface --> Search
    RRF --> GE
    Search <--> Storage
    Expand <--> Storage
    Index --> Storage
```

### Core Module Description

| Module | Responsibility |
|------|------|
| **SearchService** | Hybrid search core, coordinating vector/lexical recall, RRF fusion, and rerank refinement |
| **GraphExpander** | Context expander, executing the E1/E2/E3 three-stage expansion strategy |
| **ContextPacker** | Context packer, responsible for segment merging and token budget control |
| **VectorStore** | LanceDB adapter layer, managing vector index inserts, deletes, and queries |
| **SQLite (FTS5)** | Metadata storage + full-text search index |
| **SemanticSplitter** | AST semantic splitter, based on Tree-sitter parsing |

## 📁 Project Structure

```
contextweaver/
├── src/
│   ├── index.ts                      # CLI entry
│   ├── config.ts                     # Configuration management (environment variables)
│   ├── api/                          # External API wrappers
│   │   ├── embed.ts                  # Embedding API
│   │   └── rerank.ts                 # Reranker API
│   ├── chunking/                     # Semantic chunking
│   │   ├── SemanticSplitter.ts       # AST semantic splitter
│   │   ├── SourceAdapter.ts          # Source code adapter
│   │   ├── LanguageSpec.ts           # Language specification definition
│   │   └── ParserPool.ts             # Tree-sitter parser pool
│   ├── scanner/                      # File scanning
│   │   ├── crawler.ts                # File system traversal
│   │   ├── processor.ts              # File processing
│   │   └── filter.ts                 # Filtering rules
│   ├── indexer/                      # Indexer
│   │   └── index.ts                  # Batch indexing logic
│   ├── vectorStore/                  # Vector storage
│   │   └── index.ts                  # LanceDB adapter layer
│   ├── db/                           # Database
│   │   └── index.ts                  # SQLite + FTS5
│   ├── search/                       # Search service
│   │   ├── SearchService.ts          # Core search service
│   │   ├── GraphExpander.ts          # Context expander
│   │   ├── ContextPacker.ts          # Context packer
│   │   ├── fts.ts                    # Full-text search
│   │   ├── config.ts                 # Search configuration
│   │   ├── types.ts                  # Type definitions
│   │   └── resolvers/                # Multi-language import resolvers
│   │       ├── JsTsResolver.ts
│   │       ├── PythonResolver.ts
│   │       ├── GoResolver.ts
│   │       ├── JavaResolver.ts
│   │       └── RustResolver.ts
│   ├── mcp/                          # MCP server
│   │   ├── server.ts                 # MCP server implementation
│   │   ├── main.ts                   # MCP entry
│   │   └── tools/
│   │       └── codebaseRetrieval.ts  # Code retrieval tool
│   └── utils/                        # Utility functions
│       └── logger.ts                 # Logging system
├── package.json
└── tsconfig.json
```

## ⚙️ Configuration Details

### Environment Variables

| Variable Name | Required | Default Value | Description |
|--------|------|--------|------|
| `EMBEDDINGS_API_KEY` | ✅ | - | Embedding API key |
| `EMBEDDINGS_BASE_URL` | ✅ | - | Embedding API address |
| `EMBEDDINGS_MODEL` | ✅ | - | Embedding model name |
| `EMBEDDINGS_MAX_CONCURRENCY` | ❌ | 10 | Embedding concurrency |
| `EMBEDDINGS_DIMENSIONS` | ❌ | 1024 | Vector dimensions |
| `RERANK_API_KEY` | ✅ | - | Reranker API key |
| `RERANK_BASE_URL` | ✅ | - | Reranker API address |
| `RERANK_MODEL` | ✅ | - | Reranker model name |
| `RERANK_TOP_N` | ❌ | 20 | Number of results returned by rerank |
| `IGNORE_PATTERNS` | ❌ | - | Additional ignore patterns |

### Search Configuration Parameters

```typescript
interface SearchConfig {
  // === Recall stage ===
  vectorTopK: number;            // Number of vector recall results (default 30)
  vectorTopM: number;            // Number of vector results passed to fusion (default 30)
  ftsTopKFiles: number;          // Number of files recalled by FTS (default 15)
  lexChunksPerFile: number;      // Lexical chunks per file (default 3)
  lexTotalChunks: number;        // Total lexical chunks (default 30)

  // === Fusion stage ===
  rrfK0: number;                 // RRF smoothing constant (default 60)
  wVec: number;                  // Vector weight (default 1.0)
  wLex: number;                  // Lexical weight (default 0.5)
  fusedTopM: number;             // Number of fused results sent to rerank (default 40)

  // === Rerank ===
  rerankTopN: number;            // Number of results kept after rerank (default 10)
  maxRerankChars: number;        // Maximum characters of rerank text (default 1200)

  // === Expansion strategy ===
  neighborHops: number;          // E1 neighbor hops (default 2)
  breadcrumbExpandLimit: number; // E2 breadcrumb completion count (default 3)
  importFilesPerSeed: number;    // E3 import files per seed (default 0)
  chunksPerImportFile: number;   // E3 chunks per import file (default 0)

  // === Smart TopK ===
  enableSmartTopK: boolean;      // Enable smart truncation (default true)
  smartTopScoreRatio: number;    // Dynamic threshold ratio (default 0.5)
  smartMinScore: number;         // Absolute lower limit (default 0.25)
  smartMinK: number;             // Safe Harbor count (default 2)
  smartMaxK: number;             // Hard upper limit (default 15)
}
```

## 🌍 Multi-Language Support

ContextWeaver uses Tree-sitter to provide AST parsing for the following programming languages:

| Language | AST Parsing | Import Parsing | File Extensions |
|------|----------|-------------|-----------|
| TypeScript | ✅ | ✅ | `.ts`, `.tsx` |
| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs` |
| Python | ✅ | ✅ | `.py` |
| Go | ✅ | ✅ | `.go` |
| Java | ✅ | ✅ | `.java` |
| Rust | ✅ | ✅ | `.rs` |

Other languages fall back to a line-based chunking strategy and remain indexable and searchable.

## 🔄 Workflow

### Indexing Process

```
1. Crawler     → Traverse file system, filter ignored items
2. Processor   → Read file content, calculate hash
3. Splitter    → AST parsing, semantic chunking
4. Indexer     → Batch embedding, write to vector database
5. FTS Index   → Update full-text search index
```

### Search Process

```
1. Query Parse    → Parse query, separate semantics and terms
2. Hybrid Recall  → Vector + lexical dual-path recall
3. RRF Fusion     → Reciprocal Rank Fusion
4. Rerank         → Cross-encoder reranking
5. Smart Cutoff   → Intelligent score truncation
6. Graph Expand   → Neighbor/breadcrumb/import expansion
7. Context Pack   → Segment merging, token budget
8. Format Output  → Format the result returned to the LLM
```

## 📊 Performance Characteristics

- **Incremental Indexing**: Only changed files are processed, making re-indexing 10x+ faster
- **Batch Embedding**: Adaptive batch size with concurrency control
- **Rate Limit Recovery**: Automatic backoff on 429 errors, followed by gradual recovery
- **Parser Pool Reuse**: Tree-sitter parsers are pooled and reused
- **File Index Cache**: The GraphExpander file path index is lazily loaded

## 🐛 Logs and Debugging

Log file location: `~/.contextweaver/logs/app.YYYY-MM-DD.log`

Set the log level:

```bash
# Enable debug log
LOG_LEVEL=debug contextweaver search --information-request "..."
```

## 📄 License

This project uses the MIT license.

## 🙏 Acknowledgements

- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) - High-performance syntax parsing
- [LanceDB](https://lancedb.com/) - Embedded vector database
- [MCP](https://modelcontextprotocol.io/) - Model Context Protocol
- [SiliconFlow](https://siliconflow.cn/) - Recommended Embedding/Reranker API service

Made with ❤️ for AI-assisted coding
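
To make the RRF Fusion step of the search process more concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is not ContextWeaver's actual code: the `rrfFuse` function and `RankedList` type are hypothetical names, and the formula (each list contributes `weight / (k0 + rank)` per chunk) is the common RRF form, assumed here to correspond to the `rrfK0`, `wVec`, and `wLex` parameters in `SearchConfig`.

```typescript
// Hypothetical sketch of weighted Reciprocal Rank Fusion (RRF).
// Not taken from the ContextWeaver source; names are illustrative only.
type RankedList = { ids: string[]; weight: number };

function rrfFuse(lists: RankedList[], k0 = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { ids, weight } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      // Each list adds weight / (k0 + rank) to the chunk's fused score.
      scores.set(id, (scores.get(id) || 0) + weight / (k0 + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: vector recall weighted 1.0 (wVec), lexical recall weighted 0.5 (wLex).
const fused = rrfFuse([
  { ids: ["a", "b", "c"], weight: 1.0 },
  { ids: ["c", "a", "d"], weight: 0.5 },
]);
console.log(fused.map((r) => r.id));
```

Chunks that appear in both lists accumulate score from each, so a chunk ranked moderately well by both retrievers can outrank one that only a single retriever found.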
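
The Smart TopK strategy (Anchor & Floor plus Safe Harbor) can be illustrated in a similar way. The sketch below is an assumption about how those rules might combine, based only on the feature descriptions and the `smart*` defaults in `SearchConfig`; `smartCutoff` and `SmartTopKOptions` are hypothetical names, and Delta Guard is deliberately left out because the document does not describe its rule.

```typescript
// Hypothetical sketch of the Smart TopK cutoff; not ContextWeaver's actual code.
// Delta Guard is omitted because its exact rule is not described in the README.
interface SmartTopKOptions {
  topScoreRatio: number; // dynamic threshold = top score * ratio (smartTopScoreRatio)
  minScore: number;      // absolute lower limit (smartMinScore)
  minK: number;          // Safe Harbor size (smartMinK)
  maxK: number;          // hard upper limit (smartMaxK)
}

// `scores` is assumed to be sorted in descending order (rerank output).
function smartCutoff(scores: number[], opts: SmartTopKOptions): number[] {
  if (scores.length === 0) return [];
  const dynamicThreshold = scores[0] * opts.topScoreRatio; // the "anchor"
  const kept: number[] = [];
  for (let i = 0; i < scores.length && kept.length < opts.maxK; i++) {
    const score = scores[i];
    // The floor applies to every result, including those in the Safe Harbor.
    if (score < opts.minScore) break;
    // Beyond the Safe Harbor, the dynamic threshold also applies.
    if (i >= opts.minK && score < dynamicThreshold) break;
    kept.push(score);
  }
  return kept;
}

const defaults: SmartTopKOptions = { topScoreRatio: 0.5, minScore: 0.25, minK: 2, maxK: 15 };
const keptScores = smartCutoff([0.9, 0.4, 0.35, 0.2], defaults);
console.log(keptScores);
```

With these defaults the dynamic threshold is 0.45, so the second result (0.4) survives only because it falls inside the Safe Harbor, while the third (0.35) is cut.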
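
For readers wiring up their own MCP client, the parameter table for `codebase-retrieval` maps naturally onto a typed request. The sketch below builds a standard MCP `tools/call` JSON-RPC message; the `CodebaseRetrievalArgs` type and `buildToolCall` helper are hypothetical names introduced for illustration, and the repository path is a placeholder.

```typescript
// Illustrative sketch: argument shape taken from the parameter table above.
// The type and helper names are hypothetical, not part of ContextWeaver.
interface CodebaseRetrievalArgs {
  repo_path: string;            // absolute path of the codebase root directory
  information_request: string;  // natural-language description of the intent
  technical_terms?: string[];   // optional exact identifiers to match
}

// Wraps the arguments in an MCP `tools/call` JSON-RPC 2.0 request.
function buildToolCall(id: number, args: CodebaseRetrievalArgs) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: "codebase-retrieval", arguments: args },
  };
}

const request = buildToolCall(1, {
  repo_path: "/path/to/your/project", // placeholder path
  information_request: "Database connection logic",
  technical_terms: ["DatabasePool", "Connection"],
});
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library sends this message for you; the sketch only shows how the intent (`information_request`) and the exact identifiers (`technical_terms`) travel as separate fields.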
