# ContextWeaver
<p align="center">
<strong>🧵 Codebase Context Engine Carefully Woven for AI Agents</strong>
</p>
<p align="center">
<em>Semantic Code Retrieval for AI Agents — Hybrid Search • Graph Expansion • Token-Aware Packing</em>
</p>
---
**ContextWeaver** is a semantic retrieval engine for AI code assistants. It combines hybrid search (vector + lexical), context expansion, and token-aware packing to give LLMs precise, relevant, and contextually complete code snippets.
<p align="center">
<img src="docs/architecture.png" alt="ContextWeaver architecture overview" width="800" />
</p>
## ✨ Core Features
### 🔍 Hybrid Retrieval Engine
- **Vector Retrieval**: Finds code by semantic similarity between embeddings
- **Lexical/FTS**: Exact matching of technical terms such as function names and class names
- **RRF Fusion (Reciprocal Rank Fusion)**: Merges the ranked results of both recall channels into one list
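As a rough illustration of how the fusion step can combine both channels, the sketch below implements weighted Reciprocal Rank Fusion. The function name `fuseRrf`, the `ChunkId` type, and the option names are hypothetical; the default values mirror the `rrfK0`, `wVec`, and `wLex` defaults listed under Search Configuration Parameters, and ContextWeaver's actual implementation may differ.

```typescript
// Hypothetical shapes for illustration; not ContextWeaver's real types.
type ChunkId = string;

interface RrfOptions {
  k0: number;   // smoothing constant (README default for rrfK0: 60)
  wVec: number; // weight of the vector channel (README default: 1.0)
  wLex: number; // weight of the lexical channel (README default: 0.5)
}

// Weighted RRF: each channel contributes weight / (k0 + rank),
// where rank is the 1-based position of the chunk in that channel's list.
function fuseRrf(
  vectorRanked: ChunkId[],
  lexicalRanked: ChunkId[],
  opts: RrfOptions = { k0: 60, wVec: 1.0, wLex: 0.5 },
): Array<{ id: ChunkId; score: number }> {
  const scores = new Map<ChunkId, number>();
  const addChannel = (ranked: ChunkId[], weight: number): void => {
    ranked.forEach((id, index) => {
      const contribution = weight / (opts.k0 + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  };
  addChannel(vectorRanked, opts.wVec);
  addChannel(lexicalRanked, opts.wLex);
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// "a" and "c" appear in both channels, so they outrank "b".
const fused = fuseRrf(["a", "b", "c"], ["c", "a"]);
console.log(fused.map((r) => r.id)); // prints [ 'a', 'c', 'b' ]
```

Because RRF only looks at ranks, it does not need the vector and lexical scores to be on a comparable scale.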
### 🧠 AST Semantic Sharding
- **Tree-sitter Parsing**: Supports six major languages: TypeScript, JavaScript, Python, Go, Java, and Rust
- **Dual-Text Strategy**: `displayCode` for display, `vectorText` for embedding
- **Gap-Aware Merging**: Handles gaps between code fragments to keep chunks semantically intact
- **Breadcrumb Injection**: Vector text includes the hierarchical path of each chunk to improve retrieval recall
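The dual-text strategy and breadcrumb injection can be pictured with the minimal sketch below. Only the field names `displayCode` and `vectorText` come from this README; the `Chunk` shape, the `buildChunk` helper, and the `" > "` separator are assumptions made for illustration.

```typescript
// Hypothetical chunk shape; only displayCode and vectorText are named in the README.
interface Chunk {
  filePath: string;
  breadcrumb: string[]; // hierarchical path, e.g. ["class UserService", "method login"]
  displayCode: string;  // returned to the reader or LLM unchanged
  vectorText: string;   // sent to the embedding model
}

function buildChunk(filePath: string, breadcrumb: string[], code: string): Chunk {
  // Prefix the embedding text with the hierarchical path so that a query
  // mentioning the class or file can match a method body that never names them.
  const header = [filePath, ...breadcrumb].join(" > ");
  return {
    filePath,
    breadcrumb,
    displayCode: code,
    vectorText: `// ${header}\n${code}`,
  };
}

const chunk = buildChunk(
  "src/auth.ts",
  ["class UserService", "method login"],
  "async login(user: string) { /* ... */ }",
);
console.log(chunk.vectorText);
```

Keeping the two texts separate means the breadcrumb helps retrieval without cluttering the code that is eventually shown.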
### 📊 Three-Stage Context Expansion
- **E1 Neighbor Expansion**: Adds the chunks immediately before and after a hit in the same file, so code blocks stay complete
- **E2 Breadcrumb Completion**: Adds other methods under the same class/function, so the overall structure is visible
- **E3 Import Parsing**: Follows cross-file dependencies (configurable, off by default)
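A minimal sketch of the E1 step, assuming the chunks of a file are addressed by their 0-based position. The function `expandNeighbors` is hypothetical; `hops` plays the role of the `neighborHops` setting (default 2) described under Search Configuration Parameters.

```typescript
// Hypothetical helper: returns the positions of all chunks within `hops`
// of any seed chunk, clamped to the bounds of the file.
function expandNeighbors(seedIndexes: number[], chunkCount: number, hops: number): number[] {
  const selected = new Set<number>();
  for (const seed of seedIndexes) {
    const start = Math.max(0, seed - hops);
    const end = Math.min(chunkCount - 1, seed + hops);
    for (let i = start; i <= end; i++) {
      selected.add(i);
    }
  }
  // Return positions in file order so the chunks can be stitched together.
  return Array.from(selected).sort((a, b) => a - b);
}

// A hit on chunk 4 of a 10-chunk file with 2 hops pulls in chunks 2..6.
console.log(expandNeighbors([4], 10, 2)); // prints [ 2, 3, 4, 5, 6 ]
```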
### 🎯 Smart TopK
- **Anchor & Floor**: A dynamic threshold relative to the top score, plus an absolute lower limit, as two safeguards
- **Delta Guard**: Prevents bad cutoffs when the Top1 score is an outlier
- **Safe Harbor**: The first N results are checked only against the lower limit, which guarantees a basic level of recall
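One way to read the Anchor & Floor and Safe Harbor rules is sketched below. The function `smartCutoff` and its option names are hypothetical; the defaults mirror `smartTopScoreRatio`, `smartMinScore`, `smartMinK`, and `smartMaxK` from Search Configuration Parameters. Delta Guard is left out because this README does not spell out its rule.

```typescript
interface SmartTopKOptions {
  topScoreRatio: number; // anchor: keep scores >= top score * ratio (README default 0.5)
  minScore: number;      // floor: absolute lower limit (README default 0.25)
  minK: number;          // safe harbor size (README default 2)
  maxK: number;          // hard upper limit (README default 15)
}

const SMART_DEFAULTS: SmartTopKOptions = {
  topScoreRatio: 0.5,
  minScore: 0.25,
  minK: 2,
  maxK: 15,
};

// Expects scores sorted in descending order; returns the scores that survive.
function smartCutoff(scoresDesc: number[], opts: SmartTopKOptions = SMART_DEFAULTS): number[] {
  const top = scoresDesc[0];
  if (top === undefined) return [];
  const dynamicThreshold = top * opts.topScoreRatio;
  const kept: number[] = [];
  for (let i = 0; i < scoresDesc.length && kept.length < opts.maxK; i++) {
    const score = scoresDesc[i];
    if (score === undefined) break;
    // Floor: applies to every result, including the safe harbor.
    if (score < opts.minScore) break;
    // Anchor: applies only after the first minK results (the safe harbor).
    if (i >= opts.minK && score < dynamicThreshold) break;
    kept.push(score);
  }
  return kept;
}

// Threshold is 0.45: 0.4 survives inside the safe harbor, 0.3 is cut.
console.log(smartCutoff([0.9, 0.4, 0.3, 0.2])); // prints [ 0.9, 0.4 ]
```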
### 🔌 MCP Native Support
- **MCP Server Mode**: Start a Model Context Protocol server with a single command
- **Zen Design Philosophy**: Separation of intent and terminology, LLM-friendly API design
- **Automatic Indexing**: The first query triggers indexing automatically; later incremental updates happen transparently
## 📦 Quick Start
### Environment Requirements
- Node.js >= 20
- pnpm (recommended) or npm
### Installation
```bash
# Global installation
npm install -g @hsingjui/contextweaver
# Or use pnpm
pnpm add -g @hsingjui/contextweaver
```
### Initialize Configuration
```bash
# Initialize the configuration file (create ~/.contextweaver/.env)
contextweaver init
# Or shorthand
cw init
```
Edit `~/.contextweaver/.env` and fill in your API keys:
```bash
# Embedding API configuration (required)
EMBEDDINGS_API_KEY=your-api-key-here
EMBEDDINGS_BASE_URL=https://api.siliconflow.cn/v1/embeddings
EMBEDDINGS_MODEL=BAAI/bge-m3
EMBEDDINGS_MAX_CONCURRENCY=10
EMBEDDINGS_DIMENSIONS=1024
# Reranker configuration (required)
RERANK_API_KEY=your-api-key-here
RERANK_BASE_URL=https://api.siliconflow.cn/v1/rerank
RERANK_MODEL=BAAI/bge-reranker-v2-m3
RERANK_TOP_N=20
# Ignore patterns (optional, comma separated)
# IGNORE_PATTERNS=.venv,node_modules
```
### Index Codebase
```bash
# Run in the root directory of the codebase
contextweaver index
# Specify path
contextweaver index /path/to/your/project
# Force reindexing
contextweaver index --force
```
### Local Search
```bash
# Semantic search
cw search --information-request "How is the user authentication flow implemented?"
# With exact technical terms
cw search --information-request "Database connection logic" --technical-terms "DatabasePool,Connection"
```
### Start MCP Server
```bash
# Start the MCP server (for use by Claude and other AI assistants)
contextweaver mcp
```
## 🔧 MCP Integration Configuration
### Claude Desktop Configuration
Add the following to the Claude Desktop configuration file:
```json
{
  "mcpServers": {
    "contextweaver": {
      "command": "contextweaver",
      "args": ["mcp"]
    }
  }
}
```
### MCP Tool Description
ContextWeaver exposes one core MCP tool: `codebase-retrieval`
#### Parameter Description
| Parameter | Type | Required | Description |
|------|------|------|------|
| `repo_path` | string | ✅ | Absolute path to the codebase root directory |
| `information_request` | string | ✅ | Semantic intent description in natural language |
| `technical_terms` | string[] | ❌ | Precise technical terms (class names, function names, etc.) |
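The parameter table above can be expressed as a TypeScript type, as in the sketch below. The interface name `CodebaseRetrievalArgs` and the `validateArgs` helper are hypothetical and only illustrate the documented constraints; the absolute-path check is deliberately simplified.

```typescript
// Argument shape taken from the parameter table; the type name is an assumption.
interface CodebaseRetrievalArgs {
  repo_path: string;            // absolute path to the codebase root
  information_request: string;  // natural-language description of the intent
  technical_terms?: string[];   // optional exact identifiers
}

// Hypothetical client-side check; returns a list of problems (empty when valid).
function validateArgs(args: CodebaseRetrievalArgs): string[] {
  const problems: string[] = [];
  // Simplified check: POSIX paths start with "/", Windows paths with a drive letter.
  if (!/^(\/|[A-Za-z]:[\\/])/.test(args.repo_path)) {
    problems.push("repo_path must be an absolute path");
  }
  if (args.information_request.trim() === "") {
    problems.push("information_request must not be empty");
  }
  return problems;
}

const exampleArgs: CodebaseRetrievalArgs = {
  repo_path: "/path/to/your/project",
  information_request: "How is the user authentication flow implemented?",
  technical_terms: ["DatabasePool", "Connection"],
};
console.log(validateArgs(exampleArgs)); // prints []
```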
#### Design Philosophy (Zen Design)
- **Separation of Intent and Terminology**: `information_request` describes what you are looking for; `technical_terms` filters by the exact names involved
- **Golden Defaults**: Returns same-file context and disables cross-file crawling by default
- **Return to Agent Instinct**: The tool only locates code; cross-file exploration is left to the agent
## 🏗️ Architecture Design
```mermaid
flowchart TB
subgraph Interface["CLI / MCP Interface"]
CLI[contextweaver CLI]
MCP[MCP Server]
end
subgraph Search["SearchService"]
VR[Vector Retrieval]
LR[Lexical Retrieval]
RRF[RRF Fusion + Rerank]
VR --> RRF
LR --> RRF
end
subgraph Expand["Context Expansion"]
GE[GraphExpander]
CP[ContextPacker]
GE --> CP
end
subgraph Storage["Storage Layer"]
VS[(VectorStore<br/>LanceDB)]
DB[(SQLite<br/>FTS5)]
end
subgraph Index["Indexing Pipeline"]
CR[Crawler<br/>fdir] --> SS[SemanticSplitter<br/>Tree-sitter] --> IX[Indexer<br/>Batch Embedding]
end
Interface --> Search
RRF --> GE
Search <--> Storage
Expand <--> Storage
Index --> Storage
```
### Core Module Description
| Module | Responsibility |
|------|------|
| **SearchService** | Hybrid search core; coordinates vector/lexical recall, RRF fusion, and reranking |
| **GraphExpander** | Context expander; runs the three-stage E1/E2/E3 expansion strategy |
| **ContextPacker** | Context packer; handles segment merging and token budget control |
| **VectorStore** | LanceDB adapter layer; manages vector index CRUD |
| **SQLite (FTS5)** | Metadata storage + full-text search index |
| **SemanticSplitter** | AST semantic splitter built on Tree-sitter parsing |
## 📁 Project Structure
```
contextweaver/
├── src/
│   ├── index.ts                    # CLI entry
│   ├── config.ts                   # Configuration management (environment variables)
│   ├── api/                        # External API encapsulation
│   │   ├── embed.ts                # Embedding API
│   │   └── rerank.ts               # Reranker API
│   ├── chunking/                   # Semantic sharding
│   │   ├── SemanticSplitter.ts     # AST semantic splitter
│   │   ├── SourceAdapter.ts        # Source code adapter
│   │   ├── LanguageSpec.ts         # Language specification definition
│   │   └── ParserPool.ts           # Tree-sitter parser pool
│   ├── scanner/                    # File scanning
│   │   ├── crawler.ts              # File system traversal
│   │   ├── processor.ts            # File processing
│   │   └── filter.ts               # Filtering rules
│   ├── indexer/                    # Indexer
│   │   └── index.ts                # Batch indexing logic
│   ├── vectorStore/                # Vector storage
│   │   └── index.ts                # LanceDB adaptation layer
│   ├── db/                         # Database
│   │   └── index.ts                # SQLite + FTS5
│   ├── search/                     # Search service
│   │   ├── SearchService.ts        # Core search service
│   │   ├── GraphExpander.ts        # Context expander
│   │   ├── ContextPacker.ts        # Context packer
│   │   ├── fts.ts                  # Full-text search
│   │   ├── config.ts               # Search configuration
│   │   ├── types.ts                # Type definition
│   │   └── resolvers/              # Multi-language Import parser
│   │       ├── JsTsResolver.ts
│   │       ├── PythonResolver.ts
│   │       ├── GoResolver.ts
│   │       ├── JavaResolver.ts
│   │       └── RustResolver.ts
│   ├── mcp/                        # MCP server
│   │   ├── server.ts               # MCP server implementation
│   │   ├── main.ts                 # MCP entry
│   │   └── tools/
│   │       └── codebaseRetrieval.ts  # Code retrieval tool
│   └── utils/                      # Utility functions
│       └── logger.ts               # Log system
├── package.json
└── tsconfig.json
```
## ⚙️ Configuration Details
### Environment Variables
| Variable Name | Required | Default Value | Description |
|--------|------|--------|------|
| `EMBEDDINGS_API_KEY` | ✅ | - | Embedding API key |
| `EMBEDDINGS_BASE_URL` | ✅ | - | Embedding API endpoint URL |
| `EMBEDDINGS_MODEL` | ✅ | - | Embedding model name |
| `EMBEDDINGS_MAX_CONCURRENCY` | ❌ | 10 | Maximum number of concurrent embedding requests |
| `EMBEDDINGS_DIMENSIONS` | ❌ | 1024 | Vector dimension |
| `RERANK_API_KEY` | ✅ | - | Reranker API key |
| `RERANK_BASE_URL` | ✅ | - | Reranker API endpoint URL |
| `RERANK_MODEL` | ✅ | - | Reranker model name |
| `RERANK_TOP_N` | ❌ | 20 | Number of results returned by the reranker |
| `IGNORE_PATTERNS` | ❌ | - | Additional ignore patterns |
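To show how the required/optional split and the defaults in this table fit together, here is a minimal sketch of reading the embedding settings. The `EmbeddingConfig` shape and `readEmbeddingConfig` function are hypothetical and are not taken from `src/config.ts`; the environment is passed in as a plain object so the example has no runtime dependencies.

```typescript
// Hypothetical config shape for the EMBEDDINGS_* variables.
interface EmbeddingConfig {
  apiKey: string;
  baseUrl: string;
  model: string;
  maxConcurrency: number;
  dimensions: number;
}

function readEmbeddingConfig(env: Record<string, string | undefined>): EmbeddingConfig {
  // Required variables: fail early with a clear message.
  const required = (key: string): string => {
    const value = env[key];
    if (!value) throw new Error(`Missing required environment variable: ${key}`);
    return value;
  };
  // Optional integer variables: fall back to the documented default.
  const intOr = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw === "") return fallback;
    const parsed = Number.parseInt(raw, 10);
    if (Number.isNaN(parsed)) throw new Error(`${key} must be an integer, got "${raw}"`);
    return parsed;
  };
  return {
    apiKey: required("EMBEDDINGS_API_KEY"),
    baseUrl: required("EMBEDDINGS_BASE_URL"),
    model: required("EMBEDDINGS_MODEL"),
    maxConcurrency: intOr("EMBEDDINGS_MAX_CONCURRENCY", 10),
    dimensions: intOr("EMBEDDINGS_DIMENSIONS", 1024),
  };
}

const embeddingConfig = readEmbeddingConfig({
  EMBEDDINGS_API_KEY: "your-api-key-here",
  EMBEDDINGS_BASE_URL: "https://api.siliconflow.cn/v1/embeddings",
  EMBEDDINGS_MODEL: "BAAI/bge-m3",
});
console.log(embeddingConfig.maxConcurrency, embeddingConfig.dimensions); // prints 10 1024
```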
### Search Configuration Parameters
```typescript
interface SearchConfig {
  // === Recall Stage ===
  vectorTopK: number;            // Number of chunks recalled by vector search (default 30)
  vectorTopM: number;            // Number of vector results sent to fusion (default 30)
  ftsTopKFiles: number;          // Number of files recalled by FTS (default 15)
  lexChunksPerFile: number;      // Number of lexical chunks per file (default 3)
  lexTotalChunks: number;        // Total number of lexical chunks (default 30)
  // === Fusion Stage ===
  rrfK0: number;                 // RRF smoothing constant (default 60)
  wVec: number;                  // Vector weight (default 1.0)
  wLex: number;                  // Lexical weight (default 0.5)
  fusedTopM: number;             // Number of fused results sent to rerank (default 40)
  // === Rerank ===
  rerankTopN: number;            // Number of results retained after rerank (default 10)
  maxRerankChars: number;        // Maximum number of characters per rerank text (default 1200)
  // === Expansion Strategy ===
  neighborHops: number;          // E1 neighbor hops (default 2)
  breadcrumbExpandLimit: number; // E2 breadcrumb completion limit (default 3)
  importFilesPerSeed: number;    // E3 number of imported files per seed (default 0)
  chunksPerImportFile: number;   // E3 chunks per imported file (default 0)
  // === Smart TopK ===
  enableSmartTopK: boolean;      // Enable smart truncation (default true)
  smartTopScoreRatio: number;    // Dynamic threshold ratio (default 0.5)
  smartMinScore: number;         // Absolute lower limit (default 0.25)
  smartMinK: number;             // Safe Harbor size (default 2)
  smartMaxK: number;             // Hard upper limit (default 15)
}
```
## 🌍 Multi-Language Support
ContextWeaver natively supports AST parsing of the following programming languages through Tree-sitter:
| Language | AST Parsing | Import Parsing | File Extension |
|------|----------|-------------|-----------|
| TypeScript | ✅ | ✅ | `.ts`, `.tsx` |
| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs` |
| Python | ✅ | ✅ | `.py` |
| Go | ✅ | ✅ | `.go` |
| Java | ✅ | ✅ | `.java` |
| Rust | ✅ | ✅ | `.rs` |
Other languages fall back to a line-based sharding strategy, so they can still be indexed and searched normally.
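A line-based fallback can be as simple as the sketch below, which cuts a file into fixed-size windows of lines. The function `splitByLines`, the `LineChunk` shape, and the `maxLines` parameter are assumptions for illustration; the real fallback may use a different window size or overlapping windows.

```typescript
// Hypothetical chunk shape; line numbers are 1-based and inclusive.
interface LineChunk {
  startLine: number;
  endLine: number;
  text: string;
}

function splitByLines(source: string, maxLines: number): LineChunk[] {
  if (maxLines < 1) throw new Error("maxLines must be at least 1");
  const lines = source.split("\n");
  const chunks: LineChunk[] = [];
  for (let start = 0; start < lines.length; start += maxLines) {
    const slice = lines.slice(start, start + maxLines);
    chunks.push({
      startLine: start + 1,
      endLine: start + slice.length,
      text: slice.join("\n"),
    });
  }
  return chunks;
}

// Five lines with a window of two lines give three chunks: 1-2, 3-4, 5-5.
console.log(splitByLines("a\nb\nc\nd\ne", 2).map((c) => `${c.startLine}-${c.endLine}`));
```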
## 🔄 Workflow
### Indexing Process
```
1. Crawler → Traverse the file system, filtering out ignored items
2. Processor → Read file content, calculate hash
3. Splitter → AST parsing, semantic sharding
4. Indexer → Batch embedding, write to the vector store
5. FTS Index → Update full-text search index
```
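Because the Processor computes a hash per file, an incremental run only needs to compare the new hashes with the stored ones. The sketch below shows one possible way to derive the work list; `planIncrementalIndex`, `IndexPlan`, and the use of maps keyed by file path are hypothetical and do not describe the actual storage schema.

```typescript
// Maps a file path to its content hash (hash algorithm is left open here).
type HashByPath = Map<string, string>;

interface IndexPlan {
  toIndex: string[];   // new or modified files that must be (re)embedded
  toDelete: string[];  // files that disappeared since the last run
  unchanged: string[]; // files that can be skipped entirely
}

function planIncrementalIndex(previous: HashByPath, current: HashByPath): IndexPlan {
  const plan: IndexPlan = { toIndex: [], toDelete: [], unchanged: [] };
  current.forEach((hash, path) => {
    if (previous.get(path) === hash) {
      plan.unchanged.push(path);
    } else {
      plan.toIndex.push(path); // new file or changed content
    }
  });
  previous.forEach((_hash, path) => {
    if (!current.has(path)) plan.toDelete.push(path);
  });
  return plan;
}

const plan = planIncrementalIndex(
  new Map([["a.ts", "h1"], ["b.ts", "h2"], ["old.ts", "h3"]]),
  new Map([["a.ts", "h1"], ["b.ts", "h2-changed"], ["new.ts", "h4"]]),
);
console.log(plan);
```

In this example only `b.ts` and `new.ts` would be re-embedded, `old.ts` would be removed from the index, and `a.ts` would be skipped.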
### Search Process
```
1. Query Parse → Parse query, separate semantics and terminology
2. Hybrid Recall → Vector + lexical dual-channel recall
3. RRF Fusion → Reciprocal Rank Fusion
4. Rerank → Cross-encoder reranking
5. Smart Cutoff → Smart score truncation
6. Graph Expand → Neighbor/breadcrumb/import expansion
7. Context Pack → Segment merging, token budget
8. Format Output → Format and return to LLM
```
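The token budget part of the Context Pack step could look roughly like the sketch below: pick segments greedily by score until the budget is used up, then return them in reading order. Everything here is an assumption for illustration, including the `Segment` shape, the `packWithinBudget` function, and the rough estimate of four characters per token; merging of adjacent segments is not shown.

```typescript
// Hypothetical segment shape.
interface Segment {
  filePath: string;
  startLine: number;
  endLine: number;
  score: number;
  text: string;
}

// Rough estimate (assumption): about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function packWithinBudget(segments: Segment[], maxTokens: number): Segment[] {
  // Consider the most relevant segments first.
  const byScore = segments.slice().sort((a, b) => b.score - a.score);
  const picked: Segment[] = [];
  let used = 0;
  for (const segment of byScore) {
    const cost = estimateTokens(segment.text);
    // Skip segments that do not fit; a smaller one later may still fit.
    if (used + cost > maxTokens) continue;
    picked.push(segment);
    used += cost;
  }
  // Present the result in reading order: by file, then by line.
  return picked.sort(
    (a, b) => a.filePath.localeCompare(b.filePath) || a.startLine - b.startLine,
  );
}

const packed = packWithinBudget(
  [
    { filePath: "b.ts", startLine: 1, endLine: 9, score: 0.9, text: "x".repeat(40) },
    { filePath: "b.ts", startLine: 20, endLine: 29, score: 0.8, text: "y".repeat(40) },
    { filePath: "a.ts", startLine: 5, endLine: 6, score: 0.7, text: "z".repeat(8) },
  ],
  12,
);
console.log(packed.map((s) => `${s.filePath}:${s.startLine}`)); // prints [ 'a.ts:5', 'b.ts:1' ]
```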
## 📊 Performance Characteristics
- **Incremental Indexing**: Only changed files are processed, making re-indexing 10x+ faster
- **Batch Embedding**: Adaptive batch size with concurrency control
- **Rate Limit Recovery**: Automatic backoff and gradual recovery after 429 errors
- **Parser Pool Reuse**: Tree-sitter parsers are pooled and reused
- **File Index Cache**: GraphExpander lazily loads its file path index
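The backoff and gradual recovery behavior can be pictured as a small controller that lowers the concurrency limit on a 429 response and raises it again on success. The class below is a hypothetical sketch: halving on a rate limit and adding one slot per success are assumed policies, not documented behavior.

```typescript
// Hypothetical controller for the number of concurrent embedding requests.
class AdaptiveConcurrency {
  private readonly max: number;
  private readonly min: number;
  private current: number;

  constructor(max: number, min: number = 1) {
    this.max = max;
    this.min = min;
    this.current = max;
  }

  get limit(): number {
    return this.current;
  }

  // Back off: halve the limit when the API answers with HTTP 429.
  onRateLimited(): void {
    this.current = Math.max(this.min, Math.floor(this.current / 2));
  }

  // Recover gradually: add one slot per successful request, up to the maximum.
  onSuccess(): void {
    this.current = Math.min(this.max, this.current + 1);
  }
}

const controller = new AdaptiveConcurrency(10); // matches EMBEDDINGS_MAX_CONCURRENCY=10
controller.onRateLimited();
console.log(controller.limit); // prints 5
controller.onSuccess();
console.log(controller.limit); // prints 6
```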
## 🐛 Logs and Debugging
Log file location: `~/.contextweaver/logs/app.YYYY-MM-DD.log`
Set log level:
```bash
# Enable debug logs
LOG_LEVEL=debug contextweaver search --information-request "..."
```
## 📄 Open Source License
This project is licensed under the MIT License.
## 🙏 Acknowledgements
- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) - High-performance grammar parsing
- [LanceDB](https://lancedb.com/) - Embedded vector database
- [MCP](https://modelcontextprotocol.io/) - Model Context Protocol
- [SiliconFlow](https://siliconflow.cn/) - Recommended Embedding/Reranker API service
---
<p align="center">
<sub>Made with ❤️ for AI-assisted coding</sub>
</p>