<p align="center">
<strong>🧵 Code Library Context Engine for AI Agents</strong>
</p>
<p align="center">
<em>Semantic Code Retrieval for AI Agents — Hybrid Search • Graph Expansion • Token-Aware Packing</em>
</p>
---
**ContextWeaver** is a semantic search engine designed for AI code assistants. It combines hybrid search (vector + lexical), intelligent context expansion, and token-aware packing to give LLMs accurate, relevant, and contextually complete code snippets.
<p align="center">
<img src="docs/architecture.png" alt="ContextWeaver Architecture Overview" width="800" />
</p>
## ✨ Core Features
### 🔍 Hybrid Search Engine
- **Vector Retrieval**: Finds code by semantic similarity to the query
- **Lexical Retrieval (FTS)**: Exact matching of function names, class names, and other technical terms
- **RRF Fusion (Reciprocal Rank Fusion)**: Merges the ranked results of both retrieval paths into a single list
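The fusion step can be illustrated with a minimal sketch of weighted Reciprocal Rank Fusion. It assumes each retrieval path contributes `weight / (k0 + rank)` per result, which matches the `rrfK0`, `wVec`, and `wLex` settings listed under Search Configuration Parameters; the `rrfFuse` helper and its types are hypothetical and not taken from the ContextWeaver source.

```typescript
// Hypothetical sketch of weighted Reciprocal Rank Fusion (not the project's actual code).
interface RankedList {
  ids: string[];  // chunk ids ordered best-first
  weight: number; // e.g. 1.0 for vector results, 0.5 for lexical results
}

function rrfFuse(lists: RankedList[], k0 = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { ids, weight } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k0 + rank));
    });
  }
  // Highest fused score first
  return Array.from(scores, ([id, score]) => ({ id, score })).sort((a, b) => b.score - a.score);
}

const fused = rrfFuse([
  { ids: ["a", "b", "c"], weight: 1.0 }, // vector path
  { ids: ["c", "a"], weight: 0.5 },      // lexical path
]);
console.log(fused.map((r) => r.id)); // ["a", "c", "b"]
```

Because only ranks are used, the two paths do not need comparable score scales, which is the usual reason for choosing RRF over score averaging.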
### 🧠 AST Semantic Chunking
- **Tree-sitter Parsing**: Supports TypeScript, JavaScript, Python, Go, Java, and Rust
- **Dual-Text Strategy**: `displayCode` is what gets shown, `vectorText` is what gets embedded
- **Gap-Aware Merging**: Handles the gaps between code fragments so that merged chunks stay semantically intact
- **Breadcrumb Injection**: The vector text includes the chunk's hierarchical path, which improves retrieval recall
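As an illustration of the dual-text strategy with breadcrumb injection, the sketch below keeps the raw code in `displayCode` and prepends a hierarchical path to `vectorText`. Only those two field names come from this README; the `Chunk` shape, the `buildChunk` helper, and the `>`-separated breadcrumb format are assumptions.

```typescript
// Hypothetical chunk shape; only displayCode and vectorText are named in the README.
interface Chunk {
  filePath: string;
  breadcrumb: string[]; // e.g. ["class UserService", "method login"]
  displayCode: string;  // shown to the LLM unchanged
  vectorText: string;   // sent to the embedding model
}

function buildChunk(filePath: string, breadcrumb: string[], code: string): Chunk {
  // Assumed breadcrumb format: file path and enclosing scopes joined by " > "
  const header = [filePath, ...breadcrumb].join(" > ");
  return {
    filePath,
    breadcrumb,
    displayCode: code,
    vectorText: `// ${header}\n${code}`,
  };
}

const chunk = buildChunk("src/auth.ts", ["class UserService", "method login"], "login() {}");
console.log(chunk.vectorText);
```

The embedding then carries the surrounding class and file names even when the chunk body itself never mentions them.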
### 📊 Three-Stage Context Expansion
- **E1 Neighbor Expansion**: Adds adjacent chunks from the same file to keep code blocks complete
- **E2 Breadcrumb Completion**: Adds other methods under the same class or function to expose the overall structure
- **E3 Import Parsing**: Follows imports to track cross-file dependencies (configurable)
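The E1 stage can be sketched as a window around each seed chunk. This is a minimal illustration that assumes chunks in a file are numbered by position and that `hops` plays the role of the `neighborHops` setting; `expandNeighbors` is a hypothetical helper, and E2 and E3 are not covered.

```typescript
// Hypothetical sketch of E1 neighbor expansion within a single file.
function expandNeighbors(seedIndices: number[], chunkCount: number, hops = 2): number[] {
  const selected = new Set<number>();
  for (const seed of seedIndices) {
    const start = Math.max(0, seed - hops);            // clamp at the first chunk
    const end = Math.min(chunkCount - 1, seed + hops); // clamp at the last chunk
    for (let i = start; i <= end; i++) selected.add(i);
  }
  // Return chunk positions in file order, without duplicates
  return Array.from(selected).sort((a, b) => a - b);
}

console.log(expandNeighbors([0, 5], 7, 1)); // [0, 1, 4, 5, 6]
```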
### 🎯 Smart Truncation Strategy (Smart TopK)
- **Anchor & Floor**: A dynamic threshold relative to the top score, backed by an absolute minimum score
- **Delta Guard**: Prevents an outlier Top1 score from distorting the cutoff
- **Safe Harbor**: The first N results are checked only against the minimum score, guaranteeing baseline recall
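One possible reading of Anchor & Floor and Safe Harbor is sketched below. It assumes the dynamic threshold is `topScore * ratio`, the floor is an absolute minimum score, and the first `minK` results only need to pass the floor; these rules are an interpretation of the bullet points, not the project's confirmed logic. Delta Guard is left out because this README does not define it precisely. The option names are hypothetical and loosely mirror the `smart*` settings.

```typescript
// Hypothetical sketch of score-based truncation; scores must be sorted descending.
interface SmartTopKOptions {
  topScoreRatio: number; // dynamic threshold as a fraction of the top score
  minScore: number;      // absolute floor
  minK: number;          // results that only need to pass the floor
  maxK: number;          // hard upper limit
}

function smartTopK(scores: number[], opts: SmartTopKOptions): number[] {
  if (scores.length === 0) return [];
  const threshold = Math.max(scores[0] * opts.topScoreRatio, opts.minScore);
  const kept: number[] = [];
  for (let i = 0; i < scores.length && kept.length < opts.maxK; i++) {
    // Safe Harbor: early results are checked against the floor only
    const limit = i < opts.minK ? opts.minScore : threshold;
    if (scores[i] < limit) break;
    kept.push(scores[i]);
  }
  return kept;
}

const options = { topScoreRatio: 0.5, minScore: 0.25, minK: 2, maxK: 15 };
console.log(smartTopK([0.9, 0.3, 0.28, 0.1], options)); // [0.9, 0.3]
```

In the example the threshold is 0.45, so the second result survives only because it falls inside the Safe Harbor window.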
### 🔌 MCP Native Support
- **MCP Server Mode**: Start a Model Context Protocol server with a single command
- **Intent and Terminology Separation**: LLM-friendly API design
- **Automatic Indexing**: The first query triggers indexing automatically, and later incremental updates happen transparently
## 📦 Quick Start
### Environment Requirements
- Node.js >= 20
- pnpm (recommended) or npm
### Installation
```bash
# Global installation
npm install -g @hsingjui/contextweaver
# or use pnpm
pnpm add -g @hsingjui/contextweaver
```
### Initialize Configuration
```bash
# Initialize the configuration file (creates ~/.contextweaver/.env)
contextweaver init
# or use the short alias
cw init
```
Edit `~/.contextweaver/.env` and fill in your API keys:
```bash
# Embedding API configuration (required)
EMBEDDINGS_API_KEY=your-api-key-here
EMBEDDINGS_BASE_URL=https://api.siliconflow.cn/v1/embeddings
EMBEDDINGS_MODEL=BAAI/bge-m3
EMBEDDINGS_MAX_CONCURRENCY=10
EMBEDDINGS_DIMENSIONS=1024
# Reranker configuration (required)
RERANK_API_KEY=your-api-key-here
RERANK_BASE_URL=https://api.siliconflow.cn/v1/rerank
RERANK_MODEL=BAAI/bge-reranker-v2-m3
RERANK_TOP_N=20
# Ignore patterns (optional, comma-separated)
# IGNORE_PATTERNS=.venv,node_modules
```
### Index Codebase
```bash
# Execute in the codebase root directory
contextweaver index
# Specify path
contextweaver index /path/to/your/project
# Force re-indexing
contextweaver index --force
```
### Local Search
```bash
# Semantic search
cw search --information-request "How is the user authentication process implemented?"
# With precise technical terms
cw search --information-request "Database connection logic" --technical-terms "DatabasePool,Connection"
```
### Start MCP Server
```bash
# Start MCP server (for AI assistants like Claude)
contextweaver mcp
```
## 🔧 MCP Integration Configuration
### Claude Desktop Configuration
Add the following to Claude Desktop's configuration file:
```json
{
  "mcpServers": {
    "contextweaver": {
      "command": "contextweaver",
      "args": ["mcp"]
    }
  }
}
```
### MCP Tool Description
ContextWeaver provides one core MCP tool: `codebase-retrieval`.
#### Parameter Description
| Parameter | Type | Required | Description |
|------|------|------|------|
| `repo_path` | string | ✅ | Absolute path of the codebase root directory |
| `information_request` | string | ✅ | Semantic intent description in natural language |
| `technical_terms` | string[] | ❌ | Precise technical terms (class names, function names, etc.) |
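The parameter table can be expressed as a type plus a small validator. This is a sketch of what a caller might check before invoking the tool; the `CodebaseRetrievalArgs` and `parseRetrievalArgs` names are hypothetical, and the absolute-path check is a simple heuristic rather than the server's actual validation.

```typescript
// Hypothetical argument type mirroring the parameter table above.
interface CodebaseRetrievalArgs {
  repo_path: string;
  information_request: string;
  technical_terms?: string[];
}

function parseRetrievalArgs(input: Record<string, unknown>): CodebaseRetrievalArgs {
  const { repo_path, information_request, technical_terms } = input;
  // Heuristic: POSIX absolute path or Windows drive path
  if (typeof repo_path !== "string" || !/^(\/|[A-Za-z]:[\\/])/.test(repo_path)) {
    throw new Error("repo_path must be an absolute path");
  }
  if (typeof information_request !== "string" || information_request.trim() === "") {
    throw new Error("information_request must be a non-empty string");
  }
  const args: CodebaseRetrievalArgs = { repo_path, information_request };
  if (technical_terms !== undefined) {
    if (!Array.isArray(technical_terms) || !technical_terms.every((t) => typeof t === "string")) {
      throw new Error("technical_terms must be an array of strings");
    }
    args.technical_terms = technical_terms as string[];
  }
  return args;
}

const args = parseRetrievalArgs({
  repo_path: "/path/to/your/project",
  information_request: "How is the user authentication process implemented?",
  technical_terms: ["DatabasePool", "Connection"],
});
console.log(args);
```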
#### Design Philosophy
- **Intent and Terminology Separation**: `information_request` describes "what to do", while `technical_terms` filters by "what it's called"
- **Same-file Context Priority**: By default the tool returns context from the same file, and cross-file exploration is left to the Agent
- **Return to Agent Instinct**: The tool is only responsible for locating code, and the Agent decides when to explore further
## 🏗️ Architecture Design
```mermaid
flowchart TB
    subgraph Interface["CLI / MCP Interface"]
        CLI[contextweaver CLI]
        MCP[MCP Server]
    end
    subgraph Search["SearchService"]
        VR[Vector Retrieval]
        LR[Lexical Retrieval]
        RRF[RRF Fusion + Rerank]
        VR --> RRF
        LR --> RRF
    end
    subgraph Expand["Context Expansion"]
        GE[GraphExpander]
        CP[ContextPacker]
        GE --> CP
    end
    subgraph Storage["Storage Layer"]
        VS[(VectorStore<br/>LanceDB)]
        DB[(SQLite<br/>FTS5)]
    end
    subgraph Index["Indexing Pipeline"]
        CR[Crawler<br/>fdir] --> SS[SemanticSplitter<br/>Tree-sitter] --> IX[Indexer<br/>Batch Embedding]
    end
    Interface --> Search
    RRF --> GE
    Search <--> Storage
    Expand <--> Storage
    Index --> Storage
```
### Core Module Description
| Module | Responsibility |
|------|------|
| **SearchService** | Hybrid search core that coordinates vector and lexical recall, RRF fusion, and reranking |
| **GraphExpander** | Context expander that runs the three-stage E1/E2/E3 expansion strategy |
| **ContextPacker** | Context packer that merges segments and enforces the token budget |
| **VectorStore** | LanceDB adapter layer that manages inserts, deletes, and queries on the vector index |
| **SQLite (FTS5)** | Metadata storage and full-text search index |
| **SemanticSplitter** | AST semantic splitter built on Tree-sitter parsing |
## 📁 Project Structure
```
contextweaver/
├── src/
│   ├── index.ts                     # CLI entry
│   ├── config.ts                    # Configuration management (environment variables)
│   ├── api/                         # External API wrappers
│   │   ├── embed.ts                 # Embedding API
│   │   └── rerank.ts                # Reranker API
│   ├── chunking/                    # Semantic chunking
│   │   ├── SemanticSplitter.ts      # AST semantic splitter
│   │   ├── SourceAdapter.ts         # Source code adapter
│   │   ├── LanguageSpec.ts          # Language specification definition
│   │   └── ParserPool.ts            # Tree-sitter parser pool
│   ├── scanner/                     # File scanning
│   │   ├── crawler.ts               # File system traversal
│   │   ├── processor.ts             # File processing
│   │   └── filter.ts                # Filtering rules
│   ├── indexer/                     # Indexer
│   │   └── index.ts                 # Batch indexing logic
│   ├── vectorStore/                 # Vector storage
│   │   └── index.ts                 # LanceDB adapter layer
│   ├── db/                          # Database
│   │   └── index.ts                 # SQLite + FTS5
│   ├── search/                      # Search service
│   │   ├── SearchService.ts         # Core search service
│   │   ├── GraphExpander.ts         # Context expander
│   │   ├── ContextPacker.ts         # Context packer
│   │   ├── fts.ts                   # Full-text search
│   │   ├── config.ts                # Search configuration
│   │   ├── types.ts                 # Type definitions
│   │   └── resolvers/               # Multi-language import resolvers
│   │       ├── JsTsResolver.ts
│   │       ├── PythonResolver.ts
│   │       ├── GoResolver.ts
│   │       ├── JavaResolver.ts
│   │       └── RustResolver.ts
│   ├── mcp/                         # MCP server
│   │   ├── server.ts                # MCP server implementation
│   │   ├── main.ts                  # MCP entry
│   │   └── tools/
│   │       └── codebaseRetrieval.ts # Code retrieval tool
│   └── utils/                       # Utility functions
│       └── logger.ts                # Logging system
├── package.json
└── tsconfig.json
```
## ⚙️ Configuration Details
### Environment Variables
| Variable Name | Required | Default Value | Description |
|--------|------|--------|------|
| `EMBEDDINGS_API_KEY` | ✅ | - | Embedding API key |
| `EMBEDDINGS_BASE_URL` | ✅ | - | Embedding API endpoint URL |
| `EMBEDDINGS_MODEL` | ✅ | - | Embedding model name |
| `EMBEDDINGS_MAX_CONCURRENCY` | ❌ | 10 | Maximum number of concurrent embedding requests |
| `EMBEDDINGS_DIMENSIONS` | ❌ | 1024 | Vector dimensions |
| `RERANK_API_KEY` | ✅ | - | Reranker API key |
| `RERANK_BASE_URL` | ✅ | - | Reranker API endpoint URL |
| `RERANK_MODEL` | ✅ | - | Reranker model name |
| `RERANK_TOP_N` | ❌ | 20 | Number of results returned by the reranker |
| `IGNORE_PATTERNS` | ❌ | - | Additional ignore patterns (comma-separated) |
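The table above can be read as a validation rule: six variables must be present and three fall back to defaults. The sketch below shows one way to apply it to an environment object; the `loadSettings` helper and the `Settings` shape are hypothetical and do not reflect how `src/config.ts` is actually written.

```typescript
// Hypothetical settings loader based on the environment variable table.
type Env = Record<string, string | undefined>;

const REQUIRED_VARIABLES = [
  "EMBEDDINGS_API_KEY", "EMBEDDINGS_BASE_URL", "EMBEDDINGS_MODEL",
  "RERANK_API_KEY", "RERANK_BASE_URL", "RERANK_MODEL",
];

interface Settings {
  embeddingsMaxConcurrency: number;
  embeddingsDimensions: number;
  rerankTopN: number;
  ignorePatterns: string[];
}

function loadSettings(env: Env): Settings {
  const missing = REQUIRED_VARIABLES.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required variables: ${missing.join(", ")}`);
  }
  // Parse an integer, falling back to the documented default when unset or invalid
  const toInt = (value: string | undefined, fallback: number): number => {
    const parsed = parseInt(value ?? "", 10);
    return isNaN(parsed) ? fallback : parsed;
  };
  return {
    embeddingsMaxConcurrency: toInt(env.EMBEDDINGS_MAX_CONCURRENCY, 10),
    embeddingsDimensions: toInt(env.EMBEDDINGS_DIMENSIONS, 1024),
    rerankTopN: toInt(env.RERANK_TOP_N, 20),
    ignorePatterns: (env.IGNORE_PATTERNS ?? "")
      .split(",")
      .map((pattern) => pattern.trim())
      .filter((pattern) => pattern.length > 0),
  };
}

// In Node.js this would typically be called as loadSettings(process.env).
```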
### Search Configuration Parameters
```typescript
interface SearchConfig {
  // === Recall stage ===
  vectorTopK: number;            // Vector recall quantity (default 30)
  vectorTopM: number;            // Vector results passed to fusion (default 30)
  ftsTopKFiles: number;          // FTS recall file quantity (default 15)
  lexChunksPerFile: number;      // Lexical chunks per file (default 3)
  lexTotalChunks: number;        // Lexical total chunks (default 30)
  // === Fusion stage ===
  rrfK0: number;                 // RRF smoothing constant (default 60)
  wVec: number;                  // Vector weight (default 1.0)
  wLex: number;                  // Lexical weight (default 0.5)
  fusedTopM: number;             // Fused results passed to rerank (default 40)
  // === Rerank ===
  rerankTopN: number;            // Results kept after rerank (default 10)
  maxRerankChars: number;        // Maximum characters of text per rerank item (default 1200)
  // === Expansion strategy ===
  neighborHops: number;          // E1 neighbor hops (default 2)
  breadcrumbExpandLimit: number; // E2 breadcrumb completion quantity (default 3)
  importFilesPerSeed: number;    // E3 import files per seed (default 0)
  chunksPerImportFile: number;   // E3 chunks per import file (default 0)
  // === Smart TopK ===
  enableSmartTopK: boolean;      // Enable smart truncation (default true)
  smartTopScoreRatio: number;    // Dynamic threshold ratio (default 0.5)
  smartMinScore: number;         // Absolute lower limit (default 0.25)
  smartMinK: number;             // Safe Harbor quantity (default 2)
  smartMaxK: number;             // Hard upper limit (default 15)
}
```
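The documented defaults can be collected into one object and merged with overrides, as in the sketch below. The values are copied from the comments in the interface above; the `DEFAULT_SEARCH_CONFIG` constant and `resolveSearchConfig` helper are hypothetical, and the README does not say how overrides are actually supplied. Since both E3 settings default to 0, cross-file import expansion appears to be off unless they are raised.

```typescript
// Hypothetical defaults object built from the documented default values.
const DEFAULT_SEARCH_CONFIG = {
  vectorTopK: 30,
  vectorTopM: 30,
  ftsTopKFiles: 15,
  lexChunksPerFile: 3,
  lexTotalChunks: 30,
  rrfK0: 60,
  wVec: 1.0,
  wLex: 0.5,
  fusedTopM: 40,
  rerankTopN: 10,
  maxRerankChars: 1200,
  neighborHops: 2,
  breadcrumbExpandLimit: 3,
  importFilesPerSeed: 0,
  chunksPerImportFile: 0,
  enableSmartTopK: true,
  smartTopScoreRatio: 0.5,
  smartMinScore: 0.25,
  smartMinK: 2,
  smartMaxK: 15,
};

type ResolvedSearchConfig = typeof DEFAULT_SEARCH_CONFIG;

// Later properties win, so overrides replace defaults field by field.
function resolveSearchConfig(overrides: Partial<ResolvedSearchConfig> = {}): ResolvedSearchConfig {
  return { ...DEFAULT_SEARCH_CONFIG, ...overrides };
}

// Example: turn on E3 import expansion (the values here are illustrative).
const withImports = resolveSearchConfig({ importFilesPerSeed: 2, chunksPerImportFile: 2 });
console.log(withImports.importFilesPerSeed, withImports.rrfK0); // 2 60
```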
## 🌍 Multi-Language Support
ContextWeaver supports AST parsing for the following programming languages through Tree-sitter:
| Language | AST Parsing | Import Parsing | File Extensions |
|------|----------|-------------|-----------|
| TypeScript | ✅ | ✅ | `.ts`, `.tsx` |
| JavaScript | ✅ | ✅ | `.js`, `.jsx`, `.mjs` |
| Python | ✅ | ✅ | `.py` |
| Go | ✅ | ✅ | `.go` |
| Java | ✅ | ✅ | `.java` |
| Rust | ✅ | ✅ | `.rs` |
Other languages fall back to a line-based chunking strategy, so they can still be indexed and searched.
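A minimal sketch of this dispatch is shown below: the extension picks a Tree-sitter language, and anything unrecognized goes through a line-based splitter. The extension map follows the table above, while the helper names and the 40-line chunk size are assumptions.

```typescript
// Extension-to-language map taken from the table above.
const EXTENSION_LANGUAGE: Record<string, string> = {
  ".ts": "typescript", ".tsx": "typescript",
  ".js": "javascript", ".jsx": "javascript", ".mjs": "javascript",
  ".py": "python", ".go": "go", ".java": "java", ".rs": "rust",
};

// Returns the AST-capable language for a file, or null when the fallback applies.
function detectLanguage(filePath: string): string | null {
  const dot = filePath.lastIndexOf(".");
  if (dot === -1) return null;
  return EXTENSION_LANGUAGE[filePath.slice(dot).toLowerCase()] ?? null;
}

// Hypothetical fallback: fixed-size line windows (the chunk size is an assumption).
function splitByLines(source: string, linesPerChunk = 40): string[] {
  const lines = source.split("\n");
  const chunks: string[] = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    chunks.push(lines.slice(i, i + linesPerChunk).join("\n"));
  }
  return chunks;
}

console.log(detectLanguage("src/app.tsx")); // "typescript"
console.log(detectLanguage("README.md"));   // null
```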
## 🔄 Workflow
### Indexing Process
```
1. Crawler → Traverse the file system, filter out ignored paths
2. Processor → Read file content, calculate hash
3. Splitter → AST parsing, semantic chunking
4. Indexer → Batch embedding, write to the vector database
5. FTS Index → Update the full-text search index
```
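The hash calculated in step 2 is what makes incremental indexing possible. The sketch below compares stored hashes with freshly computed ones to decide which files need work; it assumes hashes are kept per file path, and the `planIncrementalIndex` helper is hypothetical. Computing the hashes themselves is out of scope here.

```typescript
// Hypothetical plan for an incremental indexing run.
interface IndexPlan {
  toIndex: string[];   // new or changed files
  toRemove: string[];  // files that no longer exist
  unchanged: string[]; // files that can be skipped
}

// Both maps go from file path to content hash.
function planIncrementalIndex(stored: Map<string, string>, current: Map<string, string>): IndexPlan {
  const plan: IndexPlan = { toIndex: [], toRemove: [], unchanged: [] };
  current.forEach((hash, path) => {
    if (stored.get(path) === hash) plan.unchanged.push(path);
    else plan.toIndex.push(path);
  });
  stored.forEach((_hash, path) => {
    if (!current.has(path)) plan.toRemove.push(path);
  });
  return plan;
}

const plan = planIncrementalIndex(
  new Map([["a.ts", "h1"], ["b.ts", "h2"], ["old.ts", "h3"]]),
  new Map([["a.ts", "h1"], ["b.ts", "h2-changed"], ["new.ts", "h4"]]),
);
console.log(plan);
// { toIndex: ["b.ts", "new.ts"], toRemove: ["old.ts"], unchanged: ["a.ts"] }
```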
### Search Process
```
1. Query Parse → Parse the query, separate semantic intent from terms
2. Hybrid Recall → Vector + lexical dual-path recall
3. RRF Fusion → Reciprocal Rank Fusion of both result lists
4. Rerank → Cross-encoder reranking
5. Smart Cutoff → Score-based truncation (Smart TopK)
6. Graph Expand → Neighbor/breadcrumb/import expansion
7. Context Pack → Segment merging, token budget control
8. Format Output → Format the results for the LLM
```
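The Context Pack step can be pictured as two operations: merging overlapping or adjacent line ranges from the same file, then keeping the best segments that fit a token budget. The sketch below is an illustration under those assumptions; the `Span` type, the greedy selection, and the caller-supplied token estimator are hypothetical rather than ContextPacker's actual behavior.

```typescript
// Hypothetical segment: a line range in a file with a relevance score.
interface Span {
  filePath: string;
  startLine: number;
  endLine: number;
  score: number;
}

// Merge spans of the same file that overlap or touch.
function mergeSpans(spans: Span[]): Span[] {
  const sorted = spans
    .slice()
    .sort((a, b) => a.filePath.localeCompare(b.filePath) || a.startLine - b.startLine);
  const merged: Span[] = [];
  for (const span of sorted) {
    const last = merged[merged.length - 1];
    if (last && last.filePath === span.filePath && span.startLine <= last.endLine + 1) {
      last.endLine = Math.max(last.endLine, span.endLine);
      last.score = Math.max(last.score, span.score);
    } else {
      merged.push({ ...span });
    }
  }
  return merged;
}

// Greedily keep the highest-scoring spans whose estimated cost fits the budget.
function packWithinBudget(spans: Span[], maxTokens: number, estimateTokens: (span: Span) => number): Span[] {
  const packed: Span[] = [];
  let used = 0;
  for (const span of spans.slice().sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(span);
    if (used + cost > maxTokens) continue; // skip spans that would overflow
    packed.push(span);
    used += cost;
  }
  return packed;
}
```

A caller would pass its own estimator, for example a rough tokens-per-line figure or a real tokenizer count.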
## 📊 Performance Characteristics
- **Incremental Indexing**: Only changed files are processed, making subsequent indexing runs 10x+ faster
- **Batch Embedding**: Adaptive batch size with concurrency control
- **Rate Limit Recovery**: Automatic backoff on HTTP 429 errors, followed by progressive recovery
- **Parser Pool Reuse**: Tree-sitter parsers are pooled and reused
- **File Index Cache**: GraphExpander lazily loads its file path index
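Rate limit recovery is commonly implemented as retry with exponential backoff. The sketch below shows that general technique, assuming the failing call throws an error object carrying `status: 429`; the `withBackoff` helper, its defaults, and the error shape are hypothetical, and the progressive recovery of concurrency mentioned above is not modeled.

```typescript
// Hypothetical retry helper: exponential backoff on HTTP 429 errors.
interface BackoffOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for testing
}

async function withBackoff<T>(task: () => Promise<T>, options: BackoffOptions = {}): Promise<T> {
  const {
    maxRetries = 5,
    baseDelayMs = 500,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      const status = (error as { status?: number } | null)?.status;
      // Only rate-limit errors are retried, and only up to maxRetries times
      if (status !== 429 || attempt >= maxRetries) throw error;
      await sleep(baseDelayMs * 2 ** attempt); // 500 ms, 1000 ms, 2000 ms, ...
    }
  }
}
```

Wrapping each embedding or rerank request in such a helper keeps transient 429 responses from failing a whole indexing run.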
## 🐛 Logs and Debugging
Log file location: `~/.contextweaver/logs/app.YYYY-MM-DD.log`
Set log level:
```bash
# Enable debug logging
LOG_LEVEL=debug contextweaver search --information-request "..."
```
## 📄 License
This project is released under the MIT License.
## 🙏 Acknowledgements
- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) - High-performance syntax parsing
- [LanceDB](https://lancedb.com/) - Embedded vector database
- [MCP](https://modelcontextprotocol.io/) - Model Context Protocol
- [SiliconFlow](https://siliconflow.cn/) - Recommended Embedding/Reranker API service
<p align="center">
<sub>Made with ❤️ for AI-assisted coding</sub>
</p>