# Crawl4AI MCP Server
[![smithery badge](https://smithery.ai/badge/@weidwonder/crawl4ai-mcp-server)](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server)
This is an intelligent information retrieval server based on MCP (Model Context Protocol). It provides AI assistant systems with powerful search capabilities and LLM-optimized webpage content understanding: through multi-engine search and intelligent content extraction, it helps AI systems efficiently acquire and comprehend internet information, converting webpage content into the formats best suited for LLM processing.
## Features
- 🔍 Powerful multi-engine search capability, supporting DuckDuckGo and Google
- 📚 LLM-optimized webpage content extraction, intelligently filtering non-core content
- 🎯 Focus on information value, automatically identifying and preserving key content
- 📝 Multiple output formats, supporting citation tracing
- 🚀 High-performance asynchronous design based on FastMCP
## Installation
### Method 1: Standard Installation (from Source)
1. Ensure your system meets the following requirements:
- Python >= 3.9
- Recommended to use a dedicated virtual environment
2. Clone the repository:
```bash
git clone https://github.com/weidwonder/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```
3. Create and activate virtual environment:
```bash
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate # Linux/Mac
# or
.\crawl4ai_env\Scripts\activate # Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. Install playwright browsers:
```bash
playwright install
```
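Optionally, you can verify that the Playwright browsers were installed correctly with a quick sanity check (illustrative only, not part of the server):
```python
# sanity_check.py - optional check that Playwright can launch a browser
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()
```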
### Method 2: Install to Claude Desktop Client via Smithery
Use [Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server) to install the Crawl4AI MCP server and automatically configure it for the Claude desktop client:
```bash
npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
```
## Usage
The server provides the following tools:
### search
Powerful web search tool supporting multiple search engines:
- DuckDuckGo search (default): No API key required, comprehensive handling of AbstractText, Results, and RelatedTopics
- Google search: Requires API key configuration, provides precise search results
- Supports using multiple engines simultaneously for more comprehensive results
Parameters:
- `query`: Search query string
- `num_results`: Number of results to return (default 10)
- `engine`: Search engine selection
- "duckduckgo": DuckDuckGo search (default)
- "google": Google search (requires API key)
- "all": Use all available search engines simultaneously
Example:
```python
# DuckDuckGo search (default)
{
"query": "python programming",
"num_results": 5
}
# Use all available engines
{
"query": "python programming",
"num_results": 5,
"engine": "all"
}
```
### read_url
LLM-optimized webpage content understanding tool, providing intelligent content extraction and format conversion. The `format` parameter selects the output format:
- `markdown_with_citations`: Markdown with inline citations (default), maintaining information traceability
- `fit_markdown`: LLM-optimized, streamlined content with redundant information removed
- `raw_markdown`: Basic HTML→Markdown conversion
- `references_markdown`: Citations/references in a separate section
- `fit_html`: The filtered HTML from which `fit_markdown` was generated
- `markdown`: Default Markdown format
Example:
```python
{
"url": "https://example.com",
"format": "markdown_with_citations"
}
```
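These format names mirror crawl4ai's markdown generation output. A minimal sketch of the underlying library call (illustrative; attribute names can vary between crawl4ai versions, and the server's actual logic lives in `src/index.py`):
```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # Recent crawl4ai versions expose the variants listed above
        # (e.g. markdown_with_citations, fit_markdown) on the result's
        # markdown object; here we just print the default markdown.
        print(result.markdown)

asyncio.run(main())
```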
To use Google search, configure API keys in `config.json` as described in the [Configuration](#configuration) section below.
## LLM Content Optimization
The server employs a series of content optimization strategies tailored for LLMs:
- Intelligent Content Recognition: Automatically identifies and preserves article body and key information paragraphs
- Noise Filtering: Automatically filters navigation bars, advertisements, footers, and other content unhelpful for understanding
- Information Integrity: Preserves URL references, supports information traceability
- Length Optimization: Uses minimum word count threshold (10) to filter invalid segments
- Format Optimization: Default output in markdown_with_citations format, convenient for LLM understanding and citation
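As an illustration of how such filtering can be expressed, here is a sketch using crawl4ai's run configuration (the server's actual settings live in `src/index.py`, and field names may differ across crawl4ai versions):
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Drop fragments under 10 words and strip common noise elements.
    config = CrawlerRunConfig(
        word_count_threshold=10,
        excluded_tags=["nav", "footer", "aside"],
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)

asyncio.run(main())
```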
## Development Notes
Project structure:
```
crawl4ai_mcp_server/
├── src/
│ ├── index.py # Server main implementation
│ └── search.py # Search functionality implementation
├── config_demo.json # Configuration file example
├── pyproject.toml # Project configuration
├── requirements.txt # Dependency list
└── README.md # Project documentation
```
## Configuration
1. Copy the configuration example file:
```bash
cp config_demo.json config.json
```
2. To use Google search, configure API keys in config.json:
```json
{
"google": {
"api_key": "your-google-api-key",
"cse_id": "your-google-cse-id"
}
}
```
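The server reads these keys when the Google engine is requested. A minimal sketch of what such a loader might look like (a hypothetical helper for illustration; the actual logic lives in `src/search.py`):
```python
import json
from pathlib import Path

def load_google_config(path: str = "config.json"):
    """Return (api_key, cse_id) if Google search is configured, else None."""
    cfg_file = Path(path)
    if not cfg_file.exists():
        return None
    cfg = json.loads(cfg_file.read_text())
    google = cfg.get("google", {})
    if "api_key" in google and "cse_id" in google:
        return google["api_key"], google["cse_id"]
    return None
```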
## Changelog
- 2025.02.08: Added search functionality, supporting DuckDuckGo (default) and Google search
- 2025.02.07: Refactored project structure, implemented using FastMCP, optimized dependency management
- 2025.02.07: Optimized content filtering configuration, improved token efficiency while maintaining URL integrity
## License
MIT License
## Contributing
Issues and Pull Requests are welcome!
## Author
- Owner: weidwonder
- Coder: Claude Sonnet 3.5
- 100% of the code written by Claude. Cost: $9 ($2 for writing code, $7 for debugging 😭)
- Total time: 3 hours (0.5 for writing code, 0.5 for preparing the environment, 2 for debugging 😭)
## Acknowledgments
Thanks to all developers who contributed to the project!
Special thanks to:
- The [Crawl4ai](https://github.com/crawl4ai/crawl4ai) project, for providing excellent webpage content extraction support