# Crawl4AI MCP Server
[Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server)
This is an intelligent information retrieval server built on MCP (Model Context Protocol). It provides AI assistant systems with powerful search capabilities and web content understanding optimized for LLMs (Large Language Models). Through multi-engine search and intelligent content extraction, it helps AI systems efficiently acquire and understand internet information, converting web content into the format best suited for LLM processing.
## Features
- 🔍 Powerful multi-engine search capabilities, supporting DuckDuckGo and Google
- 📚 Web content extraction optimized for LLM, intelligently filtering out non-core content
- 🎯 Focused on information value, automatically identifying and retaining key content
- 📝 Multiple output formats, supporting citation tracing
- 🚀 High-performance asynchronous design based on FastMCP
## Installation
### Method 1: Manual installation (most scenarios)
1. Ensure your system meets the following requirements:
- Python >= 3.9
- It is recommended to use a dedicated virtual environment
2. Clone the repository:
```bash
git clone https://github.com/weidwonder/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```
3. Create and activate a virtual environment:
```bash
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate # Linux/Mac
# or
.\crawl4ai_env\Scripts\activate # Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. Install the playwright browser:
```bash
playwright install
```
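After installation, you can register the server with an MCP client. Below is a minimal sketch of a manual entry for the Claude desktop client's `claude_desktop_config.json`; the paths are illustrative, and `command` should point at the Python inside your virtual environment:
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/path/to/crawl4ai_env/bin/python",
      "args": ["/path/to/crawl4ai-mcp-server/src/index.py"]
    }
  }
}
```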
### Method 2: Install to Claude Desktop Client via Smithery
Install and configure the Crawl4AI MCP server for the Claude desktop client automatically via [Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server):
```bash
npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
```
## Usage
The server provides the following tools:
### search
A powerful web search tool that supports multiple search engines:
- DuckDuckGo search (default): No API key required; fully processes the AbstractText, Results, and RelatedTopics fields (see the sketch after this list)
- Google search: Requires API key configuration, provides accurate search results
- Supports using multiple engines simultaneously for more comprehensive results
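For reference, AbstractText, Results, and RelatedTopics are fields of the public DuckDuckGo Instant Answer API response. A minimal sketch of how such a response might be flattened into text snippets (illustrative only; the server's actual implementation lives in `src/search.py`):
```python
import requests

def duckduckgo_snippets(query: str) -> list[str]:
    """Query the DuckDuckGo Instant Answer API and flatten key fields into snippets."""
    resp = requests.get(
        "https://api.duckduckgo.com/",
        params={"q": query, "format": "json", "no_html": 1},
        timeout=10,
    )
    data = resp.json()
    snippets = []
    if data.get("AbstractText"):          # summary paragraph, if present
        snippets.append(data["AbstractText"])
    for item in data.get("Results", []):  # direct result entries
        snippets.append(f"{item.get('Text', '')} ({item.get('FirstURL', '')})")
    for topic in data.get("RelatedTopics", []):  # related entries
        if "Text" in topic:
            snippets.append(topic["Text"])
    return snippets
```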
Parameter description:
- `query`: Search query string
- `num_results`: Number of results to return (default: 10)
- `engine`: Search engine selection
  - `"duckduckgo"`: DuckDuckGo search (default)
  - `"google"`: Google search (requires API key)
  - `"all"`: Use all available search engines simultaneously
Example:
```python
# DuckDuckGo search (default)
{
    "query": "python programming",
    "num_results": 5
}

# Google search (requires API key)
{
    "query": "python programming",
    "num_results": 5,
    "engine": "google"
}

# Using all available engines
{
    "query": "python programming",
    "num_results": 5,
    "engine": "all"
}
```
To use Google search, you need to configure the API key in `config.json` (see Configuration Instructions below).
### read_url
A web content understanding tool optimized for LLM, providing intelligent content extraction and format conversion.
Parameter description:
- `url`: URL of the web page to read
- `format`: Output format (default: `markdown_with_citations`)
  - `markdown_with_citations`: Markdown with inline citations, preserving information traceability
  - `fit_markdown`: LLM-optimized concise content with redundant information removed
  - `raw_markdown`: Basic HTML-to-Markdown conversion
  - `references_markdown`: References as a separate section
  - `fit_html`: Filtered HTML generated from fit_markdown
  - `markdown`: Standard Markdown format
Example:
```python
{
    "url": "https://example.com",
    "format": "markdown_with_citations"
}
```
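Both tools are invoked over the MCP protocol. A minimal sketch of calling them from Python with the official `mcp` client SDK over stdio, assuming the server entry point is `src/index.py`:
```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a stdio subprocess (command/path are illustrative)
    server = StdioServerParameters(command="python", args=["src/index.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search", arguments={"query": "python programming", "num_results": 5}
            )
            print(result)

asyncio.run(main())
```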
## LLM Content Optimization
The server applies a series of LLM-oriented content optimization strategies:
- Intelligent content recognition: Automatically identifies and retains the main body of the article and key information paragraphs
- Noise filtering: Automatically filters out navigation bars, ads, footers, and other content that is not helpful for understanding
- Information integrity: Retains URL references, supporting information tracing
- Length optimization: Filters out low-value segments using a minimum word-count threshold (10 words)
- Format optimization: Defaults to outputting in markdown_with_citations format, facilitating LLM understanding and citation
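These strategies map onto crawl4ai's crawling options. A minimal sketch of the kind of call involved, assuming crawl4ai's `AsyncWebCrawler` API; the server's actual filtering pipeline lives in `src/index.py` and may differ:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10,  # drop fragments shorter than 10 words
        )
        print(result.markdown)

asyncio.run(main())
```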
## Development Instructions
Project structure:
```
crawl4ai_mcp_server/
├── src/
│   ├── index.py          # Main server implementation
│   └── search.py         # Search functionality
├── config_demo.json      # Example configuration file
├── pyproject.toml        # Project configuration
├── requirements.txt      # Dependency list
└── README.md             # Project documentation
```
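For orientation, FastMCP registers tools with a decorator. A minimal sketch of the pattern (names are illustrative; the real definitions are in `src/index.py`, and depending on the installed package the import may instead be `from fastmcp import FastMCP`):
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crawl4ai")

@mcp.tool()
async def search(query: str, num_results: int = 10, engine: str = "duckduckgo") -> str:
    """Run a web search and return LLM-friendly results."""
    ...  # dispatch to the engines implemented in src/search.py

if __name__ == "__main__":
    mcp.run()
```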
## Configuration Instructions
1. Copy the configuration example file:
```bash
cp config_demo.json config.json
```
2. If you want to use Google search, configure the API key in config.json:
```json
{
"google": {
"api_key": "your-google-api-key",
"cse_id": "your-google-cse-id"
}
}
```
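These credentials correspond to the Google Custom Search JSON API. A minimal sketch of how they might be used (illustrative only; the server's actual Google backend lives in `src/search.py`):
```python
import json
import requests

def google_search(query: str, num_results: int = 10) -> list[dict]:
    """Query the Google Custom Search JSON API with the configured credentials."""
    with open("config.json") as f:
        cfg = json.load(f)["google"]
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": cfg["api_key"],
            "cx": cfg["cse_id"],
            "q": query,
            "num": min(num_results, 10),  # the API returns at most 10 results per request
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```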
## Changelog
- 2025.02.08: Added search functionality, supporting DuckDuckGo (default) and Google search
- 2025.02.07: Refactored project structure, implemented using FastMCP, optimized dependency management
- 2025.02.07: Optimized content filtering configuration, improved token efficiency while maintaining URL integrity
## License
MIT License
## Contribution
Contributions in the form of Issues and Pull Requests are welcome!
## Authors
- Owner: weidwonder
- Coder: Claude 3.5 Sonnet
- 100% of the code was written by Claude. Cost: $9 ($2 for code writing, $7 for debugging 😭)
- Total time: 3 hours (0.5 for code writing, 0.5 for environment preparation, 2 for debugging 😭)
## Acknowledgments
Thanks to all developers who contributed to the project!
Special thanks to:
- The [Crawl4AI](https://github.com/crawl4ai/crawl4ai) project, for its excellent web content extraction technology.