# Crawl4AI MCP Server
[![smithery badge](https://smithery.ai/badge/@weidwonder/crawl4ai-mcp-server)](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server)
This is an intelligent information retrieval server based on MCP (Model Context Protocol). It provides AI assistant systems with powerful search capabilities and LLM-optimized webpage content understanding: through multi-engine search and intelligent content extraction, it helps AI systems efficiently acquire and comprehend internet information, converting webpage content into the formats best suited for LLM processing.
## Features
- 🔍 Powerful multi-engine search capability, supporting DuckDuckGo and Google
- 📚 LLM-optimized webpage content extraction, intelligently filtering non-core content
- 🎯 Focus on information value, automatically identifying and preserving key content
- 📝 Multiple output formats, supporting citation tracing
- 🚀 High-performance asynchronous design based on FastMCP
## Installation
### Method 1: Standard Installation (from Source)
1. Ensure your system meets the following requirements:
- Python >= 3.9
- Recommended to use a dedicated virtual environment
2. Clone the repository:
```bash
git clone https://github.com/weidwonder/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```
3. Create and activate virtual environment:
```bash
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate # Linux/Mac
# or
.\crawl4ai_env\Scripts\activate # Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. Install playwright browsers:
```bash
playwright install
```
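Optionally, you can verify that the Playwright browsers were installed correctly with a quick sanity check (illustrative only, not part of the server):
```python
# sanity_check.py - optional check that Playwright can launch a browser
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()
```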
### Method 2: Install to Claude Desktop Client via Smithery
Use [Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server) to install the Crawl4AI MCP server and automatically configure it for the Claude desktop client:
```bash
npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
```
## Usage
The server provides the following tools:
### search
Powerful web search tool supporting multiple search engines:
- DuckDuckGo search (default): No API key required, comprehensive handling of AbstractText, Results, and RelatedTopics
- Google search: Requires API key configuration, provides precise search results
- Supports using multiple engines simultaneously for more comprehensive results
Parameters:
- `query`: Search query string
- `num_results`: Number of results to return (default 10)
- `engine`: Search engine selection
- "duckduckgo": DuckDuckGo search (default)
- "google": Google search (requires API key)
- "all": Use all available search engines simultaneously
Example:
```python
# DuckDuckGo search (default)
{
"query": "python programming",
"num_results": 5
}
# Use all available engines
{
"query": "python programming",
"num_results": 5,
"engine": "all"
}
```
### read_url
LLM-optimized webpage content understanding tool, providing intelligent content extraction and format conversion. The `format` parameter selects the output format:
- `markdown_with_citations`: Markdown with inline citations (default), maintaining information traceability
- `fit_markdown`: LLM-optimized, streamlined content with redundant information removed
- `raw_markdown`: Basic HTML→Markdown conversion
- `references_markdown`: Citations/references in a separate section
- `fit_html`: The filtered HTML from which `fit_markdown` was generated
- `markdown`: Default Markdown format
Example:
```python
{
"url": "https://example.com",
"format": "markdown_with_citations"
}
```
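These format names mirror crawl4ai's markdown generation output. A minimal sketch of the underlying library call (illustrative; attribute names can vary between crawl4ai versions, and the server's actual logic lives in `src/index.py`):
```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        # Recent crawl4ai versions expose the variants listed above
        # (e.g. markdown_with_citations, fit_markdown) on the result's
        # markdown object; here we just print the default markdown.
        print(result.markdown)

asyncio.run(main())
```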
To use Google search, configure API keys in `config.json` as described in the [Configuration](#configuration) section below.
## LLM Content Optimization
The server employs a series of content optimization strategies tailored for LLMs:
- Intelligent Content Recognition: Automatically identifies and preserves article body and key information paragraphs
- Noise Filtering: Automatically filters navigation bars, advertisements, footers, and other content unhelpful for understanding
- Information Integrity: Preserves URL references, supports information traceability
- Length Optimization: Uses minimum word count threshold (10) to filter invalid segments
- Format Optimization: Default output in markdown_with_citations format, convenient for LLM understanding and citation
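As an illustration of how such filtering can be expressed, here is a sketch using crawl4ai's run configuration (the server's actual settings live in `src/index.py`, and field names may differ across crawl4ai versions):
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Drop fragments under 10 words and strip common noise elements.
    config = CrawlerRunConfig(
        word_count_threshold=10,
        excluded_tags=["nav", "footer", "aside"],
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.markdown)

asyncio.run(main())
```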
## Development Notes
Project structure:
```
crawl4ai_mcp_server/
├── src/
│ ├── index.py # Server main implementation
│ └── search.py # Search functionality implementation
├── config_demo.json # Configuration file example
├── pyproject.toml # Project configuration
├── requirements.txt # Dependency list
└── README.md # Project documentation
```
## Configuration
1. Copy the configuration example file:
```bash
cp config_demo.json config.json
```
2. To use Google search, configure API keys in config.json:
```json
{
"google": {
"api_key": "your-google-api-key",
"cse_id": "your-google-cse-id"
}
}
```
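The server reads these keys when the Google engine is requested. A minimal sketch of what such a loader might look like (a hypothetical helper for illustration; the actual logic lives in `src/search.py`):
```python
import json
from pathlib import Path

def load_google_config(path: str = "config.json"):
    """Return (api_key, cse_id) if Google search is configured, else None."""
    cfg_file = Path(path)
    if not cfg_file.exists():
        return None
    cfg = json.loads(cfg_file.read_text())
    google = cfg.get("google", {})
    if "api_key" in google and "cse_id" in google:
        return google["api_key"], google["cse_id"]
    return None
```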
## Changelog
- 2025.02.08: Added search functionality, supporting DuckDuckGo (default) and Google search
- 2025.02.07: Refactored project structure, implemented using FastMCP, optimized dependency management
- 2025.02.07: Optimized content filtering configuration, improved token efficiency while maintaining URL integrity
## License
MIT License
## Contributing
Issues and Pull Requests are welcome!
## Author
- Owner: weidwonder
- Coder: Claude Sonnet 3.5
- 100% of the code written by Claude. Cost: $9 ($2 for writing code, $7 for debugging 😭)
- Total time: 3 hours (0.5 for writing code, 0.5 for preparing the environment, 2 for debugging 😭)
## Acknowledgments
Thanks to all developers who contributed to the project!
Special thanks to:
- The [Crawl4ai](https://github.com/crawl4ai/crawl4ai) project, for providing excellent webpage content extraction support