# Crawl4AI MCP Server
[Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server)
This is an intelligent information retrieval server built on MCP (Model Context Protocol). It provides AI assistant systems with powerful search capabilities and web content understanding optimized for LLMs (Large Language Models). Through multi-engine search and intelligent content extraction, it helps AI systems efficiently acquire and understand internet information, converting web content into the format best suited for LLM processing.
## Features
- 🔍 Powerful multi-engine search capabilities, supporting DuckDuckGo and Google
- 📚 Web content extraction optimized for LLM, intelligently filtering out non-core content
- 🎯 Focused on information value, automatically identifying and retaining key content
- 📝 Multiple output formats, supporting citation tracing
- 🚀 High-performance asynchronous design based on FastMCP
## Installation
### Method 1: Manual installation (most scenarios)
1. Ensure your system meets the following requirements:
- Python >= 3.9
- It is recommended to use a dedicated virtual environment
2. Clone the repository:
```bash
git clone https://github.com/weidwonder/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```
3. Create and activate a virtual environment:
```bash
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate # Linux/Mac
# or
.\crawl4ai_env\Scripts\activate # Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. Install the playwright browser:
```bash
playwright install
```
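After installation, you can register the server with an MCP client. Below is a minimal sketch of a manual entry for the Claude desktop client's `claude_desktop_config.json`; the paths are illustrative, and `command` should point at the Python inside your virtual environment:
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/path/to/crawl4ai_env/bin/python",
      "args": ["/path/to/crawl4ai-mcp-server/src/index.py"]
    }
  }
}
```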
### Method 2: Install to Claude Desktop Client via Smithery
Install and configure the Crawl4AI MCP server for the Claude desktop client automatically via [Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server):
```bash
npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
```
## Usage
The server provides the following tools:
### search
A powerful web search tool that supports multiple search engines:
- DuckDuckGo search (default): No API key required; fully processes the AbstractText, Results, and RelatedTopics fields (see the sketch after this list)
- Google search: Requires API key configuration, provides accurate search results
- Supports using multiple engines simultaneously for more comprehensive results
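For reference, AbstractText, Results, and RelatedTopics are fields of the public DuckDuckGo Instant Answer API response. A minimal sketch of how such a response might be flattened into text snippets (illustrative only; the server's actual implementation lives in `src/search.py`):
```python
import requests

def duckduckgo_snippets(query: str) -> list[str]:
    """Query the DuckDuckGo Instant Answer API and flatten key fields into snippets."""
    resp = requests.get(
        "https://api.duckduckgo.com/",
        params={"q": query, "format": "json", "no_html": 1},
        timeout=10,
    )
    data = resp.json()
    snippets = []
    if data.get("AbstractText"):          # summary paragraph, if present
        snippets.append(data["AbstractText"])
    for item in data.get("Results", []):  # direct result entries
        snippets.append(f"{item.get('Text', '')} ({item.get('FirstURL', '')})")
    for topic in data.get("RelatedTopics", []):  # related entries
        if "Text" in topic:
            snippets.append(topic["Text"])
    return snippets
```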
Parameter description:
- `query`: Search query string
- `num_results`: Number of results to return (default: 10)
- `engine`: Search engine selection
  - `"duckduckgo"`: DuckDuckGo search (default)
  - `"google"`: Google search (requires API key)
  - `"all"`: Use all available search engines simultaneously
Example:
```python
# DuckDuckGo search (default)
{
    "query": "python programming",
    "num_results": 5
}

# Google search (requires API key)
{
    "query": "python programming",
    "num_results": 5,
    "engine": "google"
}

# Using all available engines
{
    "query": "python programming",
    "num_results": 5,
    "engine": "all"
}
```
To use Google search, you need to configure the API key in `config.json` (see Configuration Instructions below).
### read_url
A web content understanding tool optimized for LLM, providing intelligent content extraction and format conversion.
Parameter description:
- `url`: URL of the web page to read
- `format`: Output format (default: `markdown_with_citations`)
  - `markdown_with_citations`: Markdown with inline citations, preserving information traceability
  - `fit_markdown`: LLM-optimized concise content with redundant information removed
  - `raw_markdown`: Basic HTML-to-Markdown conversion
  - `references_markdown`: References as a separate section
  - `fit_html`: Filtered HTML generated from fit_markdown
  - `markdown`: Standard Markdown format
Example:
```python
{
    "url": "https://example.com",
    "format": "markdown_with_citations"
}
```
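Both tools are invoked over the MCP protocol. A minimal sketch of calling them from Python with the official `mcp` client SDK over stdio, assuming the server entry point is `src/index.py`:
```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server as a stdio subprocess (command/path are illustrative)
    server = StdioServerParameters(command="python", args=["src/index.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search", arguments={"query": "python programming", "num_results": 5}
            )
            print(result)

asyncio.run(main())
```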
## LLM Content Optimization
The server applies a series of LLM-oriented content optimization strategies:
- Intelligent content recognition: Automatically identifies and retains the main body of the article and key information paragraphs
- Noise filtering: Automatically filters out navigation bars, ads, footers, and other content that is not helpful for understanding
- Information integrity: Retains URL references, supporting information tracing
- Length optimization: Filters out low-value segments using a minimum word-count threshold (10 words)
- Format optimization: Defaults to outputting in markdown_with_citations format, facilitating LLM understanding and citation
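These strategies map onto crawl4ai's crawling options. A minimal sketch of the kind of call involved, assuming crawl4ai's `AsyncWebCrawler` API; the server's actual filtering pipeline lives in `src/index.py` and may differ:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            word_count_threshold=10,  # drop fragments shorter than 10 words
        )
        print(result.markdown)

asyncio.run(main())
```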
## Development Instructions
Project structure:
```
crawl4ai_mcp_server/
├── src/
│   ├── index.py          # Main server implementation
│   └── search.py         # Search functionality
├── config_demo.json      # Example configuration file
├── pyproject.toml        # Project configuration
├── requirements.txt      # Dependency list
└── README.md             # Project documentation
```
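For orientation, FastMCP registers tools with a decorator. A minimal sketch of the pattern (names are illustrative; the real definitions are in `src/index.py`, and depending on the installed package the import may instead be `from fastmcp import FastMCP`):
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crawl4ai")

@mcp.tool()
async def search(query: str, num_results: int = 10, engine: str = "duckduckgo") -> str:
    """Run a web search and return LLM-friendly results."""
    ...  # dispatch to the engines implemented in src/search.py

if __name__ == "__main__":
    mcp.run()
```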
## Configuration Instructions
1. Copy the configuration example file:
```bash
cp config_demo.json config.json
```
2. If you want to use Google search, configure the API key in config.json:
```json
{
"google": {
"api_key": "your-google-api-key",
"cse_id": "your-google-cse-id"
}
}
```
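These credentials correspond to the Google Custom Search JSON API. A minimal sketch of how they might be used (illustrative only; the server's actual Google backend lives in `src/search.py`):
```python
import json
import requests

def google_search(query: str, num_results: int = 10) -> list[dict]:
    """Query the Google Custom Search JSON API with the configured credentials."""
    with open("config.json") as f:
        cfg = json.load(f)["google"]
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": cfg["api_key"],
            "cx": cfg["cse_id"],
            "q": query,
            "num": min(num_results, 10),  # the API returns at most 10 results per request
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```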
## Changelog
- 2025.02.08: Added search functionality, supporting DuckDuckGo (default) and Google search
- 2025.02.07: Refactored project structure, implemented using FastMCP, optimized dependency management
- 2025.02.07: Optimized content filtering configuration, improved token efficiency while maintaining URL integrity
## License
MIT License
## Contribution
Contributions in the form of Issues and Pull Requests are welcome!
## Authors
- Owner: weidwonder
- Coder: Claude 3.5 Sonnet
- 100% of the code was written by Claude. Cost: $9 ($2 for code writing, $7 for debugging 😭)
- Total time: 3 hours (0.5 for code writing, 0.5 for environment preparation, 2 for debugging 😭)
## Acknowledgments
Thanks to all developers who contributed to the project!
Special thanks to:
- The [Crawl4AI](https://github.com/crawl4ai/crawl4ai) project, for its excellent web content extraction technology.