<h1 align="center">Crawl4AI RAG MCP Server</h1>
<p align="center">
<em>Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants</em>
</p>
A powerful implementation of the [Model Context Protocol (MCP)](https://modelcontextprotocol.io) integrated with [Crawl4AI](https://crawl4ai.com) and [Supabase](https://supabase.com/) that provides AI agents and AI coding assistants with advanced web crawling and RAG capabilities.
The primary goal is to bring this MCP server into [Archon](https://github.com/coleam00/Archon) as I evolve it to be more of a knowledge engine for AI coding assistants to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved upon greatly soon, especially making it more configurable so you can use different embedding models and run everything locally with Ollama.
## Overview
This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows best practices for building MCP servers, based on the [Mem0 MCP server template](https://github.com/coleam00/mcp-mem0/) I previously shared on my channel.
## Features
- **Smart URL Detection**: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
- **Recursive Crawling**: Follows internal links to discover content
- **Parallel Processing**: Efficiently crawls multiple pages simultaneously
- **Content Chunking**: Intelligently splits content by headers and size for better processing (see the sketch after this list)
- **Vector Search**: Performs RAG over crawled content, optionally filtering by data source for precision
- **Source Retrieval**: Retrieve sources available for filtering to guide the RAG process
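To make the chunking behavior concrete, here is a minimal sketch of header-then-size splitting. It is illustrative only and not the server's actual implementation; the function name `chunk_markdown` and the 4000-character cap are assumptions:

```python
# Illustrative sketch of header-then-size chunking (not the server's actual code).
def chunk_markdown(markdown: str, max_chars: int = 4000) -> list[str]:
    """Split a Markdown document on headers, then cap each piece at max_chars."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new section whenever a header line is encountered.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Enforce the size cap with a simple character window.
    chunks: list[str] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```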
## Tools
The server provides four essential web crawling and search tools:
1. **`crawl_single_page`**: Quickly crawl a single web page and store its content in the vector database
2. **`smart_crawl_url`**: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
3. **`get_available_sources`**: Get a list of all available sources (domains) in the database
4. **`perform_rag_query`**: Search for relevant content using semantic search with optional source filtering (see the client sketch below)
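For a rough picture of how an agent would chain these tools, here is a minimal client-side sketch using the official `mcp` Python SDK over SSE. The URL assumes the default host/port from the configuration section below, and the argument names passed to `perform_rag_query` are assumptions; check the tool schemas exposed by the server for the exact parameters.

```python
# Minimal sketch of calling the server's tools over SSE with the `mcp` Python SDK.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    # Assumes the server is running locally with TRANSPORT=sse on port 8051.
    async with sse_client("http://localhost:8051/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1. See which sources (domains) are already in the database.
            sources = await session.call_tool("get_available_sources", {})
            print(sources)

            # 2. Run a RAG query, optionally scoped to one source.
            #    The argument names here are assumptions for illustration.
            answer = await session.call_tool(
                "perform_rag_query",
                {"query": "How do I configure the crawler?", "source": "crawl4ai.com"},
            )
            print(answer)

asyncio.run(main())
```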
## Prerequisites
- Python 3.12+
- Supabase project (used as the vector database)
- OpenAI API key (for generating embeddings)
- Docker/Docker Desktop if running the MCP server as a container (recommended)
## Installation
### Using Docker (Recommended)
1. Build the Docker image:
```bash
docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
```
2. Create a `.env` file based on the configuration section below
### Using uv directly (no Docker)
1. Install uv if you don't have it:
```bash
pip install uv
```
2. Clone this repository:
```bash
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
```
3. Create and activate a virtual environment:
```bash
uv venv
.venv\Scripts\activate
# on Mac/Linux: source .venv/bin/activate
```
4. Install dependencies:
```bash
uv pip install -e .
crawl4ai-setup
```
5. Create a `.env` file based on the configuration section below
## Database Setup
Before running the server, you need to set up the database with the pgvector extension:
1. Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
2. Create a new query and paste the contents of `crawled_pages.sql`
3. Run the query to create the necessary tables and functions
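If you want to sanity-check the setup before crawling anything, a quick query from Python works. This sketch assumes the table created by `crawled_pages.sql` is named `crawled_pages` (check the SQL file if your schema differs) and that your `.env` file from the next section is already in place; an empty result is fine, while an error points at a setup problem.

```python
# Quick check that the table from crawled_pages.sql exists and is reachable.
import os

from dotenv import load_dotenv
from supabase import create_client

load_dotenv()
client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

# Assumes the table is named "crawled_pages"; adjust if your SQL uses a different name.
rows = client.table("crawled_pages").select("id").limit(1).execute()
print(rows.data)
```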
## Configuration
Create a `.env` file in the project root with the following variables:
```
# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse

# OpenAI API Configuration
OPENAI_API_KEY=your_openai_api_key

# Supabase Configuration
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
```
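To verify the OpenAI key before starting the server, you can request a single test embedding. This is just a smoke test; the model name below is an assumption and may differ from whatever the server actually uses.

```python
# One-off check that the OpenAI key in .env can generate embeddings.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# The model name here is only an example; the server may use a different one.
response = client.embeddings.create(model="text-embedding-3-small", input="hello world")
print(len(response.data[0].embedding))  # embedding dimension, e.g. 1536
```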
## Running the Server
### Using Docker
```bash
docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag
```
### Using Python
```bash
uv run src/crawl4ai_mcp.py
```
The server will start and listen on the configured host and port.
## Integration with MCP Clients
### SSE Configuration
Once you have the server running with SSE transport, you can connect to it using this configuration:
```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}
```
> **Note for Windsurf users**: Use `serverUrl` instead of `url` in your configuration:
> ```json
> {
>   "mcpServers": {
>     "crawl4ai-rag": {
>       "transport": "sse",
>       "serverUrl": "http://localhost:8051/sse"
>     }
>   }
> }
> ```
>
> **Note for Docker users**: Use `host.docker.internal` instead of `localhost` if your client is running in a different container. This will apply if you are using this MCP server within n8n!
### Stdio Configuration
Add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client:
```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/mcp-crawl4ai-rag/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}
```
### Docker with Stdio Configuration
```json
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "mcp/crawl4ai-rag"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key"
      }
    }
  }
}
```
## Building Your Own Server
This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:
1. Add your own tools by creating functions with the `@mcp.tool()` decorator (see the sketch after this list)
2. Create your own lifespan function to add your own dependencies
3. Modify the `utils.py` file for any helper functions you need
4. Extend the crawling capabilities by adding more specialized crawlers
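As a rough sketch of step 1, here is what registering an extra tool could look like, assuming the server's `mcp` object is a `FastMCP` instance (which the `@mcp.tool()` decorator suggests). The tool name and behavior are made up for illustration; in practice you would add the function to the existing instance in `src/crawl4ai_mcp.py` rather than create a second server.

```python
# Illustrative sketch of registering an extra tool with FastMCP.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crawl4ai-rag-extended")

@mcp.tool()
async def word_count(url: str) -> str:
    """Hypothetical tool: report how long a page is before deciding to crawl it."""
    import httpx

    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    return f"{len(response.text.split())} words at {url}"

if __name__ == "__main__":
    mcp.run(transport="stdio")
```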