# Export-Zhihu-Collections
## Project Introduction
A powerful Zhihu collection export tool that batch-exports Zhihu collections (public and private) to Markdown files. It supports custom output paths, cross-platform operation, image downloading, and Obsidian-compatible output, and can generate article summaries in one click via large-model APIs.
**Also provides MCP Server**, which can be directly called by AI Agents (such as Claude Code) to provide large models with the ability to save Zhihu collections.
## Key Features
- 📚 **Batch Processing**: Supports processing multiple collections simultaneously
- 📂 **Custom Output**: Supports users specifying the output directory
- 🖥️ **Cross-Platform Compatibility**: Supports systems such as Windows, Linux, and macOS
- ✨ **Intelligent Deduplication**: Automatically checks for existing files to avoid duplicate downloads
- 🖼️ **Image Downloading**: Automatically downloads and saves images from articles
- 📝 **Obsidian Compatible**: The output Markdown format is fully compatible with Obsidian
- 📈 **Real-Time Logging**: Supports real-time log refreshing to immediately display processing progress
- 🔍 **Automatic Collection Acquisition**: Automatically retrieves the user's collection list through an independent script
- 🛠️ **Robust Error Handling**: Failure of a single article does not affect the overall processing flow
- 🔧 **Enhanced Debugging Support**: Automatically generates debugging files to facilitate troubleshooting
- 🤖 **MCP Support**: Provides MCP Server that can be called by AI Agents
---
## Two Ways to Run
### Method 1: Command Line Execution (Traditional Method)
Suitable for running export tasks directly in the terminal.
#### Installation
```bash
pip install -r requirements.txt
```
#### Configuration File Settings
First, create the main configuration file:
```bash
# Copy the configuration example file
cp config_examples.json config.json
```
Edit the `config.json` file to configure your collection information:
```json
{
  "zhihuUrls": [
    {
      "name": "Collection Name",
      "url": "https://www.zhihu.com/collection/123456789"
    },
    {
      "name": "Another Collection",
      "url": "https://www.zhihu.com/collection/987654321"
    }
  ],
  "outputPath": "/path/to/your/output/directory",
  "os": "linux"
}
```
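As an illustration of what the program expects from this file, here is a minimal sketch of how such a config might be loaded and validated before an export run. The helper name and exact checks are assumptions for the example; `main.py` may validate differently.

```python
import json

# Keys every entry in "zhihuUrls" must carry, per the example config above.
REQUIRED_KEYS = {"name", "url"}

def load_config(text):
    """Parse the config JSON and check the fields each collection entry needs."""
    config = json.loads(text)
    for entry in config.get("zhihuUrls", []):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"collection entry missing keys: {missing}")
    return config
```

A quick sanity check like this catches a malformed `config.json` early, before any network requests are made.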
The collection URLs for `zhihuUrls` can also be obtained automatically, as described in the next section.
#### Automatically Get Collections (Optional)
If you want to automatically get all collections, you can run:
```bash
python fetch_collections.py
```
This will automatically update the collection list in the `config.json` file.
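Conceptually, updating the list means merging the freshly fetched collections into the existing `zhihuUrls` without duplicating entries. The following is a sketch of that merge step, assuming deduplication by URL (the actual logic in `fetch_collections.py` may differ):

```python
def merge_collections(existing, fetched):
    """Merge newly fetched collections into the existing list, deduplicating by URL."""
    seen = {c["url"] for c in existing}
    merged = list(existing)
    for c in fetched:
        if c["url"] not in seen:
            merged.append(c)
            seen.add(c["url"])
    return merged
```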
#### Run the Main Program
```bash
python main.py
```
---
### Method 2: MCP Server Execution (AI Agent Method)
Suitable for integration and calling by Claude Code or other AI tools.
#### Install MCP Dependencies
```bash
# Install the MCP package
pip install mcp
```
#### Start MCP Server
```bash
python mcp_server.py
```
The server communicates through stdio and can be used in Claude Code or other MCP clients.
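To illustrate the stdio model: the client writes one JSON-RPC-style request per line and reads a JSON response back. The sketch below is a simplified, hypothetical dispatcher, not the actual `mcp_server.py` (which uses the `mcp` package); the tool registry and response shape are assumptions for the example.

```python
import json

# Hypothetical tool registry; the real server registers tools via the mcp package.
TOOLS = {
    "list_collections": lambda params: [c["name"] for c in params.get("collections", [])],
}

def handle_request(line):
    """Parse one request line and dispatch it to the matching tool."""
    request = json.loads(line)
    tool = TOOLS.get(request.get("method"))
    if tool is None:
        return {"id": request.get("id"), "error": {"message": "unknown tool"}}
    return {"id": request.get("id"), "result": tool(request.get("params", {}))}
```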
#### Available Tools
| Tool Name | Description |
|--------|------|
| `list_collections` | Lists all Zhihu collections in the configuration file |
| `export_collection` | Exports the specified Zhihu collection as a Markdown file |
| `get_collection_info` | Gets basic information about the specified collection (number of articles, etc.) |
| `search_collections` | Searches for collections containing keywords in the configuration file |
##### list_collections
No parameters. Returns a formatted list of all collections in the configuration file.
##### export_collection
Parameters:
- `collection_url` (required): Collection URL
- `collection_name` (optional): Collection name
- `output_dir` (optional): Output directory
```python
# Example
export_collection(
    collection_url="https://www.zhihu.com/collection/123456789",
    collection_name="Python Learning",
    output_dir="my_downloads"
)
```
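Since `collection_name` and `output_dir` are optional, the server has to fill in defaults. A plausible sketch of that logic, deriving a fallback name from the collection ID in the URL (the helper name and the `downloads` default are assumptions for illustration):

```python
import re

def resolve_params(collection_url, collection_name=None, output_dir=None):
    """Fill in optional parameters: fall back to the collection ID and a default directory."""
    match = re.search(r"/collection/(\d+)", collection_url)
    if match is None:
        raise ValueError(f"not a Zhihu collection URL: {collection_url}")
    name = collection_name or f"collection_{match.group(1)}"
    return name, output_dir or "downloads"
```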
##### get_collection_info
Parameters:
- `collection_url` (required): Collection URL
##### search_collections
Parameters:
- `keyword` (required): Search keyword
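The search itself is presumably a simple name match over the configured collections; a minimal sketch (case-insensitive substring matching is an assumption, not confirmed behavior):

```python
def search_collections(collections, keyword):
    """Return collections whose name contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [c for c in collections if kw in c["name"].lower()]
```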
#### Using in Claude Code
Add the server configuration to your MCP client's configuration file (for Claude Code, `.mcp.json` in the project root or your user-level settings):
```json
{
  "mcpServers": {
    "zhihu-collections": {
      "command": "python",
      "args": ["mcp_server.py"],
      "env": {},
      "cwd": "Your project path"
    }
  }
}
```
Then you can use natural language to call:
- "List my Zhihu collections"
- "Export collection https://www.zhihu.com/collection/123456789"
- "Search for collections containing Python"
- "Get the number of articles in collection https://www.zhihu.com/collection/123456789"
---
## Configuration Options
### Basic Configuration
- **zhihuUrls**: Collection list, each collection contains a name and URL
- **outputPath**: Custom output path (optional, leave blank to use the default path)
- **os**: Operating system type (optional, leave blank to automatically detect)
### Supported Operating Systems and Path Formats
- **Windows**: `"D:\\Documents\\ZhihuExports"` or `"D:/Documents/ZhihuExports"`
- **Linux/Unix**: `"/usr/local/share/zhihu-exports"`
- **macOS**: `"~/Documents/ZhihuExports"`
- **Cygwin**: `"/cygdrive/d/Documents/ZhihuExports"`
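To accept all of the path formats above, the tool presumably normalizes separators and expands `~` before use. A minimal sketch of such normalization (the helper name is an assumption; the actual path handling in the project may differ):

```python
from pathlib import Path

def normalize_output_path(raw):
    """Unify separators, then expand a leading ~ to the user's home directory."""
    return Path(raw.replace("\\", "/")).expanduser()
```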
### Private Collection Access
For private collections, you need to create a `cookies.json` file:
```json
[
  {"name": "cookie_name", "value": "cookie_value"},
  {"name": "another_cookie", "value": "another_value"}
]
```
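For reference, a list of cookies in this shape maps directly onto a single `Cookie` HTTP header. A sketch of that conversion (the helper name is hypothetical; how the project actually attaches cookies is not shown here):

```python
import json

def cookies_to_header(cookies_json):
    """Convert the cookies.json list into a single Cookie header value."""
    cookies = json.loads(cookies_json)
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```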
## Output Results
### File Structure
```
Output Directory/
├── Collection Name 1/
│   ├── Article 1.md
│   ├── Article 2.md
│   └── assets/
│       ├── image1.jpg
│       └── image2.png
├── Collection Name 2/
├── logs/
│   ├── debug_20240101_120000.log
│   └── 20240101_120000.json
└── debug/
    ├── debug_answer_123456.html
    └── debug_post_789012.html
```
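Two details of this layout are worth sketching: article titles must be sanitized into valid filenames, and the deduplication feature amounts to skipping any article whose file already exists. The helper names and the exact character set replaced are assumptions for illustration:

```python
import re
from pathlib import Path

def article_path(output_dir, collection, title):
    """Build the target .md path, replacing characters that are invalid on Windows."""
    safe = re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"
    return Path(output_dir) / collection / f"{safe}.md"

def should_skip(path):
    """Deduplication check: an article whose file already exists is not re-downloaded."""
    return Path(path).exists()
```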
### Default Output Location
- **Default Path**: `downloads/` folder in the project directory
- **Custom Path**: Specify `outputPath` in `config.json`
- **Image Storage**: `assets/` folder under each collection directory
- **Log Files**: `logs/` folder under the output directory
- **Debug Files**: `debug/` folder under the output directory (saves unparsable page HTML)
## Troubleshooting
### Common Issues
1. **Content download failed**
   - Inspect the HTML files in the `downloads/debug/` directory to analyze the page structure
   - Check your network connection and whether your cookies are still valid
   - Confirm that the article URL opens normally in a browser
2. **Log file is empty**
   - Real-time log flushing was fixed in v2.1
   - If the problem persists, check the write permissions of the output directory
3. **TypeError: cannot unpack non-iterable NoneType object**
   - This issue was fixed in v2.1
   - If you still encounter it, please update to the latest version
4. **Column articles return "This article link is 404" even though the browser opens them normally**
   - v2.1 adds intelligent content detection and more accurate error analysis
   - Check the HTML file in the debug directory for the specific cause
   - You may need to update your cookies, or the page structure may have changed
### Debugging Support
- **Log Files**: `downloads/logs/debug_*.log` contains detailed processing information
- **Debug HTML**: `downloads/debug/debug_*.html` saves unparsable pages
- **Test Scripts**: The `test/` directory contains various functional test scripts
# BUG Feedback
If you encounter any problems during use, please report them in an issue. To make it easier to reproduce and resolve the problem, be sure to attach the error message or the URL involved.
# Suggestions
If you have any suggestions, feel free to open a discussion in the issues.
## Changelog
### v2.2 MCP Support
- 🤖 **MCP Server**: Added `mcp_server.py` to support direct calls by AI Agents
- 📋 **4 Tools**: Provides list_collections / export_collection / get_collection_info / search_collections tools
- 🔌 **Standard Protocol**: Follows the MCP protocol and supports integration with mainstream AI tools such as Claude Code
### v2.1 Enhancements
- 🚀 **Real-Time Logging System**: Supports real-time log refreshing and immediate display of processing progress
- 🛠️ **Robust Error Handling**: Fixed key errors such as TypeError, failure of a single article does not affect overall processing
- 🔍 **Enhanced HTML Parsing**: Supports multiple Zhihu page structures to improve content acquisition success rate
- 🔧 **Debug File Generation**: Automatically saves unparsable page HTML to the `debug/` directory for analysis
- 🔄 **Automatic Collection Acquisition**: Added `fetch_collections.py` independent script to automatically retrieve collection list
- 📊 **Detailed Error Logs**: Records specific error information and stack traces to facilitate problem location
- 🎯 **Intelligent Content Detection**: When the standard CSS selectors fail, an intelligent fallback algorithm automatically detects article content
- 🔍 **Accurate Error Analysis**: Distinguishes between different error types such as 404, login requirements, and permission issues
### v2.0 New Features
- ✨ Batch processing of multiple collections
- ⚙️ Configuration file system (config.json)
- 📂 Custom output path support
- 🖥️ Cross-platform path processing
- ✨ Intelligent file deduplication function
- 📈 Detailed processing log system
- 🔄 Backwards compatible with old configuration files
## Todo
- Optimize crawling speed
- Add more export format support
- GUI interface development
- Third-party large model API support