# mcp_query_table
1. A financial web-table crawler built on `playwright`, with `Model Context Protocol (MCP)` support. Currently available sources:
   - [iWencai](http://iwencai.com/)
   - [Tongdaxin Wenxiao Da](https://wenda.tdx.com.cn/)
   - [Eastmoney Stock Screening](https://xuangu.eastmoney.com/)

   During live trading, if one website goes down or is redesigned, you can immediately switch to another. (Note: table structures differ between websites, so adaptations need to be prepared in advance.)
2. A crawler for calling large language models, also built on `playwright`. Currently available sources:
   - [Nano Search](https://www.n.cn/)
   - [Tencent Yuanbao](https://yuanbao.tencent.com/)
   - [Baidu AI Search](https://chat.baidu.com/)

   `RooCode` provides a `Human Reply` feature, but copying answers from the web version of `Nano Search` was found to disrupt the formatting, so this feature was developed.
## Installation
```commandline
# from PyPI
pip install -i https://pypi.org/simple --upgrade mcp_query_table
# or from the Tsinghua mirror
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade mcp_query_table
```
## Usage
```python
import asyncio

from mcp_query_table import *


async def main() -> None:
    async with BrowserManager(endpoint="http://127.0.0.1:9222", executable_path=None, devtools=True) as bm:
        # Ensure the browser width is > 768 to prevent the interface from adapting to mobile
        page = await bm.get_page()

        df = await query(page, 'Top 200 ETFs by Returns', query_type=QueryType.ETF, max_page=1, site=Site.THS)
        print(df.to_markdown())
        df = await query(page, 'Top 50 Year-to-Date Returns', query_type=QueryType.Fund, max_page=1, site=Site.TDX)
        print(df.to_csv())
        df = await query(page, 'Top 10 Sectors by Market Capitalization', query_type=QueryType.Index, max_page=1, site=Site.TDX)
        print(df.to_csv())
        # TODO Eastmoney pagination requires logging in beforehand
        df = await query(page, 'Top 5 Gaining Concept Sectors Today;', query_type=QueryType.Board, max_page=3, site=Site.EastMoney)
        print(df)

        output = await chat(page, "What is 1+2?", provider=Provider.YuanBao)
        print(output)
        output = await chat(page, "What is 3+4?", provider=Provider.YuanBao, create=True)
        print(output)
        print('done')

        bm.release_page(page)
        await page.wait_for_timeout(2000)


if __name__ == '__main__':
    asyncio.run(main())
```
## Notes
1. `Chrome` is the recommended browser. If you must use `Edge`, then in addition to closing all `Edge` windows, you also need to kill every `Microsoft Edge` process in Task Manager, e.g. with `taskkill /f /im msedge.exe`.
2. Ensure the browser window width is sufficient to prevent some websites from automatically adapting to mobile versions, which may cause table queries to fail.
3. If you have an account on the website, please log in in advance. This tool does not have an automatic login feature.
4. The table structures of different websites vary, and the number of stocks returned under the same conditions may also differ. Adaptation is required after querying.
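Since each site returns different column headers, a common pattern is to rename them to one unified schema right after querying. A minimal sketch with `pandas`; the `COLUMN_MAP` entries below are hypothetical examples, so check them against the headers each site actually returns:

```python
import pandas as pd

# Hypothetical per-site column mappings; verify against the real headers
COLUMN_MAP = {
    "THS": {"代码": "code", "名称": "name", "最新价": "close"},
    "TDX": {"证券代码": "code", "证券名称": "name", "现价": "close"},
}


def normalize(df: pd.DataFrame, site: str) -> pd.DataFrame:
    """Rename site-specific columns to one unified schema."""
    out = df.rename(columns=COLUMN_MAP[site])
    # Keep only the unified columns so downstream code sees one schema
    return out[["code", "name", "close"]]


if __name__ == "__main__":
    df = pd.DataFrame({"证券代码": ["600000"], "证券名称": ["浦发银行"], "现价": [7.5]})
    print(normalize(df, "TDX"))
```

This keeps the per-site differences in one place, so a site redesign only requires updating its mapping.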
## Working Principle
Unlike `requests`, `playwright` is browser-based and simulates user operations in the browser.
1. No need to solve login issues
2. No need to handle request construction or response parsing
3. Can directly obtain table data; what you see is what you get
4. Running speed is slower than `requests`, but development efficiency is higher
Data acquisition methods include:
1. Directly parsing HTML tables
   1. Numeric data comes back as text, which is inconvenient for later research
   2. The most versatile approach
2. Intercepting requests to obtain the returned `json` data
   1. Similar to `requests`: the response still needs to be parsed
   2. Slightly less flexible; after a website redesign the adaptation must be redone
This project sends requests via simulated clicks in the browser and obtains the data by intercepting and parsing the responses.
Going forward, the most suitable method will be chosen for each site as redesigns occur.
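The interception approach can be sketched with plain `playwright`: register a response listener, then let a simulated click (or navigation) trigger the request. The URL fragment `api/data` and the page address below are hypothetical placeholders, not the endpoints `mcp_query_table` actually uses:

```python
import asyncio


def is_target(url: str) -> bool:
    # Hypothetical filter: keep only the JSON endpoint we want to capture
    return "api/data" in url


async def main() -> None:
    # Import kept local so the sketch can be read without playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        captured = []

        async def on_response(response) -> None:
            if is_target(response.url):
                captured.append(await response.json())

        # Register the listener BEFORE the click that triggers the request
        page.on("response", on_response)
        await page.goto("https://example.com/")
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())
```

The key point is that the listener must be attached before the simulated click; otherwise the response of interest may already be gone.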
## Headless Mode
Headless mode runs faster, but some websites require prior login. Therefore, it is essential to specify `user_data_dir` in headless mode; otherwise, you may encounter login prompts.
- When `endpoint=None`, `headless=True` allows you to launch a new browser instance in headless mode. You must specify `executable_path` and `user_data_dir` to ensure proper operation in headless mode.
- If `endpoint` starts with `http://`, it connects to a headful browser running in `CDP` (Chrome DevTools Protocol) mode; that browser must have been launched with `--remote-debugging-port`. `executable_path` is the local browser path.
- If `endpoint` starts with `ws://`, it connects to a remote `Playwright Server`. This is also in headless mode, but you cannot specify `user_data_dir`, so usage is limited.
- Reference: https://playwright.dev/python/docs/docker#running-the-playwright-server
Newer versions of `Chrome` enforce a security policy that prevents creating a `CDP` service with the default `user_data_dir`. It is recommended to copy the profile directory to another location.
## MCP Support
Make sure you can execute `python -m mcp_query_table -h` in the console. If not, you may need to run `pip install mcp_query_table` first.
In `Cline`, you can configure as follows. The `command` is the absolute path of `python`, and `timeout` is the timeout duration in seconds. Since the response time on various AI platforms often exceeds 1 minute, a larger timeout duration needs to be set.
### STDIO Method
```json
{
  "mcpServers": {
    "mcp_query_table": {
      "timeout": 300,
      "command": "D:\\Users\\Kan\\miniconda3\\envs\\py312\\python.exe",
      "args": [
        "-m",
        "mcp_query_table",
        "--format",
        "markdown",
        "--endpoint",
        "http://127.0.0.1:9222",
        "--executable_path",
        "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe"
      ]
    }
  }
}
```
### SSE Method
First, execute the following command in the console to start the `MCP` service:
```commandline
python -m mcp_query_table --format markdown --transport sse --port 8000 --endpoint http://127.0.0.1:9222 --user_data_dir "D:\user-data-dir"
```
Then you can connect to the `MCP` service:
```json
{
  "mcpServers": {
    "mcp_query_table": {
      "timeout": 300,
      "url": "http://127.0.0.1:8000/sse"
    }
  }
}
```
### Streamable HTTP Method
```commandline
python -m mcp_query_table --format markdown --transport streamable-http --port 8000 --endpoint http://127.0.0.1:9222 --user_data_dir "D:\user-data-dir"
```
The connection address is `http://127.0.0.1:8000/mcp`.
## Debugging with `MCP Inspector`
```commandline
npx @modelcontextprotocol/inspector python -m mcp_query_table --format markdown --endpoint http://127.0.0.1:9222
```
Opening the browser and paging through results can be time-consuming, which may cause the `MCP Inspector` request to time out. You can raise the timeout to 300 seconds by opening `http://localhost:5173/?timeout=300000`.
When attempting to write an `MCP` project for the first time, you may encounter various issues. Feel free to reach out for discussions.
## `MCP` Usage Tips
1. For the query "rank the top 100 stocks with the highest gains in 2024 by total market capitalization as of December 31, 2024", the three websites return different results:
- Tonghuashun: Displays 2201 stocks. The top 5 are Industrial and Commercial Bank of China, Agricultural Bank of China, China Mobile, China Petroleum, and China Construction Bank.
- Tongdaxin: Displays 100 stocks, with the top 5 being Cambricon, Zhengdan Co., Huijin Technology, Wanfeng Aowei, and Airun Software.
- Eastmoney: Displays 100 stocks, with the top 5 being Haiguang Information, Cambricon, Guangqi Technology, Runze Technology, and New Yi Sheng.
2. Large language models are weak at problem decomposition, so phrase questions carefully to keep the query conditions intact. Of the following phrasings, the 2nd and 3rd are recommended:
- Rank the top 100 stocks with the highest increase in 2024 by total market capitalization as of December 31, 2024.
> Large language models are very likely to decompose this sentence, resulting in a single query being split into multiple queries.
- Query Eastmoney for "the top 100 stocks with the highest increase in 2024 ranked by total market capitalization as of December 31, 2024."
> Enclosing in quotes helps avoid decomposition.
- Query the Eastmoney sector for "the worst-performing industry sectors last year," then query the top 5 stocks that performed best in that sector last year.
> This is split into two queries: first querying the sector, then querying the stocks. However, it's best not to automate this completely, as the model may not understand "today's increase" and "interval increase" in the first step, requiring interactive adjustments.
## Support for `Streamlit`
Query financial data and manually feed it to an `AI` for in-depth analysis, all on the same page. See the `README.md` in the `streamlit` directory.

## References
- [Selenium webdriver cannot attach to an Edge instance; Edge's --remote-debugging-port option is ignored](https://blog.csdn.net/qq_30576521/article/details/142370538)
- [AtuboDad/playwright_stealth issue #31](https://github.com/AtuboDad/playwright_stealth/issues/31)
- [browser-use/browser-use issue #1520](https://github.com/browser-use/browser-use/issues/1520)