Content

<img src="docs/favicon.svg" width="96" alt="Clawd Cursor"> <h1 align="center">Clawd Cursor</h1> A cursor and a keyboard for any AI agent on a real desktop. Any model. Any app. One MCP entry. Local-only. <a href="LICENSE"><img src="https://img.shields.io/github/license/AmrDab/clawdcursor?color=a855f7" alt="MIT license"></a> <a href="https://github.com/AmrDab/clawdcursor/releases/latest"><img src="https://img.shields.io/github/v/release/AmrDab/clawdcursor?color=22c55e&label=release" alt="Latest release"></a> <img src="https://img.shields.io/badge/node-%E2%89%A520-339933?logo=nodedotjs&logoColor=white" alt="Node 20+"> <img src="https://img.shields.io/badge/platform-win%20%7C%20mac%20%7C%20linux-0ea5e9" alt="Cross-platform"> <a href="https://github.com/AmrDab/clawdcursor/actions/workflows/cross-platform.yml"><img src="https://img.shields.io/github/actions/workflow/status/AmrDab/clawdcursor/cross-platform.yml?branch=main&label=tests" alt="Tests"></a> <a href="https://github.com/AmrDab/clawdcursor/actions/workflows/codeql.yml"><img src="https://img.shields.io/github/actions/workflow/status/AmrDab/clawdcursor/codeql.yml?branch=main&label=codeql" alt="CodeQL"></a> <a href="https://discord.gg/hW29nrEZ8G"><img src="https://img.shields.io/badge/Discord-join-5865F2?logo=discord&logoColor=white" alt="Discord"></a> <a href="#quickstart">Quickstart</a> · <a href="#why-clawd-cursor">Why</a> · <a href="#how-it-thinks">How it thinks</a> · <a href="#tool-surface">Tools</a> · <a href="#platform-support">Platforms</a> · <a href="CHANGELOG.md">Changelog</a> · <a href="SKILL.md">SKILL.md (AI-facing manual)</a> --- > **AI agents looking for the machine-readable manual: open [`SKILL.md`](SKILL.md).** This README is the human pitch; SKILL.md is the dense second-person doc written for an LLM. Clawd Cursor is a **skill**, not an app. Install it once. Any tool-calling agent on the machine — Claude Code, Cursor, Windsurf, OpenClaw, Claude Agent SDK, your own loop — picks up the tools through MCP. The agent then clicks, types, reads the screen, opens apps, and drives any GUI the same way a human would. > **If a human can do it on a screen, your AI can do it too.** No API? No integration? No problem. > > **No task is impossible.** GUI plus a mouse plus a keyboard equals everything you need. There is no "I can't do that in this app" — only the right sequence of reads, clicks, keys, and waits. Clawd Cursor gives you all of them. It's **model-agnostic** (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), **app-agnostic** (drives any window via accessibility, OCR, or vision fallback), and **OS-agnostic** (one `PlatformAdapter` covers Windows, macOS, Linux X11, and Linux Wayland). --- ## Quickstart Sixty seconds from zero to a tool-calling agent on your desktop. **Pick your mode first:** | Your situation | Use | Why | |---|---|---| | AI lives in your editor (Claude Code, Cursor, Windsurf, Zed) | **`clawdcursor mcp`** | stdio MCP server. Editor spawns it on demand. No daemon, no port. | | You're building an agent that runs unattended | **`clawdcursor agent`** | HTTP MCP daemon on `127.0.0.1:3847`. Has its own LLM brain optionally configured via `doctor`. | | Your agent has its own brain — you just want the tools as an HTTP endpoint | **`clawdcursor agent --no-llm`** | Same daemon, no built-in pipeline, no scheduler startup, no credential validation. Pure tool surface. | **Windows (PowerShell):** ```powershell irm https://clawdcursor.com/install.ps1 | iex ``` **macOS / Linux:** ```bash curl -fsSL https://clawdcursor.com/install.sh | bash ``` Then verify and configure: ```bash clawdcursor --version # smoke-test the install clawdcursor consent --accept # one-time desktop-control consent (required) clawdcursor status # cross-check permissions + AI config clawdcursor doctor # (optional) configure an LLM provider end-to-end clawdcursor agent # OR `clawdcursor mcp` — see the table above ``` The installer clones into `~/clawdcursor`, runs `npm install`, builds, and `npm link`s a global shim. Runtime state lives at `~/.clawdcursor/` (auth token, pidfiles, logs). It does **not** edit any agent host config — that step is below. Wire it into Claude Code, Cursor, Windsurf, or Zed: ```jsonc // ~/.claude/settings.json (or your editor's MCP config) { "mcpServers": { "clawdcursor": { "command": "clawdcursor", "args": ["mcp", "--compact"] } } } ``` That's it. Ask your agent to *"open Outlook and reply to the latest email from Sarah"* and watch it run. > **macOS:** run `clawdcursor grant` to walk through Accessibility + Screen Recording permissions. > **Linux:** install `tesseract-ocr`, `python3-gi`, `gir1.2-atspi-2.0`, and (Wayland only) `ydotool` or `wtype`. --- ## Why Clawd Cursor - **Works where APIs don't exist.** Native apps. Legacy enterprise tools. Web portals behind SSO that block headless browsers. Anything inside Citrix or RDP. If pixels reach the screen, your agent can drive it. - **Model-agnostic.** Claude, GPT, Gemini, Llama, Kimi, anything local via Ollama — any tool-calling LLM. Text and vision can be different models from different vendors. - **App-agnostic.** No per-app plugins, no per-service auth. The same six compound tools drive Outlook, Figma, your bank, and that 2003-era ERP. - **Cheapest-tier-first pipeline** (for natural-language tasks). When you hand a whole instruction to the autonomous agent via `task({...})` / `submit_task`, the planner starts at accessibility (free), escalates to OCR (cheap), then screenshot (medium), then vision (expensive). The Reflector feeds verifier signals back so the loop doesn't keep paying for vision when text would work. **Direct tool calls** (when your own agent loop drives the compound or granular tools) skip the pipeline and go straight to the requested tool — you bring the planning, clawdcursor brings the primitives. - **Local-only by default.** Server binds to `127.0.0.1`. Screenshots stay in RAM unless you point a cloud model at them. No telemetry. - **One protocol, two transports.** MCP over stdio for editor hosts; MCP over HTTP for daemons. Same tool catalog, same JSON-RPC envelope. ### When NOT to use it Clawd Cursor is GUI control. It's slower than an API, less reliable than a script, and burns more tokens than a direct file edit. If a better path exists, take it: | Better option | When | |---|---| | Native API (Gmail API, GitHub API, Stripe API, …) | The service has one. Use it. | | CLI (`git`, `gh`, `aws`, `npm`, `curl`, `sqlite3`) | The work fits a shell tool. Use it. | | Direct file edit | The data lives in a file you can write. Edit it. | | Browser automation already wired up (Playwright, Puppeteer) for this exact site | Faster, more deterministic. Use it. | **Reach for Clawd Cursor when none of those apply** — the legacy ERP with no REST, the Electron app whose dialog can't be scripted, the Excel macro behind a Citrix session. The pipeline pays the cheap costs first; if structured paths work it never escalates to vision. But the design point is the **last mile**, not the first call. (SKILL.md enforces this as a hard 4-gate rule for AI agents calling the tool surface. Humans get the softer table above.) --- ## How It Thinks Two execution paths exist, and the difference matters: - **Direct tool calls** (stdio MCP from editor hosts, or HTTP MCP `tools/call` for a single tool) dispatch straight to the tool implementation. They are gated only by `safety.evaluate()` — the tier check and blocked-combo enforcement. There is no router, no verifier, no escalation. Your agent already knows what it wants to do; clawdcursor just runs it. - **Autonomous tasks** (via `task({instruction})` / `submit_task` against the daemon) flow through the ladder below. The pipeline picks the cheapest rung that works and only escalates when the ground-truth verifier disagrees with the planner's claim of success. The diagram below describes the **autonomous-task path only.** Direct calls go through the `safety.evaluate()` → `tool` segment and stop there. ```mermaid flowchart LR user["User task"] --> pre["Preprocessor (strategy + subtasks)"] pre --> router["Router (regex shortcuts, zero LLM)"] router -- match --> tool["safety.evaluate() → tool"] router -- miss --> blind["Blind (a11y tree only)"] blind --> tool blind -- sparse a11y / stagnation --> hybrid["Hybrid (a11y + screenshot on demand)"] hybrid --> tool hybrid -- still stuck --> vision["Vision (screenshot every turn)"] vision --> tool tool --> verifier{"Ground-truth verifier"} verifier -- pass --> done["done"] verifier -- fail --> reflector["Reflector (structured cause + suggested strategy)"] reflector -. feedback .-> pre reflector -. hint .-> blind reflector -. hint .-> hybrid reflector -. hint .-> vision classDef rung fill:#0ea5e9,stroke:#0369a1,color:#fff; classDef gate fill:#a855f7,stroke:#6b21a8,color:#fff; classDef refl fill:#eab308,stroke:#854d0e,color:#000; class router,blind,hybrid,vision rung; class tool,verifier gate; class reflector refl; ``` **Single safety chokepoint.** Every tool call — direct OR autonomous — routes through `safety.evaluate()`. The MCP server invokes it before dispatch; the granular and compound tool tables share the same gate. This is the one decision layer that genuinely is universal: there is no path for a tool to execute that skips it. **Ground-truth verification.** When the agent claims a task is done, six independent signals are checked against the post-task screen: pixel diff, window-state change, focus change, OCR delta, task-type assertions (`send_email`, `navigate_url`, `open_app`, …), and anti-pattern detection (error dialogs, auth failures, "draft saved"). Weighted voting with hard-fail rules. No LLM self-report. **Reflector loop.** On a verifier fail, the Reflector emits a structured `Cause` (e.g. `wrong_window_focused`, `modal_intercept`, `a11y_target_missing`, `webview_blind`) plus a suggested next strategy. The pipeline ladder consumes that signal to override its default escalation, and a one-line hint is injected as a synthetic `tool_result` so the planner understands *why* it's escalating. **Runaway guard.** Three identical calls in six turns and the loop exits with a targeted diagnostic — usually pointing at `detect_webview` when the target is Electron or WebView2 with a sparse accessibility tree. --- ## Transports One protocol — **MCP** — two transports. Same catalog, same JSON-RPC envelope. | Transport | When to use | Client config | |---|---|---| | **stdio MCP** | Editor hosts: Claude Code, Cursor, Windsurf, Zed. Tools appear on demand — no daemon. | `{"command": "clawdcursor", "args": ["mcp", "--compact"]}` | | **HTTP MCP** | Bring-your-own-agent, headless daemons, multi-process orchestration, Claude Agent SDK. POST JSON-RPC to `http://127.0.0.1:3847/mcp`. | Run `clawdcursor agent`. Then `tools/list` returns the catalog and `tools/call` invokes any tool. Bearer token at `~/.clawdcursor/token`. | Both transports are stateless. No session-init handshake. Bearer-token auth on every HTTP request; stdio inherits the parent process's trust. ```bash # HTTP MCP — list tools curl -s -X POST http://127.0.0.1:3847/mcp \ -H "Authorization: Bearer $(cat ~/.clawdcursor/token)" \ -H "Content-Type: application/json" \ -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' ``` --- ## Tool Surface Two catalogs, side by side. Agents pick the shape that fits. ### Compact — 6 compound tools (recommended) Anthropic `computer_20250124`-style: one tool per capability, an `action` enum for the verb. The compact catalog is roughly an order of magnitude smaller than the granular surface, which keeps small models (Haiku, Kimi, Ollama) focused on the action choice instead of drowning in primitives. Default for every agent that doesn't explicitly need one schema per primitive. | Tool | Most-used actions | |---|---| | `computer` | `screenshot`, `click`, `double_click`, `right_click`, `triple_click`, `hover`, `scroll`, `scroll_horizontal`, `drag`, `drag_path`, `type`, `key`, `wait` | | `accessibility` | `read_tree`, `find`, `get_element`, `focused`, `invoke`, `focus`, `set_value`, `get_value`, `expand`, `collapse`, `toggle`, `select`, `state`, `list_children`, `wait_for` | | `window` | `list`, `active`, `focus`, `maximize`, `minimize`, `restore`, `close`, `resize`, `list_displays`, `screen_size`, `open_app`, `open_file`, `open_url`, `switch_tab`, `navigate` | | `system` | `clipboard_read`, `clipboard_write`, `system_time`, `ocr`, `undo`, `shortcuts_list`, `shortcuts_run`, `delegate`, `detect_webview`, `relaunch_with_cdp`, `app_guide`, `detect_app`, `classify_task`, `system_prompt` | | `browser` | `connect`, `page_context`, `read_text`, `click`, `type`, `select_option`, `evaluate`, `wait_for`, `list_tabs`, `switch_tab`, `scroll` | | `task` | `{instruction: string}` — hand off the whole task to the pipeline. No `action` enum. | ### Granular — 97 individual tools One schema per verb. Use this when your runtime requires every primitive as a top-level tool. The full catalog is visible through MCP `tools/list` on either transport. A typical turn: ```js // Compact — recommended computer({ action: "key", combo: "mod+s" }) // resolves to Cmd+S / Ctrl+S accessibility({ action: "invoke", name: "Send" }) window({ action: "open_app", name: "Outlook" }) system({ action: "ocr" }) // OS-level OCR, no LLM vision task({ instruction: "open Notepad and type hello" }) // full pipeline ``` --- ## Guides Marketplace For unfamiliar apps, the agent reasons from screenshots and the a11y tree — slow but always works. For popular apps, **community-curated guides** ship the keyboard shortcuts, workflow patterns, layout cues, and failure modes the agent would otherwise have to discover by failing first. Loading a guide for an app it knows speeds operation 5–10×. - **Public registry + source repo: <https://github.com/AmrDab/clawdcursor-guides>** — community PRs welcome - **Verified seed guides:** discord, excel, figma, gmail, mspaint, olk (new Outlook), outlook, slack, spotify, youtube - **Bundled core (offline fallback):** msedge, notepad Guides are fetched on demand, cached locally for 7 days, LRU-evicted at 50 entries. The cache lives at `~/.clawdcursor/guide-cache/`. The agent never blocks on the network — if a guide isn't local and the registry is unreachable, it falls back to first-principles reasoning. ```bash clawdcursor guides available # browse the public registry clawdcursor guides install youtube # pre-warm cache for one app clawdcursor guides list # show cached + ratings clawdcursor guides info youtube # details for one cached guide clawdcursor guides refresh youtube # force re-fetch clawdcursor guides submit my-app.json # lint + print PR instructions ``` Every guide passes through a client-side linter on every load — schema check + prompt-injection patterns + dangerous-prose detection. A guide that fails lint is dropped and the agent falls back to no-knowledge, never poisoned-knowledge. Same linter runs as the registry's CI check on every PR. Voting: each guide has a `vote: <app>` issue on the source repo. React 👍 / 👎. A nightly job aggregates reactions into `index.json` so `clawdcursor guides list` shows ratings. See [`docs/guide-marketplace.md`](docs/guide-marketplace.md) for the full architecture, trust model, and CI flow. --- ## Cost Tiers The pipeline picks the cheapest rung that works. Apply the same logic when you call compound tools by hand. | Tier | Label | Cost | Source | When to use | |---|---|---|---|---| | **T1** | structured | ~free | `accessibility.*`, `window.*`, `browser.read_text`, clipboard | Default. Returns text + bounds — no image, no vision LLM. | | **T2** | ocr | cheap | `system({"action":"ocr"})` | A11y tree empty or sparse. OS-level OCR — text out, no LLM vision. | | **T3** | screenshot | medium | `computer({"action":"screenshot"})` | OCR isn't enough and you need pixel context. Sends an image into LLM context. | | **T4** | vision | expensive | `smart_click`, `smart_read`, `smart_type` | Canvas-only apps (Paint, Figma, games) or spatial reasoning that text can't express. Last resort. | **Rule: start at T1. Escalate only when the current tier fails.** `task({...})` does this automatically; the Reflector tells the planner *which* tier to jump to. --- ## Platform Support Platform-specific code lives in `src/platform/{windows,macos,linux}.ts` (plus `wayland-backend.ts`) behind a single `PlatformAdapter` interface. Business logic never reads `process.platform`. | Platform | UI Automation | OCR | Browser (CDP) | Input | |---|---|---|---|---| | **Windows** 10/11 (x64 / ARM64) | UIA via PowerShell bridge | `Windows.Media.Ocr` | Chrome / Edge | nut-js | | **macOS** 12+ (Intel / Apple Silicon) | JXA + System Events (TCC-safe) | Apple Vision | Chrome / Edge | nut-js + System Events | | **Linux** X11 | AT-SPI via `python3-gi` | Tesseract | Chrome / Edge | nut-js | | **Linux** Wayland | AT-SPI via `python3-gi` | Tesseract | Chrome / Edge | `ydotool` / `wtype` | Per-OS setup notes: - **Windows** — no setup. PowerShell bridge spawns on demand. - **macOS** — first run needs Accessibility + Screen Recording in `System Settings > Privacy & Security`. `clawdcursor grant` walks the dialogs. Retina / HiDPI handled in the adapter; **do not pre-scale coordinates**. - **Linux X11** — `apt install tesseract-ocr python3-gi gir1.2-atspi-2.0` (or your distro's equivalent). - **Linux Wayland** — same a11y packages, plus `ydotool` + a running `ydotoold` daemon (preferred) or `wtype` (keyboard only). --- ## Architecture Five directories. Everything else is a leaf module. | Directory | What lives here | |---|---| | `src/core/` | Pipeline orchestrator, agent loop, router, preprocessor, sense (a11y/snapshot/fingerprint), classify, decompose, skills cache, safety gate, ground-truth verifier, Reflector. | | `src/tools/` | The 97 granular tools + 6 compound aggregators, playbooks (`compose-send`, `find-replace`), tool registry, dispatch. | | `src/platform/` | `PlatformAdapter` interface + Windows / macOS / Linux / Wayland implementations, OCR engine, CDP driver, URI handler. | | `src/llm/` | Provider clients (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), credentials, model config, guide loader. | | `src/surface/` | CLI (`clawdcursor`), MCP server (stdio + HTTP), dashboard, doctor, onboarding, readiness probes. | The `PlatformAdapter` is the only thing platform code talks to. The `safety.evaluate()` chokepoint is the only way tools execute. Those two seams are the whole point of the v0.9 reorganization. --- ## Safety & Privacy `safety.evaluate()` returns one of three decisions per call. Tools also carry a tier category (`read` / `input` / `destructive` / `system`) that the evaluator uses to make its decision — categorization on the left, runtime outcome on the right. | Tool tier | Examples | Decision | |---|---|---| | `read` | `accessibility.read_tree`, `window.list`, `system.clipboard_read`, `system.ocr`, `computer.screenshot` | **Allow** — executes immediately | | `input` | `computer.type`, `computer.click`, `accessibility.invoke`, `window.open_app` | **Allow** — executes immediately; sensitive-app heuristic can elevate to Confirm | | `destructive` | Sends, deletes, purchases, transfers, dangerous key combos (`mod+w`, `mod+q`, …) | **Confirm** — pauses for user approval before executing | | `system` (blocked subset) | `Alt+F4` / `Cmd+Q` of the agent shell, `Ctrl+Alt+Delete`, `Shift+Delete`, power chords | **Block** — refused outright with a Tier-named error | Hardening summary: - **Network isolation.** Server binds to `127.0.0.1`. Verify with `netstat -an | findstr 3847` (Windows) or `| grep 3847` (Unix). - **Bearer-token auth.** Every HTTP request needs `Authorization: Bearer $(cat ~/.clawdcursor/token)`. - **Sensitive-app policy.** Email, banking, password managers, private messaging auto-elevate to Confirm. The agent must ask the user before acting on these surfaces. - **No telemetry.** Screenshots stay in RAM. With Ollama or any local model, nothing leaves the machine. With a cloud provider, screenshots go only to the endpoint you configured. - **Prompt-injection defense.** Screen text returned inside `<untrusted-screen-content>` tags is treated as data, never as instructions. - **Log privacy.** JSON logs at `~/.clawdcursor/logs/` redact password-field values (`AXSecureTextField`, UIA `IsPassword=true`). See [SECURITY.md](SECURITY.md) for the private vulnerability reporting channel. --- ## CLI The CLI is for humans diagnosing an install or managing the guide cache. Agents should connect via MCP (stdio for editor hosts, HTTP for daemons). ``` # Install + setup clawdcursor consent Manage desktop-control consent (--accept / --revoke / --status) clawdcursor grant Grant macOS permissions (interactive, macOS only) clawdcursor doctor Verify permissions, configure AI provider + models clawdcursor status Readiness check (consent, permissions, AI config) # Run clawdcursor mcp MCP stdio server — primary transport for editor hosts clawdcursor agent Daemon: HTTP MCP at /mcp on :3847, optional built-in LLM clawdcursor agent --no-llm Daemon, tool surface only (no built-in brain/scheduler) clawdcursor stop Stop every running mode clawdcursor uninstall Remove all clawdcursor config and data clawdcursor report File a bug / feature report (interactive) # Guides marketplace (see Guides Marketplace section above) clawdcursor guides list What's cached + ratings clawdcursor guides info <app> Cache metadata for one app clawdcursor guides available Browse the public registry clawdcursor guides install <app> Pre-warm one (or --all for offline prep) clawdcursor guides refresh <app> Force re-fetch clawdcursor guides remove <app> Evict from cache clawdcursor guides clean Wipe cache clawdcursor guides lint <file> Validate a local guide clawdcursor guides submit <file> Lint + print PR instructions # Manual end-to-end testing only — agents should call submit_task via MCP. clawdcursor task <t> Send a task to the running agent Options: --port <port> Default: 3847 --compact MCP only: expose 6 compound tools instead of 97 granular --provider <name> `agent` only: anthropic | openai | gemini | ollama | ... --accept `agent` and `consent` only: skip the consent prompt ``` --- ## Development ```bash git clone https://github.com/AmrDab/clawdcursor.git cd clawdcursor npm install npm run build # tsc + postbuild npm test # vitest npm run lint # eslint npm run typecheck # tsc --noEmit npm link # global `clawdcursor` shim (Unix) — use Admin shell on Windows ``` The build emits `dist/`. Entry point: `dist/surface/cli.js`. Tests run on Node 20 and 22 against Ubuntu, macOS, and Windows in CI. --- ## Tech Stack TypeScript · Node.js 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander --- ## Contributing PRs welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for the development loop, branch conventions, and the test matrix every change has to clear. Bug reports and feature requests go in [issues](https://github.com/AmrDab/clawdcursor/issues); private security reports go to the channel listed in [SECURITY.md](SECURITY.md). ## License MIT — see [LICENSE](LICENSE). ## Acknowledgments Built on the shoulders of the Model Context Protocol SDK, nut-js, Playwright, the Anthropic `computer_20250124` tool shape, and the AT-SPI / UIA / AX trees that make app-agnostic GUI automation possible at all. --- <a href="https://clawdcursor.com">clawdcursor.com</a> · <a href="https://discord.gg/hW29nrEZ8G">Discord</a> · <a href="CHANGELOG.md">Changelog</a>

clawdcursor

Content

MCP Config

Connection Info

You Might Also Like

everything-claude-code

markitdown

servers

servers

cc-switch

Time

clawdcursor

Scan with WeChat to Share

Authentication Required

Content

MCP Config

Connection Info

You Might Also Like

everything-claude-code

markitdown

servers

servers

cc-switch

Time