MCP Hub
Back to servers

computeruseprotocol

Computer Use Protocol is a universal schema for AI agents to perceive and interact with any desktop UI.

GitHub
Stars
5
Updated
Feb 27, 2026
Validated
Mar 1, 2026

Computer Use Protocol

A universal protocol for AI agents to perceive and interact with any desktop UI.

npm version PyPI version MIT License Website


Computer Use Protocol is a universal schema for representing UI accessibility trees, one format that works identically across Windows, macOS, Linux, Web, Android, and iOS. It includes a compact text encoding optimized for LLM context windows (~97% smaller than JSON), making it ideal for AI agents that need to perceive and act on desktop UIs. This repository is that core: the JSON schema, the compact text format, the cross-platform role/state/action mappings, and documentation.

CUP also provides SDKs for capturing and interacting with native UI trees, and MCP servers for exposing those capabilities directly to AI agents like Claude and Copilot.

Why CUP?

Every platform exposes UI accessibility differently. Windows uses UIA with ~40 ControlTypes, macOS has AXUIElement with its own role system, Linux uses AT-SPI2 with 100+ roles, and the web has ~80 ARIA roles.

Today, every agent framework reinvents this translation layer independently. CUP solves it once at the representation level:

  • One format everywhere - write agent logic once, run it on any platform
  • Built for LLMs - compact encoding fits complex UIs into context windows at ~15x fewer tokens than the next closest format
  • Built for actions - 15 canonical verbs that map to native platform APIs
  • No information loss - raw native properties preserved via node.platform.*

Schema

CUP defines a JSON envelope format built on ARIA-derived roles:

{
    "version": "0.1.0",
    "platform": "windows",
    "timestamp": 1740067200000,
    "screen": { "w": 2560, "h": 1440, "scale": 1.0 },
    "app": { "name": "Spotify", "pid": 1234 },
    "tree": [
        {
            "id": "e0",
            "role": "window",
            "name": "Spotify",
            "bounds": { "x": 120, "y": 40, "w": 1680, "h": 1020 },
            "states": ["focused"],
            "actions": ["click"],
            "children": [ ... ]
        }
    ]
}

CUP compact format (~97% token reduction, heavily optimized for CUA/LLMs):

[e0] win "Spotify" 120,40 1680x1020
  [e1] doc "Spotify" 120,40 1680x1020
    [e2] btn "Back" 132,52 32x32 [clk]
    [e3] btn "Forward" 170,52 32x32 {dis} [clk]
    [e7] nav "Main" 120,88 240x972
      [e8] lnk "Home" 132,100 216x40 {sel} [clk]

Full schema: schema/cup.schema.json | Compact format spec: schema/compact.md

Benchmark

Captured UI on a text heavy wikipedia page, using every format.

CUP Format Benchmark

Roles

59 ARIA-derived roles:

alert alertdialog application banner button cell checkbox columnheader combobox complementary contentinfo dialog document form generic grid group heading img link list listitem log main marquee menu menubar menuitem menuitemcheckbox menuitemradio navigation none option progressbar radio region row rowheader scrollbar search searchbox separator slider spinbutton status switch tab table tablist tabpanel text textbox timer titlebar toolbar tooltip tree treeitem window

Role mappings: schema/mappings.json

States

16 state flags (only truthy/active states are listed, absence = default):

busy checked collapsed disabled editable expanded focused hidden mixed modal multiselectable offscreen pressed readonly required selected

Actions

The protocol defines 15 canonical action verbs, the vocabulary for what an agent can do with an element. The protocol specifies the names and semantics; SDKs provide the actual execution against native platform APIs.

ActionParametersDescription
click-Click/invoke the element
collapse-Collapse an expanded element
decrement-Decrement a slider/spinbutton
dismiss-Dismiss a dialog/popup
doubleclick-Double-click
expand-Expand a collapsed element
focus-Move keyboard focus to the element
increment-Increment a slider/spinbutton
longpress-Long-press (touch/mobile interaction)
rightclick-Right-click (context menu)
scrolldirection: strScroll container (up/down/left/right)
select-Select an item in a list/tree/tab
setvaluevalue: strSet element value programmatically
toggle-Toggle checkbox or switch
typevalue: strType text into a field

Session-level actions (not element-scoped):

ActionParametersDescription
press_keyskeys: strSend a keyboard shortcut
waitms: intWait/delay between actions in a batch

SDKs

SDKs implement the protocol. They capture native accessibility trees, normalize them into CUP format, execute actions, and optionally expose everything through MCP servers for AI agent integration.

LanguageRepositoryPackage
Pythonpython-sdkpip install computeruseprotocol
TypeScripttypescript-sdknpm install computeruseprotocol

Building your own SDK? All you need is this spec. Implement tree capture for your target platform, normalize into the CUP schema, and you're compatible with every tool in the ecosystem.

Documentation

Contributing

CUP is in early development (v0.1.0). Contributions to the specification are welcome, especially:

  • New role or action proposals with cross-platform mapping rationale
  • Platform mapping improvements in schema/mappings.json
  • Schema documentation and examples

For SDK contributions (bug fixes, new platform adapters, etc.), see the language-specific repos above.

See CONTRIBUTING.md for guidelines.

License

MIT

Reviews

No reviews yet

Sign in to write a review