Desktop Automation

A Windows automation server that uses UI Automation APIs for window management, element interaction, and input control, at a much lower token cost than vision-based methods.

Stars: 1
Tools: 15
Updated: Jan 3, 2026
Validated: Jan 11, 2026

Desktop MCP Server

A Windows desktop automation MCP (Model Context Protocol) server that uses UI Automation instead of screenshots for efficient token usage.

Features

  • UI Automation-based - Reads actual UI element text instead of screenshots (~50-100x more token efficient)
  • Smart element finding - Find and click elements by text content
  • Window management - List, focus, and interact with windows
  • Full input control - Mouse, keyboard, scrolling, hotkeys

Installation

pip install -r requirements.txt

Usage

With Gemini CLI

Add to ~/.gemini/settings.json, replacing the path below with the location of desktop_server.py on your machine:

{
  "mcpServers": {
    "desktop-controller": {
      "command": "python",
      "args": ["C:\\Users\\Derek\\Documents\\github\\desktop-mcp-server\\desktop_server.py"]
    }
  }
}
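
The entry above can also be generated rather than typed by hand, which avoids mistakes when escaping Windows backslashes. The following is a minimal sketch, not part of the project: the helper name `add_desktop_server` and the example path are hypothetical, and it only prints the merged settings instead of writing to ~/.gemini/settings.json.

```python
import json


def add_desktop_server(settings: dict, script_path: str,
                       name: str = "desktop-controller") -> dict:
    """Return a copy of `settings` with the server entry added or replaced."""
    merged = dict(settings)
    servers = dict(merged.get("mcpServers", {}))
    # Same shape as the entry shown above: a command plus its argument list.
    servers[name] = {"command": "python", "args": [script_path]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical install location; substitute the real path to desktop_server.py.
example = add_desktop_server({}, r"C:\path\to\desktop_server.py")

# json.dumps doubles the backslashes, as JSON requires.
print(json.dumps(example, indent=2))
```

Any settings already present in the file are preserved, because the helper copies the existing dictionary and only touches the `mcpServers` key.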

Standalone

python desktop_server.py

Available Tools

UI Reading (Use these first!)

| Tool | Description |
|---|---|
| get_window_text_content() | Gets all readable text from active window with clickable coordinates |
| get_active_window() | Get info about the focused window |
| list_all_windows() | List all open windows |
| find_element(text) | Search for UI elements by text |

Smart Interaction

| Tool | Description |
|---|---|
| click_element(text) | Click an element by its text (recommended over coordinates) |
| focus_window(title) | Bring a window to foreground by title |
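
The recommended order (read the UI first, then click by text) can be illustrated with the JSON-RPC messages an MCP client would send. This is a sketch under assumptions: the `tools/call` request shape follows the MCP specification, but the argument name `text` is inferred from the signatures in the tables, "Save" is a hypothetical button label, and a real client must complete the MCP initialization handshake before sending these requests.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids


def tool_call(name: str, **arguments) -> dict:
    """Build an MCP `tools/call` JSON-RPC request for the named tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Read the UI first, then act on an element by its visible text.
# Argument names are assumed from the tool signatures listed above.
workflow = [
    tool_call("get_window_text_content"),
    tool_call("find_element", text="Save"),
    tool_call("click_element", text="Save"),
]

for request in workflow:
    print(json.dumps(request))
```

In practice an MCP client such as Gemini CLI builds these messages itself; the sketch only shows why no coordinates need to be guessed when an element can be addressed by its text.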

Mouse Control

| Tool | Description |
|---|---|
| click_mouse(x, y) | Click at coordinates |
| double_click(x, y) | Double-click |
| move_mouse(x, y) | Move cursor |
| drag_mouse(x, y) | Drag to position |
| scroll(clicks) | Scroll mouse wheel |

Keyboard Control

| Tool | Description |
|---|---|
| type_text(text) | Type a string |
| press_key(key) | Press a single key |
| hotkey(keys) | Press key combination |

Fallback

| Tool | Description |
|---|---|
| take_screenshot_region(x, y, w, h) | Screenshot a small region (for images/games only) |
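
To keep a fallback screenshot small, the region can be computed around a point of interest before the tool is called. The helper below is a hypothetical illustration, not part of the server: `region_around`, the 200-pixel default, and the 1920x1080 screen size are all assumptions, and the result is simply the `(x, y, w, h)` tuple that `take_screenshot_region` appears to expect.

```python
def region_around(cx: int, cy: int, size: int = 200,
                  screen_w: int = 1920, screen_h: int = 1080) -> tuple:
    """Return (x, y, w, h) for a square region centred on (cx, cy), kept on screen."""
    w = min(size, screen_w)
    h = min(size, screen_h)
    # Shift the region back inside the screen if it would cross an edge.
    x = min(max(cx - w // 2, 0), screen_w - w)
    y = min(max(cy - h // 2, 0), screen_h - h)
    return x, y, w, h


print(region_around(960, 540))  # centre of an assumed 1920x1080 screen
```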

Why UI Automation?

| Screenshot-based | UI Automation |
|---|---|
| ~1-5MB base64 per call | ~2-10KB structured data |
| ~50,000+ tokens | ~500-2000 tokens |
| Requires vision AI | Just text/JSON |
| Coordinates via AI guessing | Exact click coordinates provided |
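
As a rough check on the token figures in this comparison, the per-call savings can be worked out directly. The numbers are the approximate values from the comparison, not measurements, and the screenshot figure is a lower bound ("50,000+"), so actual ratios may be higher.

```python
# Approximate figures taken from the comparison above.
SCREENSHOT_TOKENS = 50_000            # "~50,000+ tokens", treated as a lower bound
UI_AUTOMATION_TOKENS = (500, 2_000)   # "~500-2000 tokens"

low, high = UI_AUTOMATION_TOKENS
best_case = SCREENSHOT_TOKENS / low    # cheapest UI Automation call
worst_case = SCREENSHOT_TOKENS / high  # most expensive UI Automation call

print(f"Savings per call: about {worst_case:.0f}x to {best_case:.0f}x")
```

With these inputs the range is about 25x to 100x, broadly in line with the "~50-100x" estimate given under Features.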

Requirements

  • Windows 10/11
  • Python 3.10+
