AUV

AUV means Application Use Via ....

Think of it as programmable computer use, without agents.

Install

Install Rust first. This workspace uses the Rust 2024 edition, which needs at least Rust 1.85, and currently requires the toolchain declared in Cargo.toml.

Install directly from GitHub:

cargo install --git https://github.com/moeru-ai/auv auv-cli
auv --help

After installation, use the auv CLI directly:

auv --help
auv invoke --help

Setup

macOS Permissions

macOS automation needs OS permissions granted to the process that launches AUV, usually your terminal app.

Open System Settings -> Privacy & Security and enable:

| Permission | Needed for |
| --- | --- |
| Accessibility | AX tree reads, focused element control, keyboard/pointer automation. |
| Screen Recording | Screenshots, OCR, visual inspection, and evidence capture. |
| Automation | AppleScript/System Events activation flows used by app probes and some drivers. |

After changing permissions, restart the terminal process and rerun:

auv permissions check
auv app probe com.apple.TextEdit

Understand AUV

For Cua, agent-browser, and similar computer-use projects, it is common to execute screenshot, read image, click, type, wait, and follow-up verification steps in sequence, then ask LLMs or agents to judge the next move.

flowchart LR
  A[Agent] --> B[screenshot]
  B --> C[read image]
  C --> D[decide next step]
  D --> E[click]
  E --> F[wait]
  F --> G[type]
  G --> H[verify]
  H --> D

Many of those repeated sequences can be squashed into reusable GUI operations. Opening an app, waiting for readiness, filling a form, and checking the result should be callable as one command instead of spending tokens on the same step-by-step loop every time.

Modern agents often use skills or project instructions to orchestrate tool calls, CLIs, and scripts. But built-in computer-use surfaces, such as OpenAI Computer Use or Claude Computer Use, are still primarily interactive model-tool loops, not scriptable GUI automation libraries.

Tool-call loop:

• Ran screenshot
  └ saved screen.png
• Ran read image screen.png
  └ form is visible
• Ran click "Email"
  └ clicked
• Ran type "user@example.com"
  └ typed
• Ran screenshot
  └ saved after.png
• Ran verify form state
  └ ready

Rust script:

pub fn open_and_fill_form(
  app: &mut AppSession,
  data: FormData,
) -> AuvResult<OperationResult> {
  app.open()?;
  app.wait_for_ready()?;
  app.fill(data)?;
  app.verify_submitted()
}

Tool-call loop:

• Ran screenshot
  └ saved page-1.png
• Ran OCR visible rows
  └ 12 rows
• Ran scroll
  └ scrolled down
• Ran OCR visible rows
  └ 10 rows, 4 repeated
• Ran guess when to stop
  └ uncertain

Rust script:

pub fn scan_visible_rows(
  region: &mut WindowRegion,
) -> AuvResult<ScrollScanArtifact> {
  region.scan_rows_until_stop()
}

Tool-call loop:

• Ran click target
  └ clicked
• Ran screenshot
  └ saved after-click.png
• Ran semantic check
  └ mismatch
• Ran retry manually
  └ repeated tool loop

Rust script:

pub fn verify_and_retry<F>(
  mut operation: F,
) -> AuvResult<OperationResult>
where
  F: FnMut() -> AuvResult<OperationResult>,
{
  retry_until_verified(&mut operation)
}
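To make the third comparison concrete, here is a minimal, self-contained sketch of what a retry-until-verified helper could look like. It is not AUV's implementation: `Attempt`, `RetryError`, `retry_until_verified_sketch`, and the `max_attempts` parameter are all hypothetical names introduced for illustration, and the real `retry_until_verified` may behave differently.

```rust
/// Hypothetical result of one attempt: the operation ran, and a follow-up
/// check reports whether the GUI reached the expected state.
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub verified: bool,
    pub note: String,
}

/// Hypothetical error type for this sketch (not AUV's error type).
#[derive(Debug, PartialEq)]
pub enum RetryError {
    /// Every attempt ran, but none passed verification.
    Exhausted { attempts: u32, last_note: String },
    /// The operation itself failed, so no further attempt is made.
    Failed(String),
}

/// Runs `operation` up to `max_attempts` times and returns the 1-based
/// attempt number together with the first verified outcome.
pub fn retry_until_verified_sketch<F>(
    max_attempts: u32,
    mut operation: F,
) -> Result<(u32, Attempt), RetryError>
where
    F: FnMut(u32) -> Result<Attempt, String>,
{
    let mut last_note = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        // A hard failure stops the loop immediately.
        let outcome = operation(attempt).map_err(RetryError::Failed)?;
        if outcome.verified {
            return Ok((attempt, outcome));
        }
        // Keep the latest mismatch so the caller can see why it gave up.
        last_note = outcome.note;
    }
    Err(RetryError::Exhausted {
        attempts: max_attempts,
        last_note,
    })
}
```

The point of the design is that the caller receives either a verified outcome or a structured reason for giving up, so the retry decision no longer costs a model round trip.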

Similar to Playwright, AUV expects agents to write, test, and improve reusable GUI automations for E2E tests and rapid application actions.

AUV is not a computer-use agent. It does not ship an agent or harness. It offers tools, CLIs, drivers, and verifiable observable results so agents can build reusable GUI operations.

AUV is meant to work with coding agents and agent products such as:

That means:

  • If your agent can call a CLI, AUV can be used as computer use.
  • If your agent can write code, AUV can save tokens by moving repeated GUI work into Rust commands today, with JavaScript/TypeScript and Python bindings planned after the contracts settle. Once a GUI flow is finalized as a command, repeated execution can approach zero reasoning-token cost.
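As a rough illustration of the first point, any program that can spawn a process can drive the CLI. The sketch below builds the `auv app probe` invocation shown in the Setup section using only the Rust standard library. The helper names (`auv_command`, `probe_command`, `run_captured`) are hypothetical, the code assumes `auv` is on `PATH`, and it makes no assumption about AUV's exit codes or output format.

```rust
use std::io;
use std::process::{Command, Output};

/// Builds an `auv` invocation without running it.
/// Assumption: the `auv` binary is installed and on PATH.
pub fn auv_command(args: &[&str]) -> Command {
    let mut cmd = Command::new("auv");
    cmd.args(args);
    cmd
}

/// Hypothetical helper mirroring `auv app probe <bundle-id>`.
pub fn probe_command(bundle_id: &str) -> Command {
    auv_command(&["app", "probe", bundle_id])
}

/// Runs the command and captures its exit status, stdout, and stderr.
/// How to interpret them is left to the caller.
pub fn run_captured(mut cmd: Command) -> io::Result<Output> {
    cmd.output()
}
```

A caller would pass `probe_command("com.apple.TextEdit")` to `run_captured` and inspect the returned `Output`.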

Why even build AUV?

AUV was born from the hands-on experience of building general gaming agents for Project AIRI. Since 2024, we have been building agents that let LLMs play the following games; you can find how we implemented the agents in the following repos:

We implemented more games, which you can find in the Project AIRI organization, but these four require YOLO, OCR, screen understanding, and computer-use capabilities.

Now you have a framework to build for any application or game.

Since Vercel published agent-browser, we fell in love with it and have used it to help agents build many web projects. But the loop it requires, in which an agent calls the agent-browser CLI for every command, is slow and inefficient, and in the computer-use world many operations are repeated thousands of times. Playwright and Vitest let us write E2E tests for applications, so why not extend the idea of writing code to control applications to computer use?

Capability Matrix

What AUV can do, compared to other computer-use projects.

| Capability | AUV | Cua | OpenBridge / KWWKComputerUseCore | Playwright |
| --- | --- | --- | --- | --- |
| Agent model | 💡 BYOA | 💡 BYOA + Built-in Agent | 💡 BYOA + Built-in Agent | ❌ |
| Scriptable | ✅ Rust ⏳ JS/TS/Python | ⚠️ Tools only | ⚠️ Swift only | ✅ JS/TS/Python/... |
| Multi-driver | ✅ macOS/Windows ⏳ Linux/Android/iOS | ✅ | ❌ | ❌ |
| CLI | ✅ | ✅ | ❌ | ⚠️ via user scripts |
| MCP | ✅ | ✅ | ❌ | ❌ |
| RL Trajectory | ✅ runs + o11y (OTEL compatible) + artifacts | ⚠️ recordings | ❌ | ✅ |
| Screenshot | ✅ | ✅ | ✅ | ✅ browser only |
| OCR | ✅ BYOK / OS OCR | ⚠️ BYOK | ❌ | ❌ |
| Image Match | ✅ | ✅ | ❌ | ❌ user code only |
| AX (Accessibility Tree) | ✅ macOS/Windows | ✅ macOS | ✅ macOS | ⚠️ browser only |
| AX Actions | ✅ | ✅ | ✅ | ⚠️ browser only |
| Mouse / Click | ✅ | ✅ | ✅ | ⚠️ browser only |
| Virtual Mouse / Background | ✅ macOS/Windows | ✅ macOS focused | ✅ macOS focused | ⚠️ browser only |
| Virtual Mouse / Foreground HID | ✅ | ✅ | ❌ | ⚠️ browser only |
| Keyboard | ✅ | ✅ | ✅ | ⚠️ browser only |
| Scroll | ✅ | ✅ | ✅ | ⚠️ browser only |
| Scroll to List | ✅ | ❌ | ❌ | ✅ |
| Bundled for Apps | ✅ | ❌ | ❌ | ❌ |
| Feedback | ✅ Agent understands whether it clicked or typed | ⚠️ tool outputs | ⚠️ structured metadata | ⚠️ assertions/traces |
| SLM friendly | ✅ Bundled for Apps | ⚠️ Agent orchestrated | ⚠️ Agent orchestrated | ✅ |
| YOLO / Custom Models | ✅ | ✅ | ❌ | ❌ |

  • Scroll scan is a major reason AUV exists. Most desktop automation stacks can scroll or read a screenshot, but they do not turn a native app's visual list into page records, row candidates, crop artifacts, OCR fragments, and inspectable stop reasons. AUV's current scroll-scan implementation is still contract work, so the old public scan window-region CLI was removed until the reusable API is clear.
  • Feedback means the automation returns machine-readable evidence after an attempt: what input path was used, what changed, what artifacts were captured, whether verification passed, and why an operation should retry, stop, or fail.
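The two notes above can be sketched together: a scroll scan that merges overlapping pages of OCR'd rows and reports an inspectable stop reason instead of leaving the decision to a guess. This is only an illustration under stated assumptions, not AUV's scroll-scan contract (which the text says is still being settled). `StopReason`, `PageRecord`, `ScanSketch`, and `scan_rows_sketch` are hypothetical names, rows are compared by exact text, and each item yielded by `pages` stands in for one screenshot-plus-OCR pass after a scroll.

```rust
use std::collections::HashSet;

/// Hypothetical stop reasons for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StopReason {
    /// A page contributed no rows that had not been seen before.
    NoNewRows,
    /// More pages were available, but the page budget ran out.
    PageLimit,
    /// The page source had nothing more to offer.
    SourceExhausted,
}

/// Per-page evidence: how many rows were visible and how many were repeats.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PageRecord {
    pub index: usize,
    pub visible_rows: usize,
    pub repeated_rows: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScanSketch {
    pub rows: Vec<String>,
    pub pages: Vec<PageRecord>,
    pub stop_reason: StopReason,
}

/// Merges pages of row text in order, dropping rows already seen
/// (including duplicates within the same page).
pub fn scan_rows_sketch<I>(pages: I, max_pages: usize) -> ScanSketch
where
    I: IntoIterator<Item = Vec<String>>,
{
    let mut seen = HashSet::new();
    let mut rows = Vec::new();
    let mut records = Vec::new();
    let mut stop_reason = StopReason::SourceExhausted;

    for (index, page) in pages.into_iter().enumerate() {
        // Pulling one page past the budget is how this sketch learns that
        // more content existed; that extra page is discarded.
        if index >= max_pages {
            stop_reason = StopReason::PageLimit;
            break;
        }
        let visible_rows = page.len();
        let mut repeated_rows = 0;
        for row in page {
            if seen.insert(row.clone()) {
                rows.push(row);
            } else {
                repeated_rows += 1;
            }
        }
        records.push(PageRecord { index, visible_rows, repeated_rows });
        // An empty page or a page of pure repeats ends the scan.
        if repeated_rows == visible_rows {
            stop_reason = StopReason::NoNewRows;
            break;
        }
    }

    ScanSketch { rows, pages: records, stop_reason }
}
```

The per-page records play the role of the feedback described above: a caller can see how many rows each page showed, how many were repeats, and why the scan ended.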

Development

cargo fmt --check
cargo check
cargo test
git diff --check
cargo run -- --help
cargo run -- invoke --help

Useful entrypoints:

auv app probe <bundle-id>
auv app analyze .auv/app-probes/<probe>/probe.json
auv invoke <command-id> --help
auv inspect <run-id>

Use docs/TERMS_AND_CONCEPTS.md for shared vocabulary. Durable design and evidence notes live under docs/ai/references/.

Related

Note

This project is part of the Project AIRI ecosystem.

Acknowledgements

Special Thanks

Special thanks to all contributors for their contributions to auv ❤️

License

Apache License 2.0
