
github.com/OthersideAI/self-operating-computer @v1.5.8 sqlite


58 symbols 282 edges 18 files 16 documented · 28%

README


Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of using a multimodal model to view the screen and operate a computer.
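The view-decide-act cycle described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `Action`, `run_objective`, and the `capture`, `decide`, and `execute` callables are hypothetical names, and the model is replaced by a scripted stub so the example runs without a screen or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema for illustration)."""
    operation: str        # e.g. "click", "write", "done"
    argument: str = ""    # e.g. text to type or a target description


def run_objective(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for an action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                  # the same input a human sees
        action = decide(objective, screenshot, history)
        history.append(action)
        if action.operation == "done":          # the model reports completion
            break
        execute(action)                         # mouse or keyboard output
    return history


# Stub components so the sketch runs anywhere.
scripted = iter([Action("click", "search box"), Action("write", "weather"), Action("done")])
performed: List[Action] = []

history = run_objective(
    "look up the weather",
    capture=lambda: b"fake-png-bytes",
    decide=lambda objective, shot, past: next(scripted),
    execute=performed.append,
)
print([a.operation for a in history])
```

In a real setup, `capture` would take a screenshot, `decide` would call a multimodal model, and `execute` would drive the mouse and keyboard; the `max_steps` cap is an assumed safeguard against a loop that never reaches the objective.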

Key Features

Compatibility: Designed for various multimodal models.
Integration: Currently integrated with GPT-4o, o1, Gemini Pro Vision, Claude 3, Qwen-VL and LLaVa.
Future Plans: Support for additional models.

Demo

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

Run `Self-Operating Computer`

Install the project

pip install self-operating-computer

Run the project

operate

Enter your OpenAI Key: If you don't have one, you can obtain an OpenAI key here. If you need to change your key at a later point, run vim .env to open the .env file and replace the old key.
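If you would rather script the key change than edit the file by hand, something like the sketch below could work. It assumes the key is stored as a `NAME=value` line and that the variable is called `OPENAI_API_KEY`; both are assumptions, so check your own .env first. `set_env_key` is a hypothetical helper, not part of the project.

```python
import tempfile
from pathlib import Path


def set_env_key(env_path: Path, name: str, value: str) -> None:
    """Replace the `name=...` line in a .env file, or append it if missing."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    updated = False
    for i, line in enumerate(lines):
        # Compare only the part before "=", so comments and other keys are untouched.
        if line.split("=", 1)[0].strip() == name:
            lines[i] = f"{name}={value}"
            updated = True
    if not updated:
        lines.append(f"{name}={value}")
    env_path.write_text("\n".join(lines) + "\n")


# Demonstration on a throwaway file rather than a real .env.
with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text("# local settings\nOPENAI_API_KEY=old-key\n")
    set_env_key(env_file, "OPENAI_API_KEY", "new-key")
    print(env_file.read_text())
```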

Give Terminal app the required permissions: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".

Using `operate` Modes

OpenAI models

The default model for the project is gpt-4o which you can use by simply typing operate. To try running OpenAI's new o1 model, use the command below.

operate -m o1-with-ocr

Multimodal Models `-m`

Try Google's gemini-pro-vision by following the instructions below. Start operate with the Gemini model:

operate -m gemini-pro-vision

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.

Try Claude `-m claude-3`

Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.

operate -m claude-3

Try qwen `-m qwen-vl`

Use Qwen-VL with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Qwen dashboard to get an API key and run the command below to try it.

operate -m qwen-vl

Try LLaVa Hosted Through Ollama `-m llava`

If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently supports macOS and Linux only. Windows support is now in preview.

First, install Ollama on your machine from https://ollama.ai/download.

Once Ollama is installed, pull the LLaVA model:

ollama pull llava

This will download the model to your machine, which takes approximately 5 GB of storage.

When Ollama has finished pulling LLaVA, start the server:

ollama serve

That's it! Now start operate and select the LLaVA model:

operate -m llava

Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

Learn more about Ollama at its GitHub Repository

Voice Mode `--voice`

The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt

Install device requirements. For Mac users:

brew install portaudio

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio

Run with voice mode:

operate --voice

Optical Character Recognition Mode `-m gpt-4-with-ocr`

The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can choose an element to click by its text, and the code then looks up that text in the hash map to get the coordinates of the element.
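The text-to-coordinates lookup can be pictured with the sketch below. It is an illustration under stated assumptions, not the project's implementation: `build_text_map` and `resolve_click` are hypothetical names, the OCR output is hard-coded instead of coming from an OCR library, and boxes are assumed to be `(left, top, right, bottom)` in pixels.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Point = Tuple[int, int]


def build_text_map(ocr_results: List[Tuple[str, Box]]) -> Dict[str, Point]:
    """Map each recognized text string to the center of its bounding box."""
    text_map: Dict[str, Point] = {}
    for text, (left, top, right, bottom) in ocr_results:
        center = ((left + right) // 2, (top + bottom) // 2)
        # Keep the first occurrence if the same text appears more than once.
        text_map.setdefault(text.strip().lower(), center)
    return text_map


def resolve_click(text_map: Dict[str, Point], requested_text: str) -> Optional[Point]:
    """Return coordinates for the text the model asked to click, or None."""
    return text_map.get(requested_text.strip().lower())


# Hard-coded stand-in for what an OCR pass over a screenshot might return.
ocr_results = [
    ("Sign in", (900, 20, 980, 50)),
    ("Search", (400, 300, 520, 340)),
]
text_map = build_text_map(ocr_results)
print(resolve_click(text_map, "search"))   # (460, 320)
print(resolve_click(text_map, "Log out"))  # None
```

Matching here is case-insensitive and exact; a fuzzier match might tolerate OCR misreads, at the cost of occasionally clicking the wrong element.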

Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode you can simply write:

operate

operate -m gpt-4-with-ocr will also work.

Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: here.

For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
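To make the idea concrete, the sketch below shows one way detected buttons could be turned into marks that a model can refer to by label. It makes several assumptions: the detections are hard-coded rather than produced by the YOLOv8 weights, the labels are plain numbers, and `assign_marks` and `mark_to_point` are hypothetical helpers rather than functions from this repo.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Label each detected element, ordered top-to-bottom, then left-to-right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def mark_to_point(marks: Dict[str, Box], label: str) -> Tuple[int, int]:
    """Convert the label the model picked into a click point (the box center)."""
    left, top, right, bottom = marks[label]
    return ((left + right) // 2, (top + bottom) // 2)


# Stand-in for button detections from an object detector.
detected = [(300, 400, 380, 430), (50, 100, 150, 140), (200, 100, 260, 140)]
marks = assign_marks(detected)
print(marks["1"])                 # (50, 100, 150, 140)
print(mark_to_point(marks, "2"))  # (230, 120)
```

The labels would be drawn onto the screenshot before it is sent to the model, so the model can answer with a mark instead of guessing raw pixel coordinates.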

Start operate with the SoM model

operate -m gpt-4-with-som

Contributions are Welcomed!:

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in #self-operating-computer.
- If you're new, first join our Discord Server and then navigate to #self-operating-computer.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments:
- Follow HyperWriteAI on Twitter.
- Follow HyperWriteAI on LinkedIn.

Compatibility

This project is compatible with Mac OS, Windows, and Linux (with X server installed).
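A quick pre-flight check along these lines might look like the sketch below. `is_supported` is a hypothetical helper, and treating a non-empty `DISPLAY` variable as evidence of a reachable X server is an assumption, not something the project documents.

```python
import os
import platform
from typing import Mapping, Optional


def is_supported(system: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True for macOS, Windows, or Linux with an X display configured."""
    system = system if system is not None else platform.system()
    env = env if env is not None else os.environ
    if system in ("Darwin", "Windows"):  # platform.system() reports macOS as "Darwin"
        return True
    if system == "Linux":
        # Assumption: a set DISPLAY variable means an X server is available.
        return bool(env.get("DISPLAY"))
    return False


print(is_supported("Linux", {"DISPLAY": ":0"}))  # True
print(is_supported("Linux", {}))                 # False
```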

OpenAI Rate Limiting Note

The gpt-4o model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here

Core symbols most depended-on inside this repo

capture_screen_with_cursor

called by 8

operate/utils/screenshot.py

clean_json

called by 8

operate/models/apis.py

supports_ansi

called by 7

operate/utils/style.py

get_user_prompt

called by 7

operate/models/prompts.py

get_user_first_message_prompt

called by 7

operate/models/prompts.py

call_gpt_4o

called by 5

operate/models/apis.py

confirm_system_prompt

called by 5

operate/models/apis.py

initialize_openai

called by 4

operate/config.py

Shape

Function 38

Method 17

Class 3

Languages

Python 100%

Modules by API surface

operate/models/apis.py · 13 symbols

operate/config.py · 12 symbols

evaluate.py · 7 symbols

operate/utils/operating_system.py · 5 symbols

operate/utils/label.py · 5 symbols

operate/models/prompts.py · 3 symbols

operate/exceptions.py · 3 symbols

operate/utils/screenshot.py · 2 symbols

operate/utils/ocr.py · 2 symbols

operate/utils/misc.py · 2 symbols

operate/operate.py · 2 symbols

operate/utils/style.py · 1 symbol

Dependencies from manifests, versioned

EasyProcess 1.1 · 1×

MouseInfo 0.1.3 · 1×

Pillow 10.1.0 · 1×

PyAutoGUI 0.9.54 · 1×

PyGetWindow 0.0.9 · 1×

PyMsgBox 1.0.9 · 1×

PyRect 0.2.0 · 1×

PyScreeze 0.1.29 · 1×

aiohttp 3.9.1 · 1×

annotated-types 0.6.0 · 1×

anyio 3.7.1 · 1×

certifi 2023.7.22 · 1×

For agents

$ claude mcp add self-operating-computer \
  -- python -m otcore.mcp_server <graph>

