A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of using a multimodal model to view the screen and operate a computer.
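The view-decide-act cycle described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `Action`, `run_objective`, and the three callables (`capture_screen`, `decide`, `execute`) are hypothetical names standing in for the screenshot, model call, and mouse/keyboard layers.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, for illustration)."""
    kind: str      # e.g. "click", "type", or "done"
    payload: dict  # details such as coordinates or text to type


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for one action, perform it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                    # what a human would see
        action = decide(objective, screenshot, history)  # model picks the next step
        history.append(action)
        if action.kind == "done":                        # model reports the objective is met
            break
        execute(action)                                  # mouse or keyboard input
    return history
```

Passing the three layers in as callables keeps the loop independent of any particular model or input library, and the `max_steps` cap stops a model that never reports completion.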

https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0
Self-Operating Computer
Install the project:
pip install self-operating-computer
Run the project:
operate
To change your API key later, run vim .env to open the .env file and replace the old key.

operate Modes
The default model for the project is gpt-4o, which you can use by simply typing operate. To try running OpenAI's new o1 model, use the command below.
operate -m o1-with-ocr
-m gemini-pro-vision
Try Google's gemini-pro-vision by following the instructions below. Start operate with the Gemini model:
operate -m gemini-pro-vision
Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.
-m claude-3
Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.
operate -m claude-3
-m qwen-vl
Use Qwen-vl with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Qwen dashboard to get an API key and run the command below to try it.
operate -m qwen-vl
-m llava
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently only supports macOS and Linux. Windows support is now in preview.
First, install Ollama on your machine from https://ollama.ai/download.
Once Ollama is installed, pull the LLaVA model:
ollama pull llava
This will download the model to your machine, which takes approximately 5 GB of storage.
When Ollama has finished pulling LLaVA, start the server:
ollama serve
That's it! Now start operate and select the LLaVA model:
operate -m llava
Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.
Learn more about Ollama at its GitHub repository.
--voice
The framework supports voice inputs for the objective. Try voice by following the instructions below. Clone the repo to a directory on your computer:
git clone https://github.com/OthersideAI/self-operating-computer.git
Change into the directory:
cd self-operating-computer
Install the additional requirements from requirements-audio.txt:
pip install -r requirements-audio.txt
Install device requirements.
For Mac users:
brew install portaudio
For Linux users:
sudo apt install portaudio19-dev python3-pyaudio
Run with voice mode:
operate --voice
-m gpt-4-with-ocr
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to click an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wanted to click.
Based on recent tests, OCR performs better than SoM and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply write:
operate
or operate -m gpt-4-with-ocr will also work.
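To make the hash-map idea concrete, here is a minimal sketch of how OCR results could be turned into a text-to-coordinates lookup. It is an illustration under stated assumptions, not the project's implementation: the `OcrBox` tuple layout and the `build_click_map` and `resolve_click` helpers are hypothetical, and the OCR step itself is replaced by hand-written boxes.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical OCR result: (text, left, top, right, bottom) in screen pixels.
OcrBox = Tuple[str, int, int, int, int]
Point = Tuple[int, int]


def build_click_map(boxes: List[OcrBox]) -> Dict[str, Point]:
    """Map each recognized text (lowercased) to the center of its bounding box."""
    click_map: Dict[str, Point] = {}
    for text, left, top, right, bottom in boxes:
        key = text.strip().lower()
        if key and key not in click_map:  # keep the first occurrence of duplicate text
            click_map[key] = ((left + right) // 2, (top + bottom) // 2)
    return click_map


def resolve_click(click_map: Dict[str, Point], text: str) -> Optional[Point]:
    """Return the coordinates for the text the model asked to click, or None if unknown."""
    return click_map.get(text.strip().lower())


boxes = [("Submit", 100, 200, 180, 230), ("Cancel", 200, 200, 280, 230)]
click_map = build_click_map(boxes)
print(resolve_click(click_map, "submit"))  # (140, 215)
```

The point of the lookup is that the model only has to name the text it wants to click; it never has to estimate pixel coordinates itself.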
-m gpt-4-with-som
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper: here.
For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
Start operate with the SoM model:
operate -m gpt-4-with-som
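As a rough illustration of the marking step behind SoM, the sketch below assigns a numeric label to each detected button and converts a chosen label back into a click point. It assumes the detector (for example the YOLOv8 model mentioned above) has already produced bounding boxes; the `assign_marks` and `mark_center` helpers and the numbering scheme are hypothetical, and neither drawing the labels onto the screenshot nor running the detector is shown.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Label boxes "1", "2", ... in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def mark_center(marks: Dict[str, Box], label: str) -> Tuple[int, int]:
    """Convert the label chosen by the model into the center point of its box."""
    left, top, right, bottom = marks[label]
    return ((left + right) // 2, (top + bottom) // 2)


detected = [(300, 10, 360, 40), (10, 10, 70, 40), (10, 100, 70, 130)]
marks = assign_marks(detected)
print(mark_center(marks, "2"))  # (330, 25)
```

With labels overlaid on the screenshot, the model can answer with a mark such as "2" and the code resolves that mark to coordinates.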
If you want to contribute yourself, see CONTRIBUTING.md.
For any input on improving this project, feel free to reach out to Josh on Twitter.
For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in #self-operating-computer.
- If you're new, first join our Discord server and then navigate to the #self-operating-computer channel.
Stay updated with the latest developments:
- Follow HyperWriteAI on Twitter.
- Follow HyperWriteAI on LinkedIn.
The gpt-4o model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here
$ claude mcp add self-operating-computer \
-- python -m otcore.mcp_server <graph>