OthersideAI/self-operating-computer
Fork: 1229 Star: 9117 (updated 2025-01-15 01:44:11)
license: MIT
Language: Python
A framework to enable multimodal models to operate a computer.
Latest release: v1.4.6 (2024-07-10 00:14:50)
Self-Operating Computer Framework
A framework to enable multimodal models to operate a computer.
Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
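The look, decide, act cycle described above can be sketched as a small loop. This is only an illustration under stated assumptions: the `Action` shape and the `capture_screen`, `choose_action`, and `perform` callbacks are hypothetical names introduced here, not the project's actual API, and stand-ins replace the real screen and model so the sketch runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen by the model (hypothetical shape, not the project's schema)."""
    kind: str        # "click", "type", or "done"
    x: int = 0       # screen coordinates, used by "click"
    y: int = 0
    text: str = ""   # characters to send, used by "type"


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    choose_action: Callable[[str, bytes, List[Action]], Action],
    perform: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat look -> decide -> act until the model reports "done" or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                           # what a human would see
        action = choose_action(objective, screenshot, history)  # the model's decision
        history.append(action)
        if action.kind == "done":
            break
        perform(action)                                         # mouse or keyboard input
    return history


# Stand-ins so the sketch runs without a real screen or model.
scripted = iter([Action("click", 40, 60), Action("type", text="hello"), Action("done")])
performed: List[Action] = []
history = run_objective(
    "say hello",
    capture_screen=lambda: b"",
    choose_action=lambda objective, shot, past: next(scripted),
    perform=performed.append,
)
print([a.kind for a in history])
```

In a real setup the three callbacks would presumably wrap a screenshot utility, a multimodal model call, and an input-automation library; the `max_steps` cap is a design choice added here so a confused model cannot loop forever.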
Key Features
- Compatibility: Designed for various multimodal models.
- Integration: Currently integrated with GPT-4o, Gemini Pro Vision, Claude 3 and LLaVa.
- Future Plans: Support for additional models.
Ongoing Development
At HyperwriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.
Agent-1-Vision Model API Access
We will soon be offering API access to our Agent-1-Vision model.
If you're interested in gaining access to this API, sign up here.
Demo
Run Self-Operating Computer
- Install the project
pip install self-operating-computer
- Run the project
operate
- Enter your OpenAI Key: If you don't have one, you can obtain an OpenAI key here. If you need to change your key at a later point, run vim .env to open the .env file and replace the old key.
- Give Terminal app the required permissions: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".
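For the key step above, the following minimal sketch shows one way to check which key a .env file currently holds before editing it. It assumes the file uses plain KEY=VALUE lines and that the variable is named OPENAI_API_KEY; both are assumptions made for illustration, and `read_env_value` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, name: str) -> Optional[str]:
    """Return the value assigned to `name` in a KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.exists():
        return None
    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")  # drop optional surrounding quotes
    return None


if __name__ == "__main__":
    # OPENAI_API_KEY is an assumed variable name; adjust it to match your .env file.
    key = read_env_value(".env", "OPENAI_API_KEY")
    print("key found" if key else "no key found")
```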
Using operate
Modes
Multimodal Models -m
An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.
Start operate with the Gemini model:
operate -m gemini-pro-vision
Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.
Try Claude -m claude-3
Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the Claude dashboard to get an API key and run the command below to try it.
operate -m claude-3
Try LLaVa Hosted Through Ollama -m llava
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
Note: Ollama currently only supports macOS and Linux.
First, install Ollama on your machine from https://ollama.ai/download.
Once Ollama is installed, pull the LLaVA model:
ollama pull llava
This will download the model to your machine, which takes approximately 5 GB of storage.
When Ollama has finished pulling LLaVA, start the server:
ollama serve
That's it! Now start operate and select the LLaVA model:
operate -m llava
Important: Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.
Learn more about Ollama at its GitHub Repository
Voice Mode --voice
The framework supports voice inputs for the objective. Try voice by following the instructions below.
Clone the repo to a directory on your computer:
git clone https://github.com/OthersideAI/self-operating-computer.git
Change into the directory:
cd self-operating-computer
Install the additional requirements from requirements-audio.txt:
pip install -r requirements-audio.txt
Install device requirements.
For Mac users:
brew install portaudio
For Linux users:
sudo apt install portaudio19-dev python3-pyaudio
Run with voice mode
operate --voice
Optical Character Recognition Mode -m gpt-4-with-ocr
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the gpt-4-with-ocr mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to click elements by text, and the code then looks up that text in the hash map to get the coordinates of the element GPT-4 wanted to click.
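The lookup described above can be sketched in a few lines. This is an illustration only, not the project's implementation: the OCR output format (a text string with a left, top, right, bottom box), the lower-casing of keys, and the helper names `build_text_index` and `resolve_click` are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def build_text_index(ocr_results: Iterable[Tuple[str, Box]]) -> Dict[str, Tuple[float, float]]:
    """Map each recognized text (lower-cased) to the center of its bounding box."""
    index: Dict[str, Tuple[float, float]] = {}
    for text, (left, top, right, bottom) in ocr_results:
        center = ((left + right) / 2, (top + bottom) / 2)
        # Keep the first occurrence when the same text appears more than once.
        index.setdefault(text.strip().lower(), center)
    return index


def resolve_click(index: Dict[str, Tuple[float, float]], requested_text: str) -> Optional[Tuple[float, float]]:
    """Return the coordinates for the text the model asked to click, or None."""
    return index.get(requested_text.strip().lower())


# Hypothetical OCR output for a dialog with two buttons.
ocr_results = [
    ("Submit", (100, 200, 180, 230)),
    ("Cancel", (200, 200, 280, 230)),
]
index = build_text_index(ocr_results)
print(resolve_click(index, "submit"))
```

The appeal of this design is that the model only has to name the text it wants to click, while the exact pixel coordinates come from the OCR pass rather than from the model's own estimate.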
Based on recent tests, OCR performs better than som and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply write:
operate
Running operate -m gpt-4-with-ocr will also work.
Set-of-Mark Prompting -m gpt-4-with-som
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the gpt-4-with-som command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
Learn more about SoM Prompting in the detailed arXiv paper: here.
For this initial version, a simple YOLOv8 model is trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
Start operate with the SoM model:
operate -m gpt-4-with-som
Contributions Are Welcome!
If you want to contribute yourself, see CONTRIBUTING.md.
Feedback
For any input on improving this project, feel free to reach out to Josh on Twitter.
Join Our Discord Community
For real-time discussions and community support, join our Discord server.
- If you're already a member, join the discussion in #self-operating-computer.
- If you're new, first join our Discord Server and then navigate to the #self-operating-computer channel.
Follow HyperWriteAI for More Updates
Stay updated with the latest developments:
Compatibility
- This project is compatible with Mac OS, Windows, and Linux (with X server installed).
OpenAI Rate Limiting Note
The gpt-4o model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here
Recent releases (data updated 2024-09-30 15:49:44):
2024-07-10 00:14:50 v1.4.6
2024-03-21 22:44:59 v1.4.5
2024-03-20 23:25:17 v1.4.2
2024-03-20 22:56:12 v1.4.1
2024-03-20 22:49:58 v1.4.0
2024-02-17 09:15:28 v1.3.2
2024-02-10 04:31:41 v1.3.1
2024-02-09 13:28:51 v1.3.0
2024-02-03 06:47:04 v1.2.9
2024-01-26 00:30:43 v1.2.8
Topics:
automation, openai, pyautogui