github.com/microsoft/Magma @main sqlite

556 symbols 1,947 edges 77 files 143 documented · 26%

README

🤖 Magma: A Foundation Model for Multimodal AI Agents

Jianwei Yang^*¹^† Reuben Tan¹^† Qianhui Wu¹^† Ruijie Zheng²^‡ Baolin Peng¹^‡ Yongyuan Liang²^‡

Yu Gu¹ Mu Cai³ Seonghyeon Ye⁴ Joel Jang⁵ Yuquan Deng⁵ Lars Liden¹ Jianfeng Gao¹^▽

¹ Microsoft Research; ² University of Maryland; ³ University of Wisconsin-Madison
⁴ KAIST; ⁵ University of Washington

^* Project lead ^† First authors ^‡ Second authors ^▽ Leadership

CVPR 2025

📄 arXiv Paper 🌐 Project Page 🤗 Hugging Face Model ☁️ Azure AI Foundry 📺 Video

The Path Towards Multimodal AI Agents

:sparkles: Highlights

Digital and Physical Worlds: Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
Versatile Capabilities: Magma as a single model not only possesses generic image and video understanding ability, but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
State-of-the-art Performance: Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
Scalable Pretraining Strategy: Magma is designed to be learned scalably from unlabeled videos in the wild in addition to the existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!

:fire: News

[2025.04.29] Mind2Web and AITW with SoM prompting annotations are released on Hugging Face! We used them for our Magma downstream finetuning and reported the results in our table.
[2025.04.12] 🔥 We released the pretraining videos with visual traces on Hugging Face: Magma-Video-ToM.
[2025.04.06] Open X-Embodiment pretraining data with visual traces can be downloaded from Magma-OXE-ToM.
[2025.03.16] We released the demo code for generating SoM and ToM for instructional videos (i.e., Alg. 2 in our paper) in SoM/ToM Generation.
[2025.03.09] 🔥 We released the Magma training code and an example for training Magma-8B on the Magma-820K dataset. Check out Model Training.
[2025.03.06] We released a new demo for showing robot planning capabilities. Run python agents/robot_traj/app.py to start the demo!
[2025.02.28] We released two demos, Magma-UI and Magma-Gaming on Hugging Face. Check out our model's action grounding and planning capabilities!
[2025.02.26] ⭐ Exciting News! Magma got accepted by CVPR 2025!
[2025.02.25] 🎉 Big News! We are releasing the Magma model on Hugging Face and Azure AI Foundry!
[2025.02.23] We released the Magma Inference code!
[2025.02.20] Magma has reached the top spot on Hacker News!
[2025.02.19] We will be releasing our code, model and UI navigation demo by MSR Forum on 02.25 next Tuesday!
[2025.02.18] Our Flagship Project Magma at MSR is released on arXiv!

:bookmark_tabs: Todos

We will be releasing all the following contents:
- [x] Model inference code
- [x] Add UI and Gaming agent Demos
- [x] Model checkpoint
- [x] Training code
- [x] Open-XE pretraining data with traces
- [x] Video pretraining data with traces
- [ ] SeeClick and Vision2UI pretraining data with SoM
- [ ] UI/Libero finetuning script
- [ ] Video finetune script

:clipboard: Outline

What is Magma?
How we pretrain Magma?
Installation
Data Preprocessing
SoM and ToM Generation
Model Training
Pretraining on Open-X without SoM/ToM
Finetuning on Magma-820K
Model Usage
Inference
Evaluation with lmms-eval
Evaluation with SimplerEnv
Multi-images or Video
API Server
Agent Demos
Citation
Acknowledgements

What is Magma?

Magma is a foundation model for multimodal AI agents. As the bedrock for multimodal agentic models, it should possess strong capabilities to perceive the multimodal world and take goal-driven actions precisely (see above figure). With this in mind, we are striving for the following goals:

Verbal and spatial-temporal intelligence: Magma is supposed to have both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions in its observations, and further translate external goals into action plans and executions.
Digital and physical world: Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just as humans do.

With this in mind, we developed a new pretraining dataset, which consists mostly of unlabeled videos in the wild plus existing annotated agentic data, and a new pretraining framework, which unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.

How we pretrain Magma?

We pursue the goal through two dimensions:

Large-scale heterogeneous training data: we curate a large amount of data in the wild, including existing multimodal understanding data, UI navigation data, robotics manipulation data, and unlabeled videos. We also propose a new pipeline for collecting unlabeled videos in the wild that is scalable and cost-effective. To obtain useful action supervision from raw videos and robotics trajectories, we meticulously remove the camera motion in the videos and then transform the remaining motions into "action" supervision for model training. These provide unique signals for the model to learn cross-modal connections and long-horizon action prediction and planning.
Universal pretraining objectives: texts and actions are inherently different, which creates a large gap between them, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn cross-modal connections. More specifically, we propose Set-of-Mark and Trace-of-Mark as auxiliary pretraining tasks that bridge the different output modalities. In this way, we build a strong alignment between the text and action modalities, and also between the image and action modalities.
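
The idea of removing camera motion from point tracks before treating the remainder as action signal can be illustrated with a toy example. The sketch below is not the method used in the paper: it assumes the camera motion is a pure 2D translation and estimates it as the median per-frame displacement over all tracked points. The function names `stabilize_tracks` and `moving_tracks` and the threshold are hypothetical.

```python
from statistics import median


def stabilize_tracks(tracks):
    """Subtract an estimated global translation from every point track.

    tracks: list of N tracks, each a list of T (x, y) positions.
    Assumption: camera motion is a pure translation, estimated per frame
    as the median displacement over all tracks.
    """
    num_frames = len(tracks[0])
    cam_x, cam_y = 0.0, 0.0  # accumulated camera translation
    stabilized = [[track[0]] for track in tracks]
    for t in range(1, num_frames):
        dxs = [track[t][0] - track[t - 1][0] for track in tracks]
        dys = [track[t][1] - track[t - 1][1] for track in tracks]
        cam_x += median(dxs)
        cam_y += median(dys)
        for out, track in zip(stabilized, tracks):
            out.append((track[t][0] - cam_x, track[t][1] - cam_y))
    return stabilized


def moving_tracks(stabilized, min_displacement=1.0):
    """Return indices of tracks whose start-to-end displacement is large enough."""
    keep = []
    for index, track in enumerate(stabilized):
        dx = track[-1][0] - track[0][0]
        dy = track[-1][1] - track[0][1]
        if (dx * dx + dy * dy) ** 0.5 >= min_displacement:
            keep.append(index)
    return keep


# Three background points that only move with the camera (+1 px per frame in x)
# and one object that additionally moves +2 px per frame on its own.
background = [[(x0 + t, y0) for t in range(4)] for x0, y0 in [(0, 0), (10, 5), (20, 8)]]
moving_object = [(5 + 3 * t, 5) for t in range(4)]
tracks = background + [moving_object]

stabilized = stabilize_tracks(tracks)
print(moving_tracks(stabilized))
```

In this toy setup only the last track survives the filter, and its stabilized positions describe the object's own motion. A real pipeline would need a richer camera model than a single translation.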

Installation

Clone this repo to your local machine:

git clone https://github.com/microsoft/Magma
cd Magma

Install the dependencies:

conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .

Install packages for training:

pip install -e ".[train]"

Install packages for agents:

pip install -e ".[agent]"

Other packages you will probably need:
Co-tracker

# Install co-tracker
git clone https://github.com/facebookresearch/co-tracker
cd co-tracker
pip install -e .
pip install "imageio[ffmpeg]"
cd ../

Kmeans

# Install kmeans_pytorch from source; note: installing the package directly with pip leads to an error
git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
pip install -e .
cd ../

Misc

# Install other packages
pip install ipython
pip install faiss-cpu
pip install decord

⚠️ Please make sure you have installed the correct version of transformers (>=4.49.0). If you see abnormal behavior, check your transformers version first, and see the customized transformers section below.
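
A quick way to check the installed version from Python is sketched below. The helpers `parse_version` and `meets_minimum` are illustrative and not part of Magma; the parsing is deliberately simple and only looks at the leading numeric components of the version string. It compares version numbers only, so it cannot tell whether the gamma fix described under Customized Transformers is present.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = "4.49.0"  # minimum stated in the note above


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """True if `installed` is at least `minimum` (numeric components only)."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so "4.49" and "4.49.0" compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "4.9.0" sorts after "4.49.0" as text.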

Customized Transformers

⚠️ One important thing to note is that our model uses ConvNeXt as the backbone, which contains a layer-scale parameter gamma. This triggers a bug in the Transformers library, which automatically replaces 'gamma' with 'weight' when loading the model. To fix this, we need to modify the 'transformers/models/auto/modeling_auto.py' file as follows:

if "gamma" in key and "clip_vision_model" not in key:
    key = key.replace("gamma", "weight")
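
To make the effect of this guard concrete, the sketch below applies the same condition to a few made-up checkpoint keys. The function name `rename_gamma_keys` and the key strings are hypothetical; the actual renaming happens inside Transformers and may differ in detail.

```python
def rename_gamma_keys(keys):
    """Apply the guarded gamma -> weight rename to a list of state-dict keys."""
    renamed = []
    for key in keys:
        # Same condition as the patch above: keys containing
        # "clip_vision_model" keep their "gamma", all others are renamed.
        if "gamma" in key and "clip_vision_model" not in key:
            key = key.replace("gamma", "weight")
        renamed.append(key)
    return renamed


# Hypothetical key names, for illustration only.
example_keys = [
    "vision_tower.clip_vision_model.stages.0.blocks.0.gamma",
    "some_module.layer_norm.gamma",
    "language_model.layers.0.mlp.weight",
]

for before, after in zip(example_keys, rename_gamma_keys(example_keys)):
    print(f"{before} -> {after}")
```

Without the second half of the condition, the first key would also be renamed and the backbone's gamma parameter would no longer match its name in the checkpoint.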

This bug still exists in the latest transformers version, so please make sure you install the following bug-free customized version of transformers as listed in pyproject.toml:

pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.44.1

or the newest version:

pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2

Data Preprocessing

SoM and ToM Generation

As shown in Table 1 of our paper, we apply SoM and ToM on both robotics data and instructional videos. To ensure reproducibility, we provide the code to generate SoM and ToM for instructional videos. The code is located in tools/som_tom/demo.py. You can run the following command to generate SoM and ToM:

python tools/som_tom/demo.py

You can then find two videos in the tools/som_tom/videos folder. The original trace extracted from CoTracker is shown in orig_trace.mp4, and the SoM-ToM video is named som_tom.mp4.

Model Training

We provide instructions to pretrain Llama-3-8B-Instruct on Open-X-Embodiment and to finetune Magma-8B on different downstream tasks.

Pretraining on Open-X without SoM/ToM

Data Preparation

Download Open-X-Embodiment from the official site. Then edit the data config file openx.yaml accordingly. The data config file should look like this:

# a list of all the data paths
DATA_PATH: 
  - "/path/to/open-x"
IMAGE_FOLDER:
  - "siglip-224px+mx-oxe-magic-soup"    
LANGUAGE_PATH:
  - ""

Pretrain on OpenX

Once the dataset and config are set up, you can run the following command to pretrain the model:

sh scripts/pretrain/pretrain_openx.sh

Benefit: We spent tremendous effort decoupling the Open-X dataloader from OpenVLA and making it compatible with the other datasets used in our experiments.

Finetuning on Magma-820K

Data Preparation

Download the annotation file from MagmaAI/Magma-820K. Please prepare the image data according to the dataset list on the dataset page. Once finished, edit the magma_820k.yaml file accordingly.

# a list of all the data paths
DATA_PATH: 
  - "/path/to/magma_820k.json"
IMAGE_FOLDER:
  - "/root/to/magma_820k/images"
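
Before launching a run, it can help to confirm that the paths in the config actually resolve. The sketch below assumes the YAML has already been parsed into a dictionary (for example with PyYAML's yaml.safe_load) and that every entry is a local filesystem path; `check_data_config` is a hypothetical helper, not part of the Magma codebase.

```python
from pathlib import Path


def check_data_config(config):
    """Return a list of human-readable problems found in a parsed data config."""
    problems = []
    for key in ("DATA_PATH", "IMAGE_FOLDER"):
        entries = config.get(key)
        if not isinstance(entries, list) or not entries:
            problems.append(f"{key} must be a non-empty list")
            continue
        for entry in entries:
            # Assumption: each entry is a string naming a local path.
            if not isinstance(entry, str) or not Path(entry).exists():
                problems.append(f"{key} entry does not exist: {entry}")
    return problems


if __name__ == "__main__":
    # Mirrors the structure of the magma_820k.yaml example above.
    config = {
        "DATA_PATH": ["/path/to/magma_820k.json"],
        "IMAGE_FOLDER": ["/root/to/magma_820k/images"],
    }
    for problem in check_data_config(config):
        print(problem)
```

With the placeholder paths shown, the script reports both entries as missing; an empty result means every listed path exists.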

Finetune from Magma-8B

Once the dataset and config are set up, you can run the following command to finetune the model:

Core symbols most depended-on inside this repo

flatten · called by 18 · tools/lmms-eval-magma/magma.py
box_area · called by 18 · agents/ui_agent/util/utils.py
_get_frame · called by 14 · data/conversations.py
invert_gripper_actions · called by 13 · data/openx/datasets/rlds/utils/data_utils.py
_construct_conv_som · called by 12 · data/conversations.py
decode · called by 11 · magma/processing_magma.py
get_text_size · called by 10 · data/utils/som_tom.py
batch_decode · called by 8 · magma/processing_magma.py

Shape

Function 233

Method 190

Class 129

Route 4

Languages

Python 100%

Modules by API surface

data/openx/datasets/rlds/oxe/transforms.py · 58 symbols
data/openx/conf/models.py · 53 symbols
magma/modeling_magma.py · 39 symbols
agents/ui_agent/util/utils.py · 20 symbols
tools/lmms-eval-magma/magma.py · 18 symbols
data/openx/datasets/datasets.py · 17 symbols
trainer/trainer.py · 16 symbols
train.py · 15 symbols
data/openx/datasets/rlds/utils/data_utils.py · 14 symbols
data/openx/conf/vla.py · 14 symbols
agents/ui_agent/util/som.py · 14 symbols
data/dataset.py · 13 symbols

Dependencies from manifests, versioned

accelerate 0.34.2 · 1×
bddl 1.0.1 · 1×
bitsandbytes 0.44.1 · 1×
easydict 1.9 · 1×
gym 0.25.2 · 1×
peft 0.4.0 · 1×
pydantic 2.0 · 1×
pytorch-lightning 1.0.8 · 1×
robosuite 1.4.0 · 1×
sentencepiece 0.1.99 · 1×
shortuuid · 1×
tokenizers 0.15.0 · 1×

For agents

$ claude mcp add Magma \
  -- python -m otcore.mcp_server <graph>
