<h1> Official repo for MotionGPT <img src="https://github.com/OpenMotionLab/MotionGPT_private/assets/16475892/b49667c1-198c-465b-86db-f9c3e9245dcf" width="35px"></h1>
<h2> <a href="https://motion-gpt.github.io/">MotionGPT: Human Motion as a Foreign Language</a></h2>
Project Page • Arxiv Paper • HuggingFace Demo • FAQ • Citation
https://github.com/OpenMotionLab/MotionGPT/assets/120085716/960bf6ed-0cce-4196-8e2c-1a6c5d2aea3a
MotionGPT is a unified and user-friendly motion-language model to learn the semantic coupling of two modalities and generate high-quality motions and text descriptions on multiple motion tasks.
Technical details
Although pre-trained large language models continue to advance, building a unified model for language and other multi-modal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ discrete vector quantization for human motion and convert 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this “motion vocabulary”, we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
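To make the "motion tokens" idea above concrete, here is a minimal sketch of the nearest-neighbor lookup at the heart of vector quantization. It is not the released implementation: the learned motion encoder is omitted, the tiny 2-D codebook is made up for illustration, and the `<motion_id_i>` token naming is an assumption.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every feature (T, D) and every code (K, D).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (T,) integer motion tokens


def to_token_strings(indices) -> list:
    # Hypothetical token naming, used here for illustration only.
    return [f"<motion_id_{int(i)}>" for i in indices]


# Toy codebook with 4 entries of dimension 2 (a real codebook is learned and much larger).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Stand-in for encoder outputs, one row per (downsampled) motion frame.
features = np.array([[0.1, -0.1], [0.9, 0.2], [0.1, 0.8], [0.9, 0.9], [0.9, 0.2]])

tokens = quantize(features, codebook)
token_strings = to_token_strings(tokens)
print(token_strings)
# ['<motion_id_0>', '<motion_id_1>', '<motion_id_2>', '<motion_id_3>', '<motion_id_1>']
```

Once motion is expressed as such a token sequence, it can be fed to a language model in the same way as word tokens.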
Setup and download
conda create python=3.10 --name mgpt
conda activate mgpt
Install the packages in requirements.txt and install PyTorch 2.0
pip install -r requirements.txt
We test our code on Python 3.10.6 and PyTorch 2.0.0.
Run the scripts to download the required dependencies:
bash prepare/download_smpl_model.sh
bash prepare/prepare_t5.sh
For Text to Motion Evaluation
bash prepare/download_t2m_evaluators.sh
Run the script to download the pretrained models:
bash prepare/download_pretrained_models.sh
Visit Google Drive to download the dependencies above.
Visit Hugging Face to download the pretrained models.
Webui
Run the following script to launch webui, then visit 0.0.0.0:8888:
python app.py
Batch demo
Text prompts are read from a txt file; the output motions are saved as npy files and the output texts as txt files. Please check configs/assets.yaml for the path configuration; TEST.FOLDER sets the output folder.
Then, run the following script:
python demo.py --cfg ./configs/config_h3d_stage3.yaml --example ./demos/t2m.txt
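Once the command above has finished, the generated npy files can be checked quickly from Python. The sketch below only assumes the documented output shape of (nframe, 22, 3); the file name is hypothetical, and a zero-filled stand-in file is created so the snippet runs without any demo outputs.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_motion(path) -> dict:
    """Load one generated motion and report its frame and joint counts."""
    motion = np.load(path)
    if motion.ndim != 3 or motion.shape[1:] != (22, 3):
        raise ValueError(f"unexpected shape {motion.shape}, expected (nframe, 22, 3)")
    return {"frames": motion.shape[0], "joints": motion.shape[1]}


# Stand-in file so the sketch is self-contained; point this at a real output instead.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "example_out.npy"  # hypothetical file name
    np.save(fake, np.zeros((60, 22, 3), dtype=np.float32))
    info = summarize_motion(fake)

print(info)  # {'frames': 60, 'joints': 22}
```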
Some parameters:
--example=./demos/t2m.txt: input file containing the text prompts
--task=t2m: evaluation task, one of t2m, m2t, pred, inbetween
The outputs:
npy file: the generated motions with the shape of (nframe, 22, 3)
txt file: the input text prompt or text output
Training guidance
Please refer to HumanML3D for text-to-motion dataset setup.
Copy the instruction data from prepare/instructions into the same folder as the HumanML3D dataset.
Please first check the parameters in configs/config_h3d_stage1.yaml, e.g. NAME,DEBUG.
Then, run the following command:
python -m train --cfg configs/config_h3d_stage1.yaml --nodebug
Please update the parameters in configs/config_h3d_stage2.yaml, e.g. NAME, DEBUG, PRETRAINED_VAE (set this to the path of the latest checkpoint from the previous step).
Then, run the following command:
python -m train --cfg configs/config_h3d_stage2.yaml --nodebug
Please update the parameters in configs/config_h3d_stage3.yaml, e.g. NAME, DEBUG, PRETRAINED (set this to the path of the latest checkpoint from the previous step).
Then, run the following command to store all motion tokens of the training set for convenience:
python -m scripts.get_motion_code --cfg configs/config_h3d_stage3.yaml
After that, run the following command:
python -m train --cfg configs/config_h3d_stage3.yaml --nodebug
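Stages 2 and 3 above both ask for the latest checkpoint path from the preceding stage. One way to look it up is sketched below, assuming checkpoints are saved as `*.ckpt` files somewhere under an output directory. The directory layout and file names here are hypothetical; adjust them to wherever your runs are written.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(root):
    """Return the most recently modified *.ckpt under root, or None if there is none."""
    ckpts = list(Path(root).rglob("*.ckpt"))
    return max(ckpts, key=lambda p: p.stat().st_mtime) if ckpts else None


# Self-contained demonstration with two fake checkpoints (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "stage1" / "epoch_9.ckpt"
    new = Path(tmp) / "stage1" / "epoch_19.ckpt"
    old.parent.mkdir(parents=True)
    for path, mtime in ((old, 1_000), (new, 2_000)):
        path.touch()
        os.utime(path, (mtime, mtime))  # force distinct modification times
    found = latest_checkpoint(tmp).name
    missing = latest_checkpoint(Path(tmp) / "stage1" / "no_such_dir")

print(found)  # epoch_19.ckpt
```

The returned path can then be pasted into PRETRAINED_VAE or PRETRAINED in the corresponding config.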
Please first set TEST.CHECKPOINT in configs/config_h3d_stage3.yaml to the path of the trained model checkpoint.
Then, run the following command:
python -m test --cfg configs/config_h3d_stage3.yaml --task t2m
Some parameters:
--task: evaluation task, one of t2m (Text-to-Motion), m2t (Motion translation), pred (Motion prediction), inbetween (Motion inbetween)
Due to a Python package conflict, the released implementation of the linguistic metrics for the motion translation task uses nlg-metricverse, which may not be consistent with the results computed by nlg-eval. We will fix this in the future.
Render SMPL
Refer to TEMOS-Rendering motions for blender setup, then install the following dependencies.
YOUR_BLENDER_PYTHON_PATH/python -m pip install -r prepare/requirements_render.txt
Run the following command using blender:
YOUR_BLENDER_PATH/blender --background --python render.py -- --cfg=./configs/render.yaml --dir=YOUR_NPY_FOLDER --mode=video --joint_type=HumanML3D
Then create the SMPL meshes from the npy files:
python -m fit --dir YOUR_NPY_FOLDER --save_folder TEMP_PLY_FOLDER --cuda
This outputs:
mesh npy file: the generated SMPL vertices with the shape of (nframe, 6893, 3)
ply files: the ply mesh files for blender or meshlab
Run the following command to render SMPL using blender:
YOUR_BLENDER_PATH/blender --background --python render.py -- --cfg=./configs/render.yaml --dir=YOUR_NPY_FOLDER --mode=video --joint_type=HumanML3D
Optional parameters:
--mode=video: render an mp4 video
--mode=sequence: render the whole motion in a png image
Question-and-Answer
The motivation of MotionGPT.
Answer: We present MotionGPT to address various human motion-related tasks within a single unified model by unifying motion modeling with language through a shared vocabulary. To train this unified model, we propose an instructional training scheme under the protocols for multiple motion-language tasks, which further reveals the potential of Large Language Models (LLMs) in motion tasks beyond the success of language generation. This combination is non-trivial, however, since it needs to model and generate two distinct modalities from scratch. In contrast to previous work that leverages CLIP to extract text embeddings as motion generation conditions, such as T2M-GPT, MotionGPT introduces motion-language pre-training on an LLM, so it can leverage the strong language generation and zero-shot transfer abilities of pre-trained language models and generate both human language and motion in a unified model.
Instruction tuning and zero-shot learning.
Answer: We propose instruction tuning to train a single MotionGPT across all motion-related tasks, while task-specific tuning trains and evaluates MotionGPTs on a single task. We employ these two training schemes to study the ability of MotionGPT across multiple tasks. As shown in this figure, we provide zero-shot cases. Benefitting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like "scuttling" and "barriers", and generate correct motions based on the meaning of sentences. However, it still struggles to generate unseen motions, like gymnastics, even if MotionGPTs understand the text inputs.
In view of the recent success of LLMs, MotionGPT should pay attention to unifying current available datasets to exploit the scalable potential of language models when processing large-scale data besides increasing model size.
Answer: We have faced this limited dataset issue while implementing MotionGPT and in our further research. Unifying and collecting a larger motion dataset is hard but valuable work. Fortunately, some researchers are working on this problem, as seen in recent work like Motion-X and other datasets, which hold promise for advancing large-scale motion models. We intend to further evaluate MotionGPT on these larger datasets once they become available.
How well does MotionGPT learn the relationship between motion and language?
Answer: Unlike previous motion generators that use the CLIP text encoder for conditioning, MotionGPTs leverage language models to learn the motion-language relationship instead of relying on text features from CLIP. According to our zero-shot results (cf. Fig. 12) and performances on multiple tasks (cf. Fig. 10), MotionGPTs establish robust connections between simple/complex texts and simple motions in evaluations, but they fall short when it comes to complex-text to complex-motion translation.
Why choose T5, an encoder-decoder architecture, as the base model? How about a decoder-only model, like LLaMA?
Answer: The first language model that we used to build MotionGPTs was LLaMA-13B. However, it showed insufficient performance and low training efficiency. We assume the reason is the limited dataset size compared to the large parameter count and language data of LLaMA. We also tried a smaller decoder-only backbone, GPT2-Medium, and provide the results in Tab. 15. We then chose T5-770M, a small but common language model, as our final backbone, because many previous vision-language multimodal works, like Unified-IO and BLIP, have chosen T5 and its encoder-decoder architecture, which shows strong ability on multi-modal tasks. In addition, decoder-only models have an advantage in self-supervised training without paired data, but since we do have paired data, this advantage is greatly weakened. We are still working on collecting a large motion dataset for larger motion-language models.
How are the text vocab and motion vocab merged in detail? By concatenating them together?
Answer: To ensure a shared distribution between language and motion, we initialize the motion tokens separately and concatenate them alongside the language tokens. This step ensures a balanced representation that encompasses both modalities. Besides, the token embeddings are actively trained throughout stages 2 and 3, ensuring a comprehensive fusion of language and motion knowledge.
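The concatenation described in this answer can be pictured with the NumPy sketch below. It is an illustration under stated assumptions rather than the released code: the toy text vocabulary, the `<motion_id_i>` token names, the codebook size of 512, and the 0.02 initialization scale are all placeholders.

```python
import numpy as np


def merge_vocab(text_vocab, text_emb, num_motion_codes, rng):
    """Append motion tokens to a text vocabulary and its embedding table."""
    # Hypothetical token names, one per motion codebook entry.
    motion_tokens = [f"<motion_id_{i}>" for i in range(num_motion_codes)]
    vocab = list(text_vocab) + motion_tokens
    # Motion rows are initialized separately from the (pretrained) text rows.
    motion_emb = rng.normal(scale=0.02, size=(num_motion_codes, text_emb.shape[1]))
    emb = np.concatenate([text_emb, motion_emb], axis=0)
    return vocab, emb


rng = np.random.default_rng(0)
text_vocab = ["<pad>", "a", "person", "walks"]  # toy stand-in for the text vocab
text_emb = rng.normal(size=(len(text_vocab), 16))  # stand-in for pretrained embeddings

vocab, emb = merge_vocab(text_vocab, text_emb, num_motion_codes=512, rng=rng)
print(len(vocab), emb.shape)  # 516 (516, 16)
```

The text rows are left untouched by the merge, and the motion ids simply continue after the last text id, so a single embedding table serves both modalities during later training.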
For tuning on each task, tune the entire model or just part of it?
Answer: To address individual tasks, we adopt a focused approach where the entire model is fine-tuned. Our rationale lies in the fact that, for each specific task, our emphasis is on optimizing task-specific per