github.com/fudan-generative-vision/hallo @v1.0.0

289 symbols 1,011 edges 35 files 199 documented · 69%
README

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Mingwang Xu¹*, Hui Li¹*, Qingkun Su¹*, Hanlin Shang¹, Liwei Zhang¹, Ce Liu³, Jingdong Wang², Yao Yao⁴, Siyu Zhu¹

¹Fudan University  ²Baidu Inc  ³ETH Zurich  ⁴Nanjing University

  • GitHub: https://github.com/fudan-generative-vision/hallo
  • Project homepage: https://fudan-generative-vision.github.io/hallo/#/
  • Paper (arXiv): https://arxiv.org/pdf/2406.08801
  • HuggingFace demo: https://huggingface.co/spaces/fudan-generative-ai/hallo
  • HuggingFace model: https://huggingface.co/fudan-generative-ai/hallo
  • ModelScope model: https://www.modelscope.cn/models/fudan-generative-vision/Hallo/summary
  • WeChat: assets/wechat.jpeg

📸 Showcase

https://github.com/fudan-generative-vision/hallo/assets/17402682/9d1a0de4-3470-4d38-9e4f-412f517f834c

🎬 Honoring Classic Films

Devil Wears Prada | Green Book | Infernal Affairs
Patch Adams | Tough Love | Shawshank Redemption

Explore more examples.

📰 News

  • 2024/06/28: 🎉🎉🎉 We are proud to announce the release of our model training code. Try it with your own training data; a tutorial is available.
  • 2024/06/21: 🚀🚀🚀 Cloned a Gradio demo to a 🤗Huggingface space.
  • 2024/06/20: 🌟🌟🌟 Received numerous contributions from the community, including a Windows version, ComfyUI, WebUI, and Docker template.
  • 2024/06/15: ✨✨✨ Released some images and audio clips for inference testing on 🤗Huggingface.
  • 2024/06/15: 🎉🎉🎉 Launched the first version on 🫡GitHub.

🤝 Community Resources

Explore the resources developed by our community to enhance your experience with Hallo:

Thanks to all of them.

Join our community and explore these resources to get the most out of Hallo. Enjoy, and elevate your creative projects!

🔧️ Framework

[Figures: abstract; framework]

⚙️ Installation

  • System requirements: Ubuntu 20.04/Ubuntu 22.04, CUDA 12.1
  • Tested GPUs: A100

Create conda environment:

  conda create -n hallo python=3.10
  conda activate hallo

Install packages with pip:

  pip install -r requirements.txt
  pip install .

ffmpeg is also required:

  apt-get install ffmpeg
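
The prerequisites above can be checked before running anything. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and it only checks the Python version used in the conda command and whether an `ffmpeg` executable is on PATH. It does not verify CUDA or the pip packages.

```python
import shutil
import sys


def check_environment(expected_python=(3, 10)):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # The conda command above creates a Python 3.10 environment.
    if tuple(sys.version_info[:2]) != tuple(expected_python):
        problems.append(
            f"Python {expected_python[0]}.{expected_python[1]} expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```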

🗝️️ Usage

The entry point for inference is scripts/inference.py. Running your own cases takes three steps:

  1. Download all required pretrained models.
  2. Prepare source image and driving audio pairs.
  3. Run inference.

📥 Download Pretrained Models

All pretrained models required for inference are available from our HuggingFace repo.

Clone the pretrained models into the ${PROJECT_ROOT}/pretrained_models directory with the commands below:

git lfs install
git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models

Alternatively, you can download them separately from their source repos:

Finally, these pretrained models should be organized as follows:

./pretrained_models/
|-- audio_separator/
|   |-- download_checks.json
|   |-- mdx_model_data.json
|   |-- vr_model_data.json
|   `-- Kim_Vocal_2.onnx
|-- face_analysis/
|   `-- models/
|       |-- face_landmarker_v2_with_blendshapes.task  # face landmarker model from mediapipe
|       |-- 1k3d68.onnx
|       |-- 2d106det.onnx
|       |-- genderage.onnx
|       |-- glintr100.onnx
|       `-- scrfd_10g_bnkps.onnx
|-- motion_module/
|   `-- mm_sd_v15_v2.ckpt
|-- sd-vae-ft-mse/
|   |-- config.json
|   `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5/
|   `-- unet/
|       |-- config.json
|       `-- diffusion_pytorch_model.safetensors
`-- wav2vec/
    `-- wav2vec2-base-960h/
        |-- config.json
        |-- feature_extractor_config.json
        |-- model.safetensors
        |-- preprocessor_config.json
        |-- special_tokens_map.json
        |-- tokenizer_config.json
        `-- vocab.json
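
Since a partial clone or an interrupted download is easy to miss, the layout above can be verified with a short script. This is a sketch under the assumption that every file in the tree is required; the names `EXPECTED_FILES` and `missing_pretrained_files` are hypothetical and do not exist in the repository.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "audio_separator/download_checks.json",
    "audio_separator/mdx_model_data.json",
    "audio_separator/vr_model_data.json",
    "audio_separator/Kim_Vocal_2.onnx",
    "face_analysis/models/face_landmarker_v2_with_blendshapes.task",
    "face_analysis/models/1k3d68.onnx",
    "face_analysis/models/2d106det.onnx",
    "face_analysis/models/genderage.onnx",
    "face_analysis/models/glintr100.onnx",
    "face_analysis/models/scrfd_10g_bnkps.onnx",
    "motion_module/mm_sd_v15_v2.ckpt",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.safetensors",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.safetensors",
    "wav2vec/wav2vec2-base-960h/config.json",
    "wav2vec/wav2vec2-base-960h/feature_extractor_config.json",
    "wav2vec/wav2vec2-base-960h/model.safetensors",
    "wav2vec/wav2vec2-base-960h/preprocessor_config.json",
    "wav2vec/wav2vec2-base-960h/special_tokens_map.json",
    "wav2vec/wav2vec2-base-960h/tokenizer_config.json",
    "wav2vec/wav2vec2-base-960h/vocab.json",
]


def missing_pretrained_files(root="./pretrained_models"):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_pretrained_files():
        print("missing:", rel)
```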

🛠️ Prepare Inference Data

Hallo has a few simple requirements for input data:

For the source image:

  1. It should be cropped to a square.
  2. The face should be the main focus, making up 50%-70% of the image.
  3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).

For the driving audio:

  1. It must be in WAV format.
  2. It must be in English since our training datasets are only in this language.
  3. Ensure the vocals are clear; background music is acceptable.

We have provided some samples for your reference.
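
Two of the requirements above (a square image and WAV audio) can be checked mechanically. The sketch below uses only the Python standard library and is not taken from the repository; `center_square_box` and `is_readable_wav` are hypothetical helper names. A centered crop is only one way to obtain a square and does not guarantee that the face fills 50%-70% of the result. The standard `wave` module mainly handles uncompressed PCM files, so a WAV file in another encoding may be rejected by this check even though it is valid.

```python
import wave


def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def is_readable_wav(path):
    """Return True if the stdlib wave module can parse the file as WAV."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnframes() >= 0
    except (wave.Error, EOFError, OSError):
        # wave.Error: not a RIFF/WAVE file or unsupported encoding;
        # EOFError: truncated file; OSError: missing or unreadable file.
        return False
```

The box returned by `center_square_box` follows the (left, top, right, bottom) convention used by common image libraries, so it can be passed to a crop call in the library of your choice.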

🎮 Run Inference

Run scripts/inference.py, passing --source_image and --driving_audio as input:

python scripts/inference.py --source_image examples/reference_images/1.jpg --driving_audio examples/driving_audios/1.wav

Animation results are saved as ${PROJECT_ROOT}/.cache/output.mp4 by default. Pass --output to specify a different output file name. More inference examples can be found in the examples folder.

For more options:

usage: inference.py [-h] [-c CONFIG] [--source_image SOURCE_IMAGE] [--driving_audio DRIVING_AUDIO] [--output OUTPUT] [--pose_weight POSE_WEIGHT]
                    [--face_weight FACE_WEIGHT] [--lip_weight LIP_WEIGHT] [--face_expand_ratio FACE_EXPAND_RATIO]

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
  --source_image SOURCE_IMAGE
                        source image
  --driving_audio DRIVING_AUDIO
                        driving audio
  --output OUTPUT       output video file name
  --pose_weight POSE_WEIGHT
                        weight of pose
  --face_weight FACE_WEIGHT
                        weight of face
  --lip_weight LIP_WEIGHT
                        weight of lip
  --face_expand_ratio FACE_EXPAND_RATIO
                        face region
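
To animate several image/audio pairs in one go, the command above can be wrapped in a small driver. This is a sketch, not part of the repository: `build_inference_cmd` and `run_batch` are hypothetical names, the output naming scheme is an arbitrary choice, and only the `--source_image`, `--driving_audio`, and `--output` flags from the help text above are used. It assumes the script is started from the project root.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_cmd(source_image, driving_audio, output):
    """Build the argument list for one scripts/inference.py run."""
    return [
        sys.executable, "scripts/inference.py",
        "--source_image", str(source_image),
        "--driving_audio", str(driving_audio),
        "--output", str(output),
    ]


def run_batch(pairs, output_dir=".cache"):
    """Run inference once per (image, audio) pair, stopping on the first failure."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for image, audio in pairs:
        # Hypothetical naming scheme: <image stem>_<audio stem>.mp4
        output = output_dir / f"{Path(image).stem}_{Path(audio).stem}.mp4"
        subprocess.run(build_inference_cmd(image, audio, output), check=True)


if __name__ == "__main__":
    run_batch([("examples/reference_images/1.jpg", "examples/driving_audios/1.wav")])
```

Each run is a separate process, so the models are reloaded for every pair; that keeps the sketch simple at the cost of startup time.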

Training

Prepare Data for Training

The training data consists of talking-face videos. Like the source images used for inference, these videos need to meet the following requirements:

  1. Each video should be cropped to a square.
  2. The face should be the main focus, making up 50%-70% of the image.
  3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).

Organize your raw videos into the following directory structure:

dataset_name/
|-- videos/
|   |-- 0001.mp4
|   |-- 0002.mp4
|   |-- 0003.mp4
|   `-- 0004.mp4

You can use any dataset_name, but ensure the videos directory is named as shown above.
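
Before preprocessing, it can help to confirm that the dataset follows this layout. The sketch below is illustrative only; `list_training_videos` is a hypothetical helper, and it assumes the videos are `.mp4` files as in the tree above.

```python
from pathlib import Path


def list_training_videos(dataset_root):
    """Return the sorted .mp4 files under <dataset_root>/videos."""
    videos_dir = Path(dataset_root) / "videos"
    if not videos_dir.is_dir():
        raise FileNotFoundError(f"expected a 'videos' directory at {videos_dir}")
    return sorted(videos_dir.glob("*.mp4"))


if __name__ == "__main__":
    for video in list_training_videos("dataset_name"):
        print(video)
```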

Next, process the videos with the following commands:

python -m scripts.data_preprocess --input_dir dataset_name/videos --step 1
python -m scripts.data_preprocess --input_dir dataset_name/videos --step 2

Note: Execute steps 1 and 2 sequentially as they perform different tasks. Step 1 converts videos into frames, extracts audio from each video, and generates the necessary masks. Step 2 generates face embeddings using InsightFace and audio embeddings using Wav2Vec, and requires a GPU. For parallel processing, use the -p and -r arguments. The -p argument specifies the total number of instances to launch, dividing the data into p parts. The -r argument specifies which part the current process should handle. You need to manually launch multiple instances with different values for -r.
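
The manual launch of parallel instances described in the note can be scripted by generating one command per part. This is a sketch under one assumption not stated above: that `-r` is counted from 0, so valid values run from 0 to p - 1. Check `python -m scripts.data_preprocess --help` before relying on it. The function name `preprocess_commands` is hypothetical.

```python
import sys


def preprocess_commands(input_dir, step, parallelism):
    """Return one data_preprocess command per part, as argument lists."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        [
            sys.executable, "-m", "scripts.data_preprocess",
            "--input_dir", str(input_dir),
            "--step", str(step),
            "-p", str(parallelism),
            "-r", str(rank),  # assumption: parts are numbered from 0
        ]
        for rank in range(parallelism)
    ]


if __name__ == "__main__":
    # Print the commands so each can be started in its own shell or job slot.
    for cmd in preprocess_commands("dataset_name/videos", step=2, parallelism=4):
        print(" ".join(cmd))
```

Printing the commands instead of running them leaves the choice of how to launch each instance (separate terminals, a job scheduler, one GPU per process) to you.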

Generate the metadata JSON files with the following commands:

python scripts/extract_meta_info_stage1.py -r path/to/dataset -n dataset_name
python scripts/extract_meta_info_stage2.py -r path/to/dataset -n dataset_name

Replace `path/to/dataset

Core symbols most depended-on inside this repo

augmentation · called by 16 · hallo/datasets/talk_video.py
update · called by 16 · hallo/models/mutual_self_attention.py
_augmentation · called by 14 · hallo/datasets/image_processor.py
torch_dfs · called by 12 · hallo/models/mutual_self_attention.py
preprocess · called by 10 · hallo/datasets/image_processor.py
clear · called by 9 · hallo/models/mutual_self_attention.py
encode · called by 8 · hallo/models/wav2vec.py
file_exists · called by 7 · scripts/extract_meta_info_stage2.py

Shape

Method 169
Function 67
Class 53

Languages

Python 100%

Modules by API surface

hallo/models/unet_2d_blocks.py · 33 symbols
hallo/utils/util.py · 32 symbols
hallo/models/unet_3d_blocks.py · 25 symbols
hallo/models/motion_module.py · 21 symbols
hallo/models/unet_2d_condition.py · 16 symbols
hallo/models/resnet.py · 15 symbols
hallo/models/attention.py · 14 symbols
hallo/datasets/image_processor.py · 14 symbols
hallo/models/unet_3d.py · 13 symbols
hallo/animate/face_animate_static.py · 12 symbols
scripts/train_stage2.py · 9 symbols
hallo/animate/face_animate.py · 8 symbols

Dependencies from manifests, versioned

accelerate 0.28.0 · 1×
audio-separator 0.17.2 · 1×
av 12.1.0 · 1×
bitsandbytes 0.43.1 · 1×
decord 0.6.0 · 1×
diffusers 0.27.2 · 1×
einops 0.8.0 · 1×
gradio 4.36.1 · 1×
insightface 0.7.3 · 1×
isort 5.13.2 · 1×
librosa 0.10.2.post1 · 1×
mlflow 2.13.1 · 1×

For agents

$ claude mcp add hallo \
  -- python -m otcore.mcp_server <graph>
