github.com/multimodal-art-projection/YuE @main sqlite

322 symbols 826 edges 24 files 117 documented · 36%

README

    <img src="https://github.com/multimodal-art-projection/YuE/raw/main/assets/logo/白底.svg" width="40%">










<a href="https://map-yue.github.io/">Demo 🎶</a> &nbsp;|&nbsp; 📑 <a href="https://arxiv.org/abs/2503.08638">Paper</a>



<a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-cot">YuE-s1-7B-anneal-en-cot 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl">YuE-s1-7B-anneal-en-icl 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-jp-kr-cot">YuE-s1-7B-anneal-jp-kr-cot 🤗</a>



<a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-jp-kr-icl">YuE-s1-7B-anneal-jp-kr-icl 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-zh-cot">YuE-s1-7B-anneal-zh-cot 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-s1-7B-anneal-zh-icl">YuE-s1-7B-anneal-zh-icl 🤗</a>



<a href="https://huggingface.co/m-a-p/YuE-s2-1B-general">YuE-s2-1B-general 🤗</a> &nbsp;|&nbsp; <a href="https://huggingface.co/m-a-p/YuE-upsampler">YuE-upsampler 🤗</a>

Our model's name is YuE (乐). In Chinese, the word means "music" and "happiness." Some of you may find words that start with Yu hard to pronounce. If so, you can just call it "yeah." We wrote a song with our model's name; see here.

YuE is a groundbreaking series of open-source foundation models designed for music generation, specifically for transforming lyrics into full songs (lyrics2song). It can generate a complete song, lasting several minutes, that includes both a catchy vocal track and accompaniment track. YuE is capable of modeling diverse genres/languages/vocal techniques. Please visit the Demo Page for amazing vocal performance.

News and Updates

📌 Join Us on Discord!
2025.06.04 🔥 Now YuE supports LoRA fine-tuning.
2025.03.12 🔥 Paper Released🎉: We have released the YuE technical report!!! We discuss all the technical details, findings, and lessons learned. Enjoy, and feel free to cite us~
2025.03.11 🫶 Now YuE supports incremental song generation!!! See YuE-UI by joeljuvel. YuE-UI is a Gradio-based interface supporting batch generation, output selection, and continuation. You can flexibly experiment with audio prompts and different model settings, visualize your progress on an interactive timeline, rewind actions, quickly preview audio outputs at stage 1 before committing to refinement, and fully save/load your sessions (JSON format). Optimized to run smoothly even on GPUs with just 8GB VRAM using quantized models.
2025.02.17 🫶 Now YuE supports music continuation and Google Colab! See YuE-extend by Mozer.
2025.02.07 🎉 Get YuE for Windows on pinokio.
2025.01.30 🔥 Inference Update: We now support dual-track ICL mode! You can prompt the model with a reference song, and it will generate a new song in a similar style (voice cloning demo by @abrakjamson, music style transfer demo by @cocktailpeanut, etc.). Try it out! 🔥🔥🔥 P.S. Be sure to check out the demos first—they're truly impressive.
2025.01.30 🔥 Announcement: A New Era Under Apache 2.0 🔥: We are thrilled to announce that, in response to overwhelming requests from our community, YuE is now officially licensed under the Apache 2.0 license. We sincerely hope this marks a watershed moment—akin to what Stable Diffusion and LLaMA have achieved in their respective fields—for music generation and creative AI. 🎉🎉🎉
2025.01.29 🎉: We have updated the license description. We ENCOURAGE artists and content creators to sample and incorporate outputs generated by our model into their own works, and even monetize them. The only requirement is to credit our name: YuE by HKUST/M-A-P (alphabetical order).
2025.01.28 🫶: Thanks to Fahd for creating a tutorial on how to quickly get started with YuE. Here is his demonstration.
2025.01.26 🔥: We have released the YuE series.

TODOs📋

[ ] Support stemgen mode https://github.com/multimodal-art-projection/YuE/issues/21
[ ] Support llama.cpp https://github.com/ggerganov/llama.cpp/issues/11467
[ ] Support transformers tensor parallel. https://github.com/multimodal-art-projection/YuE/issues/7
[ ] Online serving on huggingface space.
[ ] Support vLLM and sglang https://github.com/multimodal-art-projection/YuE/issues/66
[x] Release paper to Arxiv.
[x] Example LoRA finetune code using 🤗 Transformers.
[x] Support Colab: YuE-extend by Mozer
[x] Support gradio interface. https://github.com/multimodal-art-projection/YuE/issues/1
[x] Support dual-track ICL mode.
[x] Fix "instrumental" naming bug in output files. https://github.com/multimodal-art-projection/YuE/pull/26
[x] Support seeding https://github.com/multimodal-art-projection/YuE/issues/20
[x] Allow --repetition_penalty to customize repetition penalty. https://github.com/multimodal-art-projection/YuE/issues/45

Hardware and Performance

GPU Memory

YuE requires significant GPU memory for generating long sequences. Below are the recommended configurations:
- For GPUs with 24GB memory or less: Run up to 2 sessions to avoid out-of-memory (OOM) errors. Thanks to the community, there are YuE-exllamav2 and YuEGP for those with limited GPU resources. While both enhance generation speed and coherence, they may compromise musicality. (P.S. Better prompts & ICL help!)
- For full song generation (many sessions, e.g., 4 or more): Use GPUs with at least 80GB memory, e.g., H800, A100, or multiple RTX4090s with tensor parallel.

To customize the number of sessions, the interface allows you to specify the desired session count. By default, the model runs 2 sessions (1 verse + 1 chorus) to avoid OOM issues.

Execution Time

On an H800 GPU, generating 30s audio takes 150 seconds. On an RTX 4090 GPU, generating 30s audio takes approximately 360 seconds.
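
As a rough worked example, the two figures above can be turned into a compute-time estimate for longer outputs. This is only a sketch: it assumes generation time scales linearly with audio length, which the numbers above do not confirm (longer sequences may well be slower), and the function name `estimate_generation_seconds` is hypothetical.

```python
# Seconds of compute per second of generated audio, taken from the
# figures quoted above (the RTX 4090 figure is approximate).
COMPUTE_SECONDS_PER_AUDIO_SECOND = {
    "H800": 150 / 30,
    "RTX 4090": 360 / 30,
}


def estimate_generation_seconds(audio_seconds: float, gpu: str) -> float:
    """Linear extrapolation of generation time.

    Assumption: cost grows proportionally with audio length.
    """
    return audio_seconds * COMPUTE_SECONDS_PER_AUDIO_SECOND[gpu]


if __name__ == "__main__":
    for gpu in COMPUTE_SECONDS_PER_AUDIO_SECOND:
        minutes = estimate_generation_seconds(180, gpu) / 60
        print(f"{gpu}: roughly {minutes:.0f} min for 3 minutes of audio")
```

Treat the result as a lower bound when planning long runs.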

🪟 Windows Users Quickstart

For a one-click installer, use Pinokio.
To use Gradio with Docker, see: YuE-for-Windows

🐧 Linux/WSL Users Quickstart

For a quick start, watch this video tutorial by Fahd: Watch here.
If you're new to machine learning or the command line, we highly recommend watching this video first.

To use a GUI/Gradio interface, check out:
- YuE-exllamav2-UI
- YuEGP
- YuE-Interface

1. Install environment and dependencies

Make sure to install FlashAttention 2 properly to reduce VRAM usage.

# We recommend using conda to create a new environment.
conda create -n yue python=3.8 # Python >=3.8 is recommended.
conda activate yue
# install cuda >= 11.8
conda install pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install -r <(curl -sSL https://raw.githubusercontent.com/multimodal-art-projection/YuE/main/requirements.txt)

# For saving GPU memory, FlashAttention 2 is mandatory. 
# Without it, long audio may lead to out-of-memory (OOM) errors.
# Be careful about matching the cuda version and flash-attn version
pip install flash-attn --no-build-isolation
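
Before moving on, it can help to confirm that the key packages are actually importable in the new environment. The sketch below only checks that the modules can be found; it does not verify that the CUDA and flash-attn versions match. The helper name `missing_packages` is hypothetical, and the module list is an assumption based on the install commands above.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the packages installed above
    # (the flash-attn package is imported as `flash_attn`).
    required = ["torch", "torchaudio", "flash_attn"]
    missing = missing_packages(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```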

2. Download the infer code and tokenizer

# Make sure you have git-lfs installed (https://git-lfs.com)
# if you don't have root, see https://github.com/git-lfs/git-lfs/issues/4134#issuecomment-1635204943
sudo apt update
sudo apt install git-lfs
git lfs install
git clone https://github.com/multimodal-art-projection/YuE.git

cd YuE/inference/
git clone https://huggingface.co/m-a-p/xcodec_mini_infer

3. Run the inference

Now generate music with YuE using 🤗 Transformers. Make sure steps 1 and 2 are set up properly.

Note:
- Set --run_n_segments to the number of lyric sections if you want to generate a full song.
- You can increase --stage2_batch_size to speed up inference if you have enough GPU memory, but watch out for OOM errors.
- You may customize the prompt in genre.txt and lyrics.txt. See prompt engineering guide here.
- LM ckpts will be automatically downloaded from huggingface.

# This is the CoT mode.
cd YuE/inference/
python infer.py \
    --cuda_idx 0 \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-cot \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt ../prompt_egs/genre.txt \
    --lyrics_txt ../prompt_egs/lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ../output \
    --max_new_tokens 3000 \
    --repetition_penalty 1.1
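
For full-song generation, --run_n_segments should match the number of lyric sections. The sketch below counts the sections in a lyrics string and assembles the same command as above with that count. It is an illustration only: the helper names `count_sections` and `build_infer_command` are hypothetical, and it assumes sections are separated by blank lines as described in the prompt engineering guide.

```python
import shlex


def count_sections(lyrics_text: str) -> int:
    """Count lyric sections, assuming they are separated by blank lines."""
    return len([s for s in lyrics_text.strip().split("\n\n") if s.strip()])


def build_infer_command(genre_path: str, lyrics_path: str, n_segments: int) -> list:
    """Assemble the CoT-mode infer.py call shown above as an argv list."""
    return [
        "python", "infer.py",
        "--cuda_idx", "0",
        "--stage1_model", "m-a-p/YuE-s1-7B-anneal-en-cot",
        "--stage2_model", "m-a-p/YuE-s2-1B-general",
        "--genre_txt", genre_path,
        "--lyrics_txt", lyrics_path,
        "--run_n_segments", str(n_segments),
        "--stage2_batch_size", "4",
        "--output_dir", "../output",
        "--max_new_tokens", "3000",
        "--repetition_penalty", "1.1",
    ]


if __name__ == "__main__":
    # Placeholder lyrics for illustration; in practice read lyrics.txt.
    sample = "[verse]\nline one\nline two\n\n[chorus]\nline three\n"
    cmd = build_infer_command("../prompt_egs/genre.txt",
                              "../prompt_egs/lyrics.txt",
                              count_sections(sample))
    print(shlex.join(cmd))
```

Remember that more segments need more GPU memory, so the section count is an upper bound on what your hardware may handle.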

We also support music in-context learning (ICL), in which you provide a reference song. There are two types: single-track (mix/vocal/instrumental) and dual-track.

Note: ICL requires a different ckpt, e.g. m-a-p/YuE-s1-7B-anneal-en-icl.

Music ICL generally requires a 30s audio segment. The model will write a new song in a style similar to the provided audio, which may improve musicality.
Dual-track ICL works better in general, requiring both vocal and instrumental tracks.
For single-track ICL, you can provide a mix, vocal, or instrumental track.
You can separate the vocal and instrumental tracks using python-audio-separator or Ultimate Vocal Remover GUI.

# This is the dual-track ICL mode.
# To turn on dual-track mode, enable `--use_dual_tracks_prompt`
# and provide `--vocal_track_prompt_path`, `--instrumental_track_prompt_path`, 
# `--prompt_start_time`, and `--prompt_end_time`
# The ref audio is taken from GTZAN test set.
cd YuE/inference/
python infer.py \
    --cuda_idx 0 \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt ../prompt_egs/genre.txt \
    --lyrics_txt ../prompt_egs/lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ../output \
    --max_new_tokens 3000 \
    --repetition_penalty 1.1 \
    --use_dual_tracks_prompt \
    --vocal_track_prompt_path ../prompt_egs/pop.00001.Vocals.mp3 \
    --instrumental_track_prompt_path ../prompt_egs/pop.00001.Instrumental.mp3 \
    --prompt_start_time 0 \
    --prompt_end_time 30

# This is the single-track (mix/vocal/instrumental) ICL mode.
# To turn on single-track ICL, enable `--use_audio_prompt`, 
# and provide `--audio_prompt_path` , `--prompt_start_time`, and `--prompt_end_time`. 
# The ref audio is taken from GTZAN test set.
cd YuE/inference/
python infer.py \
    --cuda_idx 0 \
    --stage1_model m-a-p/YuE-s1-7B-anneal-en-icl \
    --stage2_model m-a-p/YuE-s2-1B-general \
    --genre_txt ../prompt_egs/genre.txt \
    --lyrics_txt ../prompt_egs/lyrics.txt \
    --run_n_segments 2 \
    --stage2_batch_size 4 \
    --output_dir ../output \
    --max_new_tokens 3000 \
    --repetition_penalty 1.1 \
    --use_audio_prompt \
    --audio_prompt_path ../prompt_egs/pop.00001.mp3 \
    --prompt_start_time 0 \
    --prompt_end_time 30
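
The two ICL modes differ only in the extra flags appended to the base command. The sketch below picks the right flag set from the audio files you have. The helper name `icl_flags` is hypothetical, and the validation rules (exactly one mode, end time after start time) are assumptions made for illustration rather than checks known to exist in infer.py.

```python
def icl_flags(start, end, mix=None, vocal=None, instrumental=None):
    """Return the extra infer.py flags for single-track or dual-track ICL."""
    if end <= start:
        raise ValueError("prompt_end_time must be greater than prompt_start_time")
    window = ["--prompt_start_time", str(start), "--prompt_end_time", str(end)]

    # Dual-track mode needs both the vocal and the instrumental stem.
    if vocal and instrumental and mix is None:
        return ["--use_dual_tracks_prompt",
                "--vocal_track_prompt_path", vocal,
                "--instrumental_track_prompt_path", instrumental] + window

    # Single-track mode takes one file: a mix, a vocal, or an instrumental.
    if mix and not (vocal or instrumental):
        return ["--use_audio_prompt", "--audio_prompt_path", mix] + window

    raise ValueError("provide either one single-track file, or both stems")


if __name__ == "__main__":
    print(icl_flags(0, 30,
                    vocal="../prompt_egs/pop.00001.Vocals.mp3",
                    instrumental="../prompt_egs/pop.00001.Instrumental.mp3"))
    print(icl_flags(0, 30, mix="../prompt_egs/pop.00001.mp3"))
```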

Prompt Engineering Guide

The prompt consists of three parts: genre tags, lyrics, and ref audio.

Genre Tagging Prompt

An example genre tagging prompt can be found here.
A stable tagging prompt usually consists of five components: genre, instrument, mood, gender, and timbre. All five should be included if possible, separated by spaces.
Although our tags have an open vocabulary, we have provided the top 200 most commonly used tags. It is recommended to select tags from this list for more stable results.
The order of the tags is flexible. For example, a stable genre tagging prompt might look like: "inspiring female uplifting pop airy vocal electronic bright vocal vocal."
Additionally, we have introduced the "Mandarin" and "Cantonese" tags to distinguish between Mandarin and Cantonese, as their lyrics often share similarities.
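
The five-component structure above can be sketched as a small helper that joins the tags with single spaces. The function name `build_genre_prompt` is hypothetical, and the tag values in the usage example are placeholders; which tag belongs to which component is an assumption, so prefer tags from the provided top-200 list.

```python
def build_genre_prompt(genre="", instrument="", mood="", gender="", timbre=""):
    """Join the tagging components into one space-delimited prompt.

    Empty components are skipped, since all five are recommended
    but not strictly required. Tag order is flexible.
    """
    parts = [genre, instrument, mood, gender, timbre]
    # Collapse any stray whitespace so tags are separated by single spaces.
    return " ".join(" ".join(parts).split())


if __name__ == "__main__":
    prompt = build_genre_prompt(
        genre="pop",
        instrument="electronic",
        mood="uplifting",
        gender="female",
        timbre="airy vocal",
    )
    print(prompt)
```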

Lyrics Prompt

An example lyric prompt can be found here.
We support multiple languages, including but not limited to English, Mandarin Chinese, Cantonese, Japanese, and Korean. The default top language distribution during the annealing phase is revealed in issue 12. A language ID on a specific annealing checkpoint indicates that we have adjusted the mixing ratio to enhance support for that language.
The lyrics prompt should be divided into sessions, with structure labels (e.g., [verse], [chorus], [bridge], [outro]) prepended. Sessions should be separated by two newline characters ("\n\n").
Do NOT put too many words in a single segment, since each session is around 30s (--max_new_tokens 3000 by default).
We find that the [intro] label is less stable, so we recommend starting with [verse] or [chorus].
For generating music with no vocal (instrumental only), see [issue 18](https://github.
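
The session layout described in this section can be sketched as follows. This is a minimal illustration, not the repo's own tooling: the helper names `format_lyrics` and `starts_with_intro` are hypothetical, and placing the label on its own line above the lyric lines is an assumption about the expected layout.

```python
def format_lyrics(sessions):
    """Turn (label, text) pairs into a lyrics prompt.

    Each session becomes "[label]" followed by its lines, and
    sessions are joined with two newline characters.
    """
    blocks = []
    for label, text in sessions:
        # Drop blank lines inside a session: a blank line would be
        # read as a session separator.
        lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
        blocks.append(f"[{label}]\n" + "\n".join(lines))
    return "\n\n".join(blocks)


def starts_with_intro(sessions):
    """True if the first label is the less stable [intro]."""
    return bool(sessions) and sessions[0][0].lower() == "intro"


if __name__ == "__main__":
    # Placeholder lyrics for illustration only.
    song = [("verse", "first line\nsecond line"), ("chorus", "hook line")]
    if starts_with_intro(song):
        print("Consider starting with [verse] or [chorus] instead.")
    print(format_lyrics(song))
```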

Core symbols most depended-on inside this repo

log_single_rank · called by 47 · finetune/core/datasets/utils.py
get · called by 35 · finetune/core/datasets/indexed_dataset.py
get_size_in_bytes · called by 24 · finetune/core/preprocess_data_conditional_xcodec_segment.py
write · called by 17 · finetune/core/datasets/indexed_dataset.py
npy2ids · called by 11 · finetune/core/preprocess_data_conditional_xcodec.py
tokenize · called by 8 · inference/mmtokenizer.py
tokenize · called by 8 · finetune/core/tokenizer/mmtokenizer.py
split · called by 7 · finetune/core/preprocess_data_conditional_xcodec.py

Shape

Method 212

Function 81

Class 29

Languages

Python 100%

Modules by API surface

inference/mmtokenizer.py · 50 symbols
finetune/core/tokenizer/mmtokenizer.py · 50 symbols
finetune/core/datasets/indexed_dataset.py · 40 symbols
finetune/core/preprocess_data_conditional_xcodec.py · 35 symbols
finetune/core/arguments.py · 24 symbols
finetune/core/datasets/gpt_dataset.py · 15 symbols
finetune/core/preprocess_data_conditional_xcodec_segment.py · 14 symbols
inference/codecmanipulator.py · 12 symbols
finetune/tools/codecmanipulator.py · 12 symbols
inference/infer.py · 10 symbols
finetune/core/datasets/megatron_dataset.py · 9 symbols
finetune/scripts/train_lora.py · 8 symbols

Dependencies from manifests, versioned

GitPython 3.1.44 · 1×
Jinja2 3.1.4 · 1×
MarkupSafe 2.1.5 · 1×
PyYAML 6.0.2 · 1×
accelerate 0.26.0 · 1×
annotated-types 0.7.0 · 1×
blobfile 3.0.0 · 1×
certifi 2025.4.26 · 1×
charset-normalizer 3.4.2 · 1×
click 8.2.0 · 1×
deepspeed 0.16.7 · 1×
descript-audiotools 0.7.2 · 1×

For agents

$ claude mcp add YuE \
  -- python -m otcore.mcp_server <graph>
