github.com/meta-pytorch/torchtune @v0.6.1 sqlite

2,872 symbols · 11,936 edges · 417 files · 1,104 documented (38%) · 2 cross-repo links

README

torchtune

📣 Recent updates 📣

February 2025: Multi-node training is officially open for business in torchtune! Full finetune on multiple nodes to take advantage of larger batch sizes and models.
December 2024: torchtune now supports Llama 3.3 70B! Try it out by following our installation instructions here, then run any of the configs here.
November 2024: torchtune has released v0.4.0 which includes stable support for exciting features like activation offloading and multimodal QLoRA
November 2024: torchtune has added Gemma2 to its models!
October 2024: torchtune added support for Qwen2.5 models - find the configs here
September 2024: torchtune has support for Llama 3.2 11B Vision, Llama 3.2 3B, and Llama 3.2 1B models! Try them out by following our installation instructions here, then run any of the text configs here or vision configs here.

Overview 📚

torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs. It provides:

Hackable training recipes for SFT, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training
Simple PyTorch implementations of popular LLMs like Llama, Gemma, Mistral, Phi, Qwen, and more
Best-in-class memory efficiency, performance improvements, and scaling, utilizing the latest PyTorch APIs
YAML configs for easily configuring training, evaluation, quantization or inference recipes

Post-training recipes

torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely use several of the methods below.

Supervised Finetuning (SFT)

Type of Weight Update	1 Device	>1 Device	>1 Node
Full	✅	✅	✅
LoRA/QLoRA	✅	✅	❌

Example: tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device

You can also run e.g. tune ls lora_finetune_single_device for a full list of available configs.

Knowledge Distillation (KD)

Type of Weight Update	1 Device	>1 Device	>1 Node
Full	❌	❌	❌
LoRA/QLoRA	✅	✅	❌

Example: tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed

You can also run e.g. tune ls knowledge_distillation_distributed for a full list of available configs.

Reinforcement Learning / Reinforcement Learning from Human Feedback (RLHF)

Method	Type of Weight Update	1 Device	>1 Device	>1 Node
DPO	Full	❌	✅	❌
	LoRA/QLoRA	✅	✅	❌
PPO	Full	✅	❌	❌
	LoRA/QLoRA	❌	❌	❌
GRPO	Full	🚧	🚧	🚧
	LoRA/QLoRA	❌	❌	❌

Example: tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device

You can also run e.g. tune ls full_dpo_distributed for a full list of available configs.

Quantization-Aware Training (QAT)

Type of Weight Update	1 Device	>1 Device	>1 Node
Full	❌	✅	❌
LoRA/QLoRA	❌	✅	❌

Example: tune run qat_distributed --config llama3_1/8B_qat_lora

You can also run e.g. tune ls qat_distributed for a full list of available configs.

The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.
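
The support tables in this section can also be read as data, which makes the gaps easy to list. The following is a minimal sketch and not part of torchtune: `SUPPORT`, `SCALES`, and `gaps` are hypothetical names, and the entries are transcribed by hand from the tables above (✅ as "yes", ❌ as "no", 🚧 as "wip").

```python
# Hand-transcribed copy of the support tables above (illustrative only;
# SUPPORT, SCALES and gaps() are hypothetical names, not torchtune APIs).
# Each value is (1 device, >1 device, >1 node).
SUPPORT = {
    ("SFT", "Full"): ("yes", "yes", "yes"),
    ("SFT", "LoRA/QLoRA"): ("yes", "yes", "no"),
    ("KD", "Full"): ("no", "no", "no"),
    ("KD", "LoRA/QLoRA"): ("yes", "yes", "no"),
    ("DPO", "Full"): ("no", "yes", "no"),
    ("DPO", "LoRA/QLoRA"): ("yes", "yes", "no"),
    ("PPO", "Full"): ("yes", "no", "no"),
    ("PPO", "LoRA/QLoRA"): ("no", "no", "no"),
    ("GRPO", "Full"): ("wip", "wip", "wip"),
    ("GRPO", "LoRA/QLoRA"): ("no", "no", "no"),
    ("QAT", "Full"): ("no", "yes", "no"),
    ("QAT", "LoRA/QLoRA"): ("no", "yes", "no"),
}
SCALES = ("1 device", ">1 device", ">1 node")


def gaps(method):
    """Return (weight_update, scale) pairs marked unsupported for a method."""
    return [
        (update, scale)
        for (name, update), row in SUPPORT.items()
        if name == method
        for scale, status in zip(SCALES, row)
        if status == "no"
    ]


for method in ("SFT", "KD", "DPO", "PPO", "GRPO", "QAT"):
    print(method, gaps(method))
```

Entries marked "wip" are deliberately not reported as gaps, since the tables list them as in progress.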

Models

For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:

Model	Sizes
Llama3.3	70B [models, configs]
Llama3.2-Vision	11B, 90B [models, configs]
Llama3.2	1B, 3B [models, configs]
Llama3.1	8B, 70B, 405B [models, configs]
Mistral	7B [models, configs]
Gemma2	2B, 9B, 27B [models, configs]
Microsoft Phi4	14B [models, configs]
Microsoft Phi3	Mini [models, configs]
Qwen2.5	0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs]
Qwen2	0.5B, 1.5B, 7B [models, configs]

We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.

Memory and training speed

Below is an example of the memory requirements and training speed for different Llama 3.1 models.

Note: For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.

If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.

Model	Finetuning Method	Runnable On	Peak Memory per GPU	Tokens/sec *
Llama 3.1 8B	Full finetune	1x 4090	18.9 GiB	1650
Llama 3.1 8B	Full finetune	1x A6000	37.4 GiB	2579
Llama 3.1 8B	LoRA	1x 4090	16.2 GiB	3083
Llama 3.1 8B	LoRA	1x A6000	30.3 GiB	4699
Llama 3.1 8B	QLoRA	1x 4090	7.4 GiB	2413
Llama 3.1 70B	Full finetune	8x A100	13.9 GiB **	1568
Llama 3.1 70B	LoRA	8x A100	27.6 GiB	3497
Llama 3.1 405B	QLoRA	8x A100	44.8 GB	653

*= Measured over one full training epoch

**= Uses CPU offload with fused optimizer
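
To pick a starting point for a given GPU, the table above can be filtered by peak memory. This is a small sketch under stated assumptions: `ROWS` and `fits_within` are hypothetical names, the values are copied from the table, and the numbers are measurements for the listed hardware and settings rather than predictions for other setups.

```python
# Rows transcribed from the table above:
# (model, method, hardware, peak memory per GPU, tokens/sec).
# The 405B row is listed in GB rather than GiB in the table; it is kept as listed.
ROWS = [
    ("Llama 3.1 8B", "Full finetune", "1x 4090", 18.9, 1650),
    ("Llama 3.1 8B", "Full finetune", "1x A6000", 37.4, 2579),
    ("Llama 3.1 8B", "LoRA", "1x 4090", 16.2, 3083),
    ("Llama 3.1 8B", "LoRA", "1x A6000", 30.3, 4699),
    ("Llama 3.1 8B", "QLoRA", "1x 4090", 7.4, 2413),
    ("Llama 3.1 70B", "Full finetune", "8x A100", 13.9, 1568),
    ("Llama 3.1 70B", "LoRA", "8x A100", 27.6, 3497),
    ("Llama 3.1 405B", "QLoRA", "8x A100", 44.8, 653),
]


def fits_within(budget, rows=ROWS):
    """Return rows whose listed peak memory per GPU is at most `budget`."""
    return [row for row in rows if row[3] <= budget]


# Example: which listed setups stayed under 24 per GPU?
for model, method, hardware, peak, tps in fits_within(24.0):
    print(f"{model:15} {method:14} {hardware:9} {peak:5.1f} {tps:5d}")
```

Note that the filter looks only at per-GPU memory, so multi-GPU rows such as the 8x A100 runs can appear in the result even though they need eight devices.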

Optimization flags

torchtune exposes a number of levers for memory efficiency and performance. The table below demonstrates the effects of applying some of these techniques sequentially to the Llama 3.2 3B model. Each technique is added on top of the previous one, except for LoRA and QLoRA, which do not use optimizer_in_bwd or AdamW8bit optimizer.

Baseline uses Recipe=full_finetune_single_device, Model=Llama 3.2 3B, Batch size=2, Max sequence length=4096, Precision=bf16, Hardware=A100

Technique	Peak Memory Active (GiB)	% Change Memory vs Previous	Tokens Per Second	% Change Tokens/sec vs Previous
Baseline	25.5	-	2091	-
+ Packed Dataset	60.0	+135.16%	7075	+238.40%
+ Compile	51.0	-14.93%	8998	+27.18%
+ Chunked Cross Entropy	42.9	-15.83%	9174	+1.96%
+ Activation Checkpointing	24.9	-41.93%	7210	-21.41%
+ Fuse optimizer step into backward	23.1	-7.29%	7309	+1.38%
+ Activation Offloading	21.8	-5.48%	7301	-0.11%
+ 8-bit AdamW	17.6	-19.63%	6960	-4.67%
LoRA	8.5	-51.61%	8210	+17.96%
QLoRA	4.6	-45.71%	8035	-2.13%

Compared with the Baseline row, the final row in the table (QLoRA) uses 81.9% less memory with a 284.3% increase in tokens per second.
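
The summary figures can be checked against the table. The sketch below uses only the rounded values from the Baseline and QLoRA rows (25.5 and 4.6 GiB, 2091 and 8035 tokens per second); `pct_change` is a helper name chosen here for illustration. Because the table values are rounded, the recomputed memory figure can differ from the quoted one in the last digit.

```python
def pct_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Rounded values copied from the table above.
baseline_mem, baseline_tps = 25.5, 2091   # Baseline row
final_mem, final_tps = 4.6, 8035          # QLoRA row

mem_change = pct_change(final_mem, baseline_mem)
tps_change = pct_change(final_tps, baseline_tps)

print(f"memory vs Baseline:     {mem_change:+.2f}%")
print(f"tokens/sec vs Baseline: {tps_change:+.2f}%")
```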

Command to reproduce final row.

tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device \
dataset.packed=True \
compile=True \
loss=torchtune.modules.loss.CEWithChunkedOutputLoss \
enable_activation_checkpointing=True \
optimizer_in_bwd=False \
enable_activation_offloading=True \
optimizer=torch.optim.AdamW \
tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
epochs=1 \
batch_size=2
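
When the same run is launched from a script, it can help to keep the overrides in one place and assemble the command from them. The following is a minimal sketch, not a torchtune API: `build_command` is a hypothetical helper, and the recipe name, config name, and `key=value` overrides are copied from the command above. It only builds and prints the command; it does not run it.

```python
import shlex

RECIPE = "lora_finetune_single_device"
CONFIG = "llama3_2/3B_qlora_single_device"

# Overrides copied verbatim from the command above.
OVERRIDES = {
    "dataset.packed": True,
    "compile": True,
    "loss": "torchtune.modules.loss.CEWithChunkedOutputLoss",
    "enable_activation_checkpointing": True,
    "optimizer_in_bwd": False,
    "enable_activation_offloading": True,
    "optimizer": "torch.optim.AdamW",
    "tokenizer.max_seq_len": 4096,
    "gradient_accumulation_steps": 1,
    "epochs": 1,
    "batch_size": 2,
}


def build_command(recipe, config, overrides):
    """Return the argument list for `tune run` with key=value overrides."""
    args = ["tune", "run", recipe, "--config", config]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cmd = build_command(RECIPE, CONFIG, OVERRIDES)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where torchtune is installed.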

Installation 🛠️

torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for finetuning multimodal LLMs and torchao for the latest in quantization techniques; you should install these as well.

Install stable release

# Install stable PyTorch, torchvision, torchao stable releases
pip install torch torchvision torchao
pip install torchtune

Install nightly release

# Install PyTorch, torchvision, torchao nightlies
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu121/cu124/cu126
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu

You can also check out our install documentation for more information, including installing torchtune from source.
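
After either install path, a quick way to see which versions ended up in the environment is to query package metadata from Python. This is a small sketch using the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper name, and the package names are the ones used in the `pip install` commands above.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
PACKAGES = ("torch", "torchvision", "torchao", "torchtune")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```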

To confirm that the package is installed correctly, you can run the following command:

tune --help

You should see the following output:

usage: tune [-h] {ls,cp,download,run,validate} ...

Welcome to the torchtune CLI!

options:
  -h, --help            show this help message and exit

...
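
The same check can be scripted, for example in a setup or CI step. The sketch below is an illustration only: `tune_cli_available` is a hypothetical helper that looks for a `tune` executable on `PATH`, runs it with `--help`, and checks for the `usage: tune` prefix shown in the output above. The 120-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def tune_cli_available(timeout=120):
    """Return True if a `tune` executable is on PATH and `tune --help` succeeds."""
    path = shutil.which("tune")
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "--help"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # The help text shown above starts with "usage: tune".
    return result.returncode == 0 and "usage: tune" in result.stdout


print("tune CLI available:", tune_cli_available())
```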

Get Started 🚀

To get started with torchtune, see our First Finetune Tutorial. Our End-to-End Workflow Tutorial will show you how to evaluate, quantize, and run inference with a Llama model. The rest of this section will provide a quick overview of these steps with Llama3.1.

Downloading a model

Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

To downlo

Core symbols (most depended-on inside this repo)

Symbol	Called by	File
get	363	torchtune/modules/early_exit_loss.py
update	118	torchtune/modules/kv_cache.py
load_state_dict	85	torchtune/training/memory.py
step	59	torchtune/training/_profiler.py
warn	57	torchtune/utils/_logging.py
size	55	torchtune/modules/kv_cache.py
device	41	recipes/eleuther_eval.py
set_seed	37	torchtune/training/seed.py

Shape

Method 1,817

Function 584

Class 422

Route 49

Languages

Python 100%

Modules by API surface

Module	Symbols
torchtune/training/metric_logging.py	49
tests/torchtune/data/test_messages.py	45
recipes/eleuther_eval.py	38
tests/torchtune/modules/test_transformer_decoder.py	34
torchtune/models/flux/_autoencoder.py	33
tests/torchtune/training/test_activation_offloading.py	32
tests/torchtune/modules/model_fusion/test_fusion_layers.py	32
tests/torchtune/modules/peft/test_utils.py	31
tests/torchtune/modules/model_fusion/test_early_fusion.py	30
tests/torchtune/generation/test_generation.py	28
torchtune/modules/transformer.py	27
tests/torchtune/modules/test_layer_dropout.py	27

Used by 2 indexed graphs (manifest dependencies, hub-wide)

github.com/SesameAILabs/csm

github.com/mudler/LocalAI

Dependencies from manifests, versioned

datasets · 1×

sphinx 5.0.0 · 1×

torchdata 0.11.0 · 1×

For agents

$ claude mcp add torchtune \
  -- python -m otcore.mcp_server <graph>
