torchtune is a PyTorch library for easily authoring, post-training, and experimenting with LLMs.
torchtune supports the entire post-training lifecycle. A successfully post-trained model will likely use several of the methods below.
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|---|---|---|---|
| Full | ✅ | ✅ | ✅ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |
Example: tune run lora_finetune_single_device --config llama3_2/3B_lora_single_device
You can also run e.g. tune ls lora_finetune_single_device for a full list of available configs.
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|---|---|---|---|
| Full | ❌ | ❌ | ❌ |
| LoRA/QLoRA | ✅ | ✅ | ❌ |
Example: tune run knowledge_distillation_distributed --config qwen2/1.5B_to_0.5B_KD_lora_distributed
You can also run e.g. tune ls knowledge_distillation_distributed for a full list of available configs.
| Method | Type of Weight Update | 1 Device | >1 Device | >1 Node |
|---|---|---|---|---|
| DPO | Full | ❌ | ✅ | ❌ |
| DPO | LoRA/QLoRA | ✅ | ✅ | ❌ |
| PPO | Full | ✅ | ❌ | ❌ |
| PPO | LoRA/QLoRA | ❌ | ❌ | ❌ |
| GRPO | Full | 🚧 | 🚧 | 🚧 |
| GRPO | LoRA/QLoRA | ❌ | ❌ | ❌ |
Example: tune run lora_dpo_single_device --config llama3_1/8B_dpo_single_device
You can also run e.g. tune ls full_dpo_distributed for a full list of available configs.
| Type of Weight Update | 1 Device | >1 Device | >1 Node |
|---|---|---|---|
| Full | ❌ | ✅ | ❌ |
| LoRA/QLoRA | ❌ | ✅ | ❌ |
Example: tune run qat_distributed --config llama3_1/8B_qat_lora
You can also run e.g. tune ls qat_distributed for a full list of available configs.
The above configs are just examples to get you started. The full list of recipes can be found here. If you'd like to work on one of the gaps you see, please submit a PR! If there's an entirely new post-training method you'd like to see implemented in torchtune, feel free to open an Issue.
For the above recipes, torchtune supports many state-of-the-art models available on the Hugging Face Hub or Kaggle Hub. Some of our supported models:
| Model | Sizes |
|---|---|
| Llama3.3 | 70B [models, configs] |
| Llama3.2-Vision | 11B, 90B [models, configs] |
| Llama3.2 | 1B, 3B [models, configs] |
| Llama3.1 | 8B, 70B, 405B [models, configs] |
| Mistral | 7B [models, configs] |
| Gemma2 | 2B, 9B, 27B [models, configs] |
| Microsoft Phi4 | 14B [models, configs] |
| Microsoft Phi3 | Mini [models, configs] |
| Qwen2.5 | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B [models, configs] |
| Qwen2 | 0.5B, 1.5B, 7B [models, configs] |
We're always adding new models, but feel free to file an issue if there's a new one you would like to see in torchtune.
Below is an example of the memory requirements and training speed for different Llama 3.1 models.
> [!NOTE]
> For ease of comparison, all the below numbers are provided for batch size 2 (without gradient accumulation), a dataset packed to sequence length 2048, and torch compile enabled.
If you are interested in running on different hardware or with different models, check out our documentation on memory optimizations here to find the right setup for you.
| Model | Finetuning Method | Runnable On | Peak Memory per GPU | Tokens/sec * |
|---|---|---|---|---|
| Llama 3.1 8B | Full finetune | 1x 4090 | 18.9 GiB | 1650 |
| Llama 3.1 8B | Full finetune | 1x A6000 | 37.4 GiB | 2579 |
| Llama 3.1 8B | LoRA | 1x 4090 | 16.2 GiB | 3083 |
| Llama 3.1 8B | LoRA | 1x A6000 | 30.3 GiB | 4699 |
| Llama 3.1 8B | QLoRA | 1x 4090 | 7.4 GiB | 2413 |
| Llama 3.1 70B | Full finetune | 8x A100 | 13.9 GiB ** | 1568 |
| Llama 3.1 70B | LoRA | 8x A100 | 27.6 GiB | 3497 |
| Llama 3.1 405B | QLoRA | 8x A100 | 44.8 GB | 653 |
*= Measured over one full training epoch
**= Uses CPU offload with fused optimizer
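
To turn the throughput column into a rough time-per-step figure, the sketch below multiplies the batch size and packed sequence length stated for these benchmarks (2 × 2048 tokens per step) and divides by tokens per second. It assumes every packed sample is exactly 2048 tokens and that the reported tokens/sec refers to a single device, so it only covers the single-GPU rows. `seconds_per_step` is a hypothetical helper name, not part of torchtune.

```python
# Rough time-per-step estimate from the benchmark table (single-GPU rows only).
# Assumption: every packed sample is exactly 2048 tokens, so one step
# processes batch_size * seq_len tokens.

BATCH_SIZE = 2
SEQ_LEN = 2048

# (finetuning method, hardware) -> tokens/sec, copied from the table above
TOKENS_PER_SEC = {
    ("Full finetune", "1x 4090"): 1650,
    ("Full finetune", "1x A6000"): 2579,
    ("LoRA", "1x 4090"): 3083,
    ("LoRA", "1x A6000"): 4699,
    ("QLoRA", "1x 4090"): 2413,
}


def seconds_per_step(tokens_per_sec, batch_size=BATCH_SIZE, seq_len=SEQ_LEN):
    """Hypothetical helper: seconds to process one batch at a given throughput."""
    return batch_size * seq_len / tokens_per_sec


if __name__ == "__main__":
    for (method, hardware), tps in TOKENS_PER_SEC.items():
        print(f"Llama 3.1 8B {method} on {hardware}: "
              f"{seconds_per_step(tps):.2f} s/step")
```

Multiplying the result by the number of steps in an epoch gives a first-order wall-clock estimate; data loading and checkpointing overhead are not modeled.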
torchtune exposes a number of levers for memory efficiency and performance. The table below demonstrates the effects of applying some of these techniques sequentially to the Llama 3.2 3B model. Each technique is added on top of the previous one, except for LoRA and QLoRA, which do not use optimizer_in_bwd or the AdamW8bit optimizer.
Baseline uses Recipe=full_finetune_single_device, Model=Llama 3.2 3B, Batch size=2, Max sequence length=4096, Precision=bf16, Hardware=A100
| Technique | Peak Memory Active (GiB) | % Change Memory vs Previous | Tokens Per Second | % Change Tokens/sec vs Previous |
|---|---|---|---|---|
| Baseline | 25.5 | - | 2091 | - |
| + Packed Dataset | 60.0 | +135.16% | 7075 | +238.40% |
| + Compile | 51.0 | -14.93% | 8998 | +27.18% |
| + Chunked Cross Entropy | 42.9 | -15.83% | 9174 | +1.96% |
| + Activation Checkpointing | 24.9 | -41.93% | 7210 | -21.41% |
| + Fuse optimizer step into backward | 23.1 | -7.29% | 7309 | +1.38% |
| + Activation Offloading | 21.8 | -5.48% | 7301 | -0.11% |
| + 8-bit AdamW | 17.6 | -19.63% | 6960 | -4.67% |
| LoRA | 8.5 | -51.61% | 8210 | +17.96% |
| QLoRA | 4.6 | -45.71% | 8035 | -2.13% |
Compared with the Baseline row, the final row in the table (QLoRA) uses 81.9% less memory with a 284.3% increase in tokens per second.
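
The summary figures can be sanity-checked from the rounded table values. The sketch below computes the percent change of the QLoRA row against both the Baseline row and the + Packed Dataset row. With these rounded inputs, the 81.9% and 284.3% figures line up with the Baseline row (roughly 82% less memory and 284% more tokens per second); small deviations are expected because the table was presumably computed from unrounded measurements. `pct_change` is a helper defined here for illustration only.

```python
# Recompute relative changes from the rounded values in the table above.

def pct_change(new, old):
    """Percent change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# (peak active memory in GiB, tokens per second), copied from the table
baseline = (25.5, 2091)
packed = (60.0, 7075)
qlora = (4.6, 8035)

mem_vs_baseline = pct_change(qlora[0], baseline[0])
tps_vs_baseline = pct_change(qlora[1], baseline[1])
mem_vs_packed = pct_change(qlora[0], packed[0])
tps_vs_packed = pct_change(qlora[1], packed[1])

print(f"QLoRA vs Baseline:         memory {mem_vs_baseline:+.1f}%, "
      f"tokens/sec {tps_vs_baseline:+.1f}%")
print(f"QLoRA vs + Packed Dataset: memory {mem_vs_packed:+.1f}%, "
      f"tokens/sec {tps_vs_packed:+.1f}%")
```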
Command to reproduce final row.
tune run lora_finetune_single_device --config llama3_2/3B_qlora_single_device \
dataset.packed=True \
compile=True \
loss=torchtune.modules.loss.CEWithChunkedOutputLoss \
enable_activation_checkpointing=True \
optimizer_in_bwd=False \
enable_activation_offloading=True \
optimizer=torch.optim.AdamW \
tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
epochs=1 \
batch_size=2
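
The command above passes config overrides as `key=value` pairs after the `--config` argument. When sweeping over several settings, it can help to assemble that argument list programmatically. The following is a minimal sketch using only the Python standard library; `build_tune_command` is a hypothetical helper, not a torchtune API, and the override keys are copied verbatim from the command above.

```python
import shlex


def build_tune_command(recipe, config, overrides):
    """Hypothetical helper: assemble `tune run` arguments with key=value overrides."""
    args = ["tune", "run", recipe, "--config", config]
    # Python renders True/False as "True"/"False", matching the command above.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


overrides = {
    "dataset.packed": True,
    "compile": True,
    "loss": "torchtune.modules.loss.CEWithChunkedOutputLoss",
    "enable_activation_checkpointing": True,
    "optimizer_in_bwd": False,
    "enable_activation_offloading": True,
    "optimizer": "torch.optim.AdamW",
    "tokenizer.max_seq_len": 4096,
    "gradient_accumulation_steps": 1,
    "epochs": 1,
    "batch_size": 2,
}

command = build_tune_command(
    "lora_finetune_single_device",
    "llama3_2/3B_qlora_single_device",
    overrides,
)

# Print a shell-safe, copy-pasteable version of the command.
print(shlex.join(command))
```

With torchtune installed, the resulting list could be handed to `subprocess.run(command, check=True)` to launch the run.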
torchtune is tested with the latest stable PyTorch release as well as the preview nightly version. torchtune leverages torchvision for finetuning multimodal LLMs and torchao for the latest in quantization techniques; you should install these as well.
# Install stable releases of PyTorch, torchvision, and torchao
pip install torch torchvision torchao
pip install torchtune
# Install PyTorch, torchvision, torchao nightlies
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu126 # full options are cpu/cu118/cu121/cu124/cu126
pip install --pre --upgrade torchtune --extra-index-url https://download.pytorch.org/whl/nightly/cpu
You can also check out our install documentation for more information, including installing torchtune from source.
To confirm that the package is installed correctly, you can run the following command:
tune --help
You should see the following output:
usage: tune [-h] {ls,cp,download,run,validate} ...
Welcome to the torchtune CLI!
options:
-h, --help show this help message and exit
...
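
As a complementary check from Python, the sketch below reports which of the packages mentioned in this section are installed and at what version. It relies only on the standard library's `importlib.metadata`; `installed_versions` is a helper name chosen for this example, and the package names are assumed to match the names used in the `pip install` commands above.

```python
from importlib import metadata

# Distribution names as used in the pip commands above.
PACKAGES = ("torch", "torchvision", "torchao", "torchtune")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```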
To get started with torchtune, see our First Finetune Tutorial. Our End-to-End Workflow Tutorial will show you how to evaluate, quantize, and run inference with a Llama model. The rest of this section will provide a quick overview of these steps with Llama3.1.
Follow the instructions on the official meta-llama repository to ensure you have access to the official Llama model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.
To downlo