github.com/huggingface/trl @v1.7.0 sqlite

3,063 symbols · 15,815 edges · 356 files · 1,032 documented (34%)
README

TRL - Transformers Reinforcement Learning

A comprehensive library to post-train foundation models

🎉 What's New

TRL v1: We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the blog post to learn more.

🚢 Harbor: We now support Harbor — train agents against sandboxed task suites (instruction + sandbox image + in-sandbox verifier) via GRPOTrainer's environment_factory.

Overview

TRL is a library for post-training foundation models with techniques such as Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the 🤗 Transformers ecosystem, TRL supports a variety of model architectures and modalities and can be scaled up across various hardware setups.

Highlights

  • Trainers: Various fine-tuning methods are easily accessible via trainers like SFTTrainer, GRPOTrainer, DPOTrainer, RewardTrainer and more.

  • Efficient and scalable:

      • Leverages 🤗 Accelerate to scale from a single GPU to multi-node clusters using methods like DDP and DeepSpeed.
      • Full integration with 🤗 PEFT enables training large models on modest hardware via quantization and LoRA/QLoRA.
      • Integrates 🦥 Unsloth to accelerate training with optimized kernels.

  • Command Line Interface (CLI): A simple interface lets you fine-tune models without writing any code.
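
To give a feel for why LoRA makes large models trainable on modest hardware, the sketch below compares the number of trainable parameters in one full weight matrix with the two low-rank factors LoRA trains in its place. The function names and the layer size (4096 × 4096, rank 16) are illustrative assumptions, not values taken from TRL or PEFT.

```python
# Back-of-the-envelope LoRA arithmetic (illustrative sketch, not a PEFT API).
# LoRA freezes a weight matrix W of shape (d_out, d_in) and trains two small
# factors instead: A with shape (rank, d_in) and B with shape (d_out, rank).

def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of the two low-rank LoRA factors."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for the example.
d_in, d_out, rank = 4096, 4096, 16
fraction = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"LoRA trains {fraction:.2%} of this layer's parameters")
```

With these assumed sizes the adapter holds 1/128 of the layer's parameters, which is where most of the optimizer-state and gradient memory savings come from.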

Installation

Python Package

Install the library using pip:

pip install trl

From source

If you want to use the latest features before an official release, you can install TRL from source:

pip install git+https://github.com/huggingface/trl.git

Repository

If you want to use the examples, you can clone the repository with the following command:

git clone https://github.com/huggingface/trl.git

Quick Start

For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.

SFTTrainer

Here is a basic example of how to use the SFTTrainer:

from trl import SFTTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()

GRPOTrainer

GRPOTrainer implements the Group Relative Policy Optimization (GRPO) algorithm, which is more memory-efficient than PPO and was used to train DeepSeek AI's R1.

from datasets import load_dataset
from trl import GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

Note: For reasoning models, use the reasoning_accuracy_reward() function for better results.
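
The "group relative" part of GRPO can be illustrated in a few lines: several completions are sampled for the same prompt, each is scored by the reward function, and every reward is normalized against the others in its group. That group baseline replaces the separate value model PPO needs, which is where the memory saving comes from. The sketch below is a plain-Python illustration under stated assumptions: the function name, the `eps` value, and the use of the population standard deviation are illustrative choices, not TRL's exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids dividing by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt
# (1.0 = correct answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest a negative one; a group where every completion scores the same contributes no learning signal.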

DPOTrainer

DPOTrainer implements the popular Direct Preference Optimization (DPO) algorithm that was used to post-train Llama 3 and many other models. Here is a basic example of how to use the DPOTrainer:

from datasets import load_dataset
from trl import DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
)
trainer.train()
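
As a rough illustration of the objective behind DPO, the sketch below computes the loss for a single preference pair from summed log-probabilities in plain Python. The function names, argument names, and the default `beta=0.1` are assumptions made for this example and are not TRL's API; the trainer itself operates on batched tensors.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are log-probabilities of each full response under the policy
    being trained and under the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Hypothetical log-probabilities: the policy has moved toward the chosen
# response and away from the rejected one, relative to the reference.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.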

RewardTrainer

Here is a basic example of how to use the RewardTrainer:

from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()

Command Line Interface (CLI)

You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):

SFT:

trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara \
    --output_dir Qwen2.5-0.5B-SFT

DPO:

trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO 

Read more about the CLI in the relevant documentation section, or use --help for more details.
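
If you want to launch the same CLI runs from a script (for example, to sweep over models), one option is to build the argument list in Python and hand it to `subprocess`. The sketch below uses only the subcommands and flags shown above; the helper name `build_trl_command` is hypothetical, and the `subprocess.run` call is left commented out so that nothing is launched by accident.

```python
import shlex


def build_trl_command(method, model, dataset, output_dir):
    """Return the argument list for a `trl sft` or `trl dpo` run."""
    return [
        "trl", method,
        "--model_name_or_path", model,
        "--dataset_name", dataset,
        "--output_dir", output_dir,
    ]


cmd = build_trl_command(
    "sft", "Qwen/Qwen2.5-0.5B", "trl-lib/Capybara", "Qwen2.5-0.5B-SFT"
)
# Show the shell-equivalent command for logging or copy-pasting.
print(shlex.join(cmd))

# To actually start training (requires `pip install trl`):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell-quoting problems when model or dataset names contain unusual characters.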

Development

If you want to contribute to trl or customize it to your needs, read the contribution guide and make a dev install:

git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e ".[dev]"

Experimental

A minimal incubation area is available under trl.experimental for unstable / fast-evolving features. Anything there may change or be removed in any release without notice.

Example:

from trl.experimental.new_trainer import NewTrainer

Read more in the Experimental docs.
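
Because anything under trl.experimental may move or disappear between releases, code that depends on it may want to guard the import rather than fail at startup. The following is a minimal sketch of that pattern using only the standard library; `optional_import` is a hypothetical helper, and `new_trainer` / `NewTrainer` are the placeholder names from the example above, not a real module.

```python
import importlib


def optional_import(module_path, attr_name):
    """Return `attr_name` from `module_path`, or None if it is unavailable.

    Covers both a missing module (ImportError) and a module that no
    longer exposes the requested name.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# Placeholder names taken from the example above.
NewTrainer = optional_import("trl.experimental.new_trainer", "NewTrainer")
if NewTrainer is None:
    print("Experimental trainer unavailable; falling back to a stable trainer.")
```

Pinning the exact TRL version in your requirements is the more robust safeguard; the guarded import mainly gives a clearer failure message.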

Citation

@software{vonwerra2020trl,
  title   = {{TRL: Transformers Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = {2020}
}

License

This repository's source code is available under the Apache-2.0 License.

Core symbols most depended-on inside this repo

from_pretrained · called by 548 · trl/experimental/ppo/modeling_value_head.py
train · called by 402 · trl/experimental/ppo/ppo_trainer.py
pad · called by 122 · trl/trainer/utils.py
push_to_hub · called by 101 · trl/experimental/ppo/modeling_value_head.py
log · called by 58 · trl/trainer/dpo_trainer.py
check_transformers_version · called by 55 · scripts/generate_tiny_models/_common.py
push_to_hub · called by 55 · scripts/generate_tiny_models/_common.py
smoke_test · called by 53 · scripts/generate_tiny_models/_common.py

Shape

Method 2,109
Function 529
Class 390
Route 35

Languages

Python 100%

Modules by API surface

tests/test_sft_trainer.py · 135 symbols
tests/test_grpo_trainer.py · 130 symbols
tests/test_utils.py · 119 symbols
tests/test_vllm_client_server.py · 75 symbols
tests/test_data_utils.py · 68 symbols
tests/test_rloo_trainer.py · 66 symbols
tests/test_skills.py · 60 symbols
tests/experimental/test_gold_trainer.py · 58 symbols
tests/test_dpo_trainer.py · 52 symbols
tests/experimental/test_ppo_trainer.py · 52 symbols
tests/experimental/test_distillation_trainer.py · 50 symbols
trl/experimental/sdpo/sdpo_trainer.py · 48 symbols

Dependencies from manifests, versioned

accelerate 1.4.0 · 1×
datasets 4.7.0 · 1×
jinja2
transformers 4.56.2 · 1×
use_json

For agents

$ claude mcp add trl \
  -- python -m otcore.mcp_server <graph>