TorchRL is a PyTorch-native toolkit for reinforcement learning, decision making, robotics, and simulation. It is not a single algorithm implementation or a narrow benchmark suite: it is a collection of composable pieces for building RL systems while keeping the code close to the PyTorch programming model. Recent releases have strengthened support for recurrent RL, MuJoCo-based control, multi-agent training, replay-buffer and collector infrastructure, and reusable loss and value-estimation components.
The library is built around three ideas:
That common data model is TensorDict,
a dictionary-like tensor container with PyTorch operations, device transfers,
shared-memory support, memmaps, lazy views, and nn.Module wrappers.
TorchRL 0.13 and the preceding development cycle bring several user-visible improvements that are worth surfacing up front:
- MultiAgentGAE, value-normalization utilities, and mixer configs;
- ActionScaling, FlattenAction, NextObservationDelta, compact shifted estimators, and chunked forwards.

TorchRL represents an RL interaction as a TensorDict that moves through a small number of reusable components:
TensorDict
-> policy module writes actions and log-probs
-> environment reads actions and writes next observations, rewards, done flags
-> collector batches trajectories from one or many workers
-> replay buffer stores, samples, prioritizes, and transforms data
-> loss module reads named keys and writes differentiable losses
-> optimizer updates ordinary PyTorch parameters
The same object can carry observations, pixels, actions, rewards, masks, recurrent states, agent groups, sampled indices, priorities, or custom task fields. The result is less glue code and fewer hidden assumptions about what each algorithm or environment returns.
A local rollout is just a TensorDict passed between a PyTorch module and an environment:
import torch
from tensordict.nn import TensorDictModule
from torch import nn
from torchrl.envs import PendulumEnv, StepCounter, TransformedEnv
# A PyTorch-native environment with an ordinary transform stack.
env = TransformedEnv(PendulumEnv(), StepCounter(max_steps=200))
# Policies are regular nn.Modules wrapped with explicit TensorDict keys.
policy = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(64),
        nn.Tanh(),
        nn.Linear(64, 1),
        nn.Tanh(),
    ),
    in_keys=["observation"],
    out_keys=["action"],
)
rollout = env.rollout(max_steps=32, policy=policy)
assert rollout.batch_size == torch.Size([32])
assert rollout["next", "reward"].shape[:1] == torch.Size([32])
Nothing in this pattern is specific to Pendulum. The same keys-and-TensorDict interface is used by batched environments, multi-agent tasks, collectors, replay buffers, recurrent modules, transforms, and losses.
RL code tends to accumulate special cases: tuples from one environment, dicts from another, separate arrays for recurrent states, masks next to data rather than inside it, and losses that silently assume a particular batch layout. TorchRL uses TensorDict to make those assumptions explicit.
TensorDict supports common tensor operations while preserving named fields:
# These operations preserve the structure and operate on every compatible value.
batch = torch.stack(list_of_tensordicts, dim=0)
batch = batch.reshape(-1)
batch = batch.to("cuda")
mini_batch = batch[:128]
# Nested keys make multi-agent, recurrent, and next-state data explicit.
reward = batch["next", "reward"]
agent_obs = batch["agents", "observation"]
hidden = batch["recurrent_state", "h"]
This is the reason TorchRL components compose: a collector can emit a TensorDict, a replay buffer can store it without losing structure, a transform can add or remove keys, and a loss can read exactly the keys it needs.
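The composition argument above can be illustrated without any library. The following is a framework-free sketch, not TorchRL code: plain Python dicts stand in for TensorDict, and `toy_policy`, `toy_env_step`, `add_step_count`, and `toy_loss` are hypothetical helpers invented for this illustration. Each stage reads only the keys it declares and leaves the rest of the record untouched.

```python
# Plain dicts stand in for TensorDict; every helper here is hypothetical.

def toy_policy(td):
    # Reads "observation", writes "action".
    td["action"] = -0.5 * td["observation"]
    return td

def toy_env_step(td):
    # Reads "action", writes results under the nested "next" entry.
    next_obs = td["observation"] + td["action"]
    td["next"] = {"observation": next_obs, "reward": -abs(next_obs), "done": False}
    return td

def add_step_count(td, count):
    # A transform adds a key without the other stages knowing about it.
    td["step_count"] = count
    return td

def toy_loss(batch):
    # The loss reads only the nested reward and ignores everything else.
    rewards = [td["next"]["reward"] for td in batch]
    return -sum(rewards) / len(rewards)

buffer = []  # stand-in for a replay buffer
obs = 1.0
for t in range(4):
    td = toy_policy({"observation": obs})
    td = add_step_count(toy_env_step(td), t)
    buffer.append(td)  # stored with its full structure
    obs = td["next"]["observation"]

loss = toy_loss(buffer)
```

Adding `step_count` required no change to the policy, the buffer, or the loss, which is the property the key-based contract is meant to provide.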
TorchRL includes native environments, wrappers for popular environment libraries, and vectorized containers for running many environments at once. The environment API exposes specs for observations, actions, rewards, and done flags, so policies and transforms can check shapes, devices, dtypes, and bounds before a training job runs for hours.
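What an up-front spec check buys can be shown with a small stand-in. This sketch assumes a hypothetical `BoundedSpec` dataclass with illustrative `is_in` and `project` methods. It is not TorchRL's spec implementation and covers only a one-dimensional shape and value bounds, but it shows how a malformed action can be rejected or clamped before any environment step runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedSpec:
    """Hypothetical stand-in for a bounded spec: a 1-D size plus value bounds."""
    size: int
    low: float
    high: float

    def is_in(self, value):
        # A value conforms if it has the declared length and every entry is in bounds.
        return len(value) == self.size and all(self.low <= v <= self.high for v in value)

    def project(self, value):
        # Clamp out-of-bounds entries back into the valid range.
        return [min(max(v, self.low), self.high) for v in value]

# Made-up bounds for a single-dimensional continuous action.
action_spec = BoundedSpec(size=1, low=-2.0, high=2.0)

good = [0.3]
too_large = [3.5]
wrong_shape = [0.1, 0.2]

checks = [action_spec.is_in(a) for a in (good, too_large, wrong_shape)]
clamped = action_spec.project(too_large)
```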
Environment support includes:
- PendulumEnv and custom MuJoCo tasks.
- SerialEnv, ParallelEnv, and batched wrappers for local vectorization and multiprocessing.

Transforms are first-class TorchRL modules. They can run on-device, participate in specs, and be inserted, removed, or composed without wrapping the whole environment in opaque adapter layers.
from torchrl.envs import Compose, DoubleToFloat, ObservationNorm, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
base_env = GymEnv("HalfCheetah-v4", device="cuda:0")
env = TransformedEnv(
    base_env,
    Compose(
        # Without explicit loc/scale, the normalization statistics must be
        # initialized (for example with init_stats) before the env is stepped.
        ObservationNorm(in_keys=["observation"]),
        DoubleToFloat(),
    ),
)
Collectors are the bridge between policies and environments. A collector owns the execution loop, batches trajectories, handles devices, and can update policy weights while environments keep running.
TorchRL includes single-process, async, multiprocess, and distributed collectors. This lets the same policy and loss code be used across small smoke tests, GPU-heavy simulation, CPU environment farms, or asynchronous evaluation setups.
from torchrl.collectors import Collector
collector = Collector(
    create_env_fn=env,
    policy=policy,
    frames_per_batch=1024,
    total_frames=1_000_000,
)
for data in collector:
    # data is a TensorDict with time, environment, and key structure preserved.
    train_step(data)  # train_step is a user-defined update function
For larger jobs, the collector family adds async execution, multiple worker processes, weight updaters, evaluator loops, profiling hooks, and fake-data helpers for testing downstream code without stepping an expensive environment.
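The relationship between `frames_per_batch`, `total_frames`, and the number of training iterations is worth working through once. The sketch below uses a hypothetical `toy_collector` generator, not TorchRL's collector, and assumes that only full batches are yielded; how the real collector treats a total that is not a multiple of the batch size should be checked against its documentation.

```python
import math

def toy_collector(step_fn, frames_per_batch, total_frames):
    """Hypothetical collector loop: yields lists of exactly frames_per_batch frames."""
    collected = 0
    while collected < total_frames:
        batch = [step_fn(collected + i) for i in range(frames_per_batch)]
        collected += frames_per_batch
        yield batch

frames_per_batch = 1024
total_frames = 1_000_000

# Number of iterations the training loop will see under the full-batch assumption.
num_batches = math.ceil(total_frames / frames_per_batch)

# Run the loop with a trivial step function that returns the frame index.
seen = 0
frames = 0
for batch in toy_collector(lambda i: i, frames_per_batch, total_frames):
    seen += 1
    frames += len(batch)
```

Because 1,000,000 is not a multiple of 1024, this sketch overshoots the requested total slightly. Choosing a `total_frames` that is a multiple of `frames_per_batch` avoids the ambiguity.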
TorchRL replay buffers are modular: storage, sampler, writer, collate function, transforms, prefetching, priority updates, and device movement are separate pieces. That makes it possible to use the same interface for simple in-memory replay, memmap-backed storage, prioritized replay, CUDA-aware sampling, offline datasets, HER, or custom storage layouts.
from torchrl.data import LazyMemmapStorage, TensorDictPrioritizedReplayBuffer
buffer = TensorDictPrioritizedReplayBuffer(
    storage=LazyMemmapStorage(1_000_000),
    alpha=0.7,
    beta=0.5,
    batch_size=256,
    prefetch=2,
)
# collector_batch is a TensorDict produced by a collector.
buffer.extend(collector_batch)
sample = buffer.sample()
Replay buffers understand TensorDict structure, so they can store trajectories, nested agent data, recurrent states, HER relabeling metadata, or offline datasets without flattening everything into parallel Python containers.
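To make the `alpha` and `beta` arguments of the prioritized buffer concrete, here is a minimal sketch of the standard proportional prioritized-replay arithmetic in plain Python. It is the textbook formulation and is assumed, not confirmed, to match what the TorchRL sampler computes internally; the helper names and priority values are made up for illustration.

```python
import random

def sampling_probabilities(priorities, alpha):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling.
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta):
    # w_i = (N * P(i))^(-beta), normalized by the largest weight.
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]

priorities = [0.1, 1.0, 2.0, 4.0]  # e.g. absolute TD errors (made-up values)
probs = sampling_probabilities(priorities, alpha=0.7)
weights = importance_weights(probs, beta=0.5)

# Draw a batch of indices in proportion to the sampling probabilities.
rng = random.Random(0)
indices = rng.choices(range(len(priorities)), weights=probs, k=256)
```

High-priority items are sampled more often, and the importance weights shrink their contribution to the loss so the resulting bias is partly corrected.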
TorchRL modules are ordinary PyTorch modules with explicit input and output keys. The library provides actors, critics, actor-critic operators, recurrent modules, distribution wrappers, exploration modules, world models, decision transformers, robot-learning models, and helper utilities for inferring specs from environments.
A stochastic actor can be assembled from familiar PyTorch layers:
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torch import nn
from torchrl.modules import ProbabilisticActor, TanhNormal
params = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(256),
        nn.Tanh(),
        nn.Linear(256, 2),
        NormalParamExtractor(),
    ),
    in_keys=["observation"],
    out_keys=["loc", "scale"],
)
actor = ProbabilisticActor(
    params,
    in_keys=["loc", "scale"],
    out_keys=["action"],
    distribution_class=TanhNormal,
    distribution_kwargs={"low": -1.0, "high": 1.0},
    return_log_prob=True,
)
The explicit key contract makes it clear what data a module consumes and produces, and it allows losses, collectors, and transforms to be reconfigured without editing the model itself.
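For readers unfamiliar with tanh-squashed Gaussian policies, the sketch below shows the arithmetic that a bounded stochastic actor with a returned log-probability relies on: sample from a Gaussian, squash with `tanh`, rescale to the bounds, and correct the log-probability for the change of variables. It is a plain-Python illustration with hypothetical helper names, and the library's `TanhNormal` may use a numerically safer formulation.

```python
import math
import random

def normal_log_prob(x, loc, scale):
    # Log-density of a univariate Gaussian.
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

def sample_tanh_normal(loc, scale, low, high, rng):
    """Sample a bounded action and its log-probability (illustrative helper)."""
    u = rng.gauss(loc, scale)                # unbounded pre-activation
    t = math.tanh(u)                         # squashed into (-1, 1)
    half_range = 0.5 * (high - low)
    action = low + half_range * (t + 1.0)    # affine map into (low, high)
    # Change of variables: subtract log|d action / d u| = log(half_range * (1 - t^2)).
    log_prob = normal_log_prob(u, loc, scale) - math.log(half_range * (1.0 - t * t))
    return action, log_prob

rng = random.Random(0)
samples = [sample_tanh_normal(0.0, 1.0, -1.0, 1.0, rng) for _ in range(1000)]
actions = [a for a, _ in samples]
log_probs = [lp for _, lp in samples]
```

The naive correction term becomes unstable when `tanh` saturates, which is one reason a production implementation would not compute it this way.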
TorchRL objectives are loss modules that read TensorDict keys, compute losses, and expose configurable key mappings. They cover policy-gradient methods, actor-critic algorithms, Q-learning, offline RL, imitation learning, model-based RL, and multi-agent RL.
Examples include PPO, SAC, DQN, TD3, REDQ, IQL, CQL, Decision Transformer, Dreamer, CrossQ, GAIL, behavior cloning, ACT, MAPPO, IPPO, and QMIX/VDN. Value-estimator utilities provide GAE, TD(lambda), V-trace, lambda returns, multi-agent advantages, and vectorized return computation.
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE
loss = ClipPPOLoss(actor_network=actor, critic_network=critic)
advantage = GAE(value_network=critic, gamma=0.99, lmbda=0.95)
data = advantage(data)
losses = loss(data)
loss_value = losses["loss_objective"] + losses["loss_critic"] + losses["loss_entropy"]
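The `gamma` and `lmbda` arguments passed to the advantage estimator control the recursion sketched below. This is a plain-Python rendering of generalized advantage estimation over a single trajectory, with a hypothetical `gae` helper and made-up numbers. It treats `done` as a hard episode end and ignores the distinction between termination and truncation that a full implementation would handle.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Generalized advantage estimation over one trajectory (plain lists)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    # The value target is the advantage added back onto the baseline.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
next_values = [0.5, 0.5, 0.0]
dones = [False, False, True]

advantages, value_targets = gae(rewards, values, next_values, dones)
```

Setting `lmbda` to 0 reduces every advantage to its one-step TD error, while values near 1 propagate later rewards further back at the cost of higher variance.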
For higher-level workflows, TorchRL also provides trainer utilities and Hydra configuration dataclasses that assemble environments, networks, collectors, losses, optimizers, loggers, hooks, and schedules into reproducible recipes.
Multi-agent data is represented as TensorDict structure rather than a separate
parallel convention. Agent observations, actions, rewards, masks, and shared
state can live under nested keys such as ("agents", "observation"), while
losses and modules declare which keys they use.
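Tuple keys such as `("agents", "observation")` are simply paths into a nested container. The sketch below mimics that addressing scheme with nested Python dicts and two hypothetical helpers, `get_nested` and `set_nested`. It is meant only to show how a module can declare its input and output keys instead of hard-coding a data layout, and it is not how TensorDict is implemented.

```python
def get_nested(td, key):
    """Read a value by string key or tuple-of-strings key from nested dicts."""
    if isinstance(key, str):
        key = (key,)
    node = td
    for part in key:
        node = node[part]
    return node

def set_nested(td, key, value):
    """Write a value, creating intermediate dicts as needed."""
    if isinstance(key, str):
        key = (key,)
    node = td
    for part in key[:-1]:
        node = node.setdefault(part, {})
    node[key[-1]] = value
    return td

# Three agents, each with a 2-element observation (made-up numbers).
td = {}
set_nested(td, ("agents", "observation"), [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
set_nested(td, ("agents", "action"), [0, 1, 0])
set_nested(td, "state", [1.0, 2.0])  # shared, not per-agent

# A module declares the keys it uses instead of assuming a layout.
in_key = ("agents", "observation")
out_key = ("agents", "obs_sum")
set_nested(td, out_key, [sum(obs) for obs in get_nested(td, in_key)])
```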
TorchRL supports multi-agent environments and algorithms through VMAS,
PettingZoo, Melting Pot, SMACv2, OpenSpiel, multi-agent trainers, and dedicated
objectives. The 0.13 line adds MAPPO, IPPO, MultiAgentGAE, ValueNorm,
PopArtValueNorm, RunningValueNorm, and cross-agent critic utilities.
The same component style also covers model-based and imitation-learning work: Dreamer/DreamerV3 objectives and RSSM modules, Decision Transformer components, behavior cloning losses, and ACT-style action chunking all share the same TensorDict and key-dispatch conve