
Official PyTorch implementation of [CVPR 2025] RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models
Official PyTorch implementation of [CVPR 2024] AM-RADIO: Agglomerative Vision Foundation Model - Reduce All Domains Into One
Check out our preprints: PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation and FeatSharp: Your Vision Model Features, Sharper.
Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov.
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing
[RADIOv2.5][FeatSharp][PHI-S][AM-RADIO][BibTex]
C-RADIOv4 Model Family (Commercially Permissive)
We've updated our teacher set to [SigLIP2-g-384, DINOv3-7B, SAM3], along with some other things, and the result is our strongest set of models to date. See our tech report for more details.
Loadable via torchhub (e.g. model_version='c-radio_v4-h' or model_version='c-radio_v4-so400m') or from HuggingFace:
- C-RADIOv4-SO400M
- C-RADIOv4-H
from PIL import Image
import torch
from torch.nn import functional as F
from torchvision.transforms.functional import pil_to_tensor
model_version="c-radio_v4-h" # for C-RADIOv4-H model (ViT-H/16)
# NOTE: `force_reload` will re-download the source code too. If you have used our TorchHub in the past, we strongly recommend
# running with this flag once to pull the latest code.
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version=model_version, progress=True, skip_validation=True, force_reload=True)
model.cuda().eval()
x = Image.open('assets/cradio_v4.png').convert('RGB')
x = pil_to_tensor(x).to(dtype=torch.float32, device='cuda')
x.div_(255.0) # RADIO expects the input values to be between 0 and 1
x = x.unsqueeze(0) # Add a batch dimension
#### Example 1 ####
# Regular Usage
###################
nearest_res = model.get_nearest_supported_resolution(*x.shape[-2:])
x = F.interpolate(x, nearest_res, mode='bilinear', align_corners=False)
# RADIO expects the input to have values between [0, 1]. It will automatically normalize them to have mean 0 std 1.
summary, spatial_features = model(x)
#### Example 2 ####
# Returning features in NCHW format, for easier spatial handling
###################
# By default, RADIO will return the spatial_features in NLC format, with L being a combined height/width dimension.
# You can alternatively ask for the features in the more computer-vision-convenient format NCHW the following way:
summary, spatial_features = model(x, feature_fmt='NCHW')
assert spatial_features.ndim == 4
#### Example 3 ####
# AMP autocasting (mixed precision, critical for fast performance with self attention)
###################
# RADIO also supports running in mixed precision:
with torch.autocast('cuda', dtype=torch.bfloat16):
    summary, spatial_features = model(x)
#### Example 4 ####
# Decoupled input normalization
###################
# If you'd rather pre-normalize the inputs, then you can do this:
conditioner = model.make_preprocessor_external()
# Now, the model won't change the inputs, and it's up to the user to call `cond_x = conditioner(x)` before
# calling `model(cond_x)`. You most likely would do this if you want to move the conditioning into your
# existing data processing pipeline.
with torch.autocast('cuda', dtype=torch.bfloat16):
    cond_x = conditioner(x)
    summary, spatial_features = model(cond_x)
#### Example 5 ####
# Teacher adaptors, e.g. for text alignment
###################
# Adaptors
# One or more may be specified via the `adaptor_names` argument
model = torch.hub.load('NVlabs/RADIO', 'radio_model', version=model_version, progress=True, skip_validation=True, adaptor_names=['siglip2-g'])
model.cuda().eval()
vis_output = model(x)
# These are the usual RADIO features
backbone_summary, backbone_features = vis_output['backbone']
# There will also be summary and feature pairs for each of the loaded adaptors
sig2_vis_summary, sig2_vis_features = vis_output['siglip2-g']
# The 'siglip2-g' and 'clip' adaptors (when available) are special because they also support text tokenization and encoding
sig2_adaptor = model.adaptors['siglip2-g']
text_input = sig2_adaptor.tokenizer(['An image of an alien wearing headphones, with three orbs floating overhead']).to('cuda')
text_tokens = sig2_adaptor.encode_text(text_input, normalize=True)
sim = F.cosine_similarity(sig2_vis_summary, text_tokens)
print(sim)
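Example 1 snaps the input to a resolution the model supports, and Example 2 notes that the `L` dimension of the NLC output is a flattened height/width grid. The sketch below illustrates both ideas in plain Python. It assumes that supported resolutions are multiples of a fixed step and that the step equals the 16-pixel patch size of a ViT-H/16. Both are assumptions made for illustration, and `snap_resolution` and `token_grid` are hypothetical helpers, not part of the repository; `model.get_nearest_supported_resolution` remains the authoritative source.

```python
def snap_resolution(height, width, step=16):
    """Round each side to the nearest multiple of `step` (at least one step).

    Assumption: supported resolutions are multiples of `step`. The real model
    may use a different step, so prefer model.get_nearest_supported_resolution.
    """
    def snap(side):
        return max(step, int(round(side / step)) * step)
    return snap(height), snap(width)


def token_grid(height, width, patch_size=16):
    """Return (rows, cols, L) for the spatial tokens of a patch-based ViT.

    Assumption: one token per non-overlapping patch_size x patch_size patch.
    """
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


h, w = snap_resolution(500, 333)
rows, cols, num_tokens = token_grid(h, w)
print((h, w), (rows, cols), num_tokens)
```

Under these assumptions, an NLC tensor of shape (B, rows * cols, C) corresponds to an NCHW tensor of shape (B, C, rows, cols).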
We also demonstrate how to use C-RADIOv4 to replace the vision encoder in SAM3 here: https://github.com/mranzinger/sam3-radio/blob/main/demo_sam3_radio.py
RADIO1D is a Vision Transformer variant that compresses spatial tokens into a variable-length 1D sequence of "global tokens" during encoding, and reconstructs the full spatial resolution via a decoder. The number of tokens can be chosen at inference time, providing a tunable trade-off between feature compactness and reconstruction fidelity.
A RADIO1D model exposes two named "necks":
- encoder — the compressed 1D global tokens, shape (B, num_tokens, C)
- decoder — the spatially-reconstructed features, shape (B, H*W, C)
Two new arguments on radio_model() / RADIOModel.forward() plumb this through:
- num_tokens: Optional[int] — number of tokens to keep in the 1D encoder output (default: model's max).
- neck_name: Optional[str] — which neck's output to return (default: returns a dict of all necks for multi-neck models).
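As a minimal sketch of how these two arguments might be used together, the helper below runs the same input through a model at several token budgets and collects the selected neck's output. `sweep_num_tokens` is a hypothetical helper, not part of the repository. It only assumes that the model can be called as `model(x, num_tokens=..., neck_name=...)`, as described above. A stand-in callable is used so that the snippet runs without a checkpoint.

```python
def sweep_num_tokens(model, x, token_counts, neck_name="decoder"):
    """Run `model` once per token budget and collect the chosen neck's output.

    Assumption: `model` accepts `num_tokens` and `neck_name` keyword arguments.
    The structure of each returned output is whatever the model produces.
    """
    outputs = {}
    for n in token_counts:
        outputs[n] = model(x, num_tokens=n, neck_name=neck_name)
    return outputs


def fake_radio1d(x, num_tokens=None, neck_name=None):
    """Stand-in for a RADIO1D model; it only echoes the arguments it received."""
    return {"neck": neck_name, "num_tokens": num_tokens}


results = sweep_num_tokens(fake_radio1d, x=None, token_counts=[1, 64, 256])
print(sorted(results))
```

With a real model, wrapping the sweep in `torch.no_grad()` avoids tracking gradients during inference.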
You can qualitatively visualize the trade-off by sweeping num_tokens over a single image with examples/visualize_features.py:
python examples/visualize_features.py -v <radio1d_checkpoint> -d <image_dir> \
--neck decoder --animate-radio1d \
--radio1d-start 1 --radio1d-end 512 --radio1d-step 32
Pretrained checkpoints are not yet released.
C-RADIOv3 Model Family (Commercially Permissive)
Loadable via torchhub (e.g. model_version='c-radio_v3-h') or from HuggingFace:
- C-RADIOv3-B
- C-RADIOv3-L
- C-RADIOv3-H
- C-RADIOv3-g
Now, also supported as a Foundation Model in TAO Toolkit!
AM-RADIO is a framework for distilling large vision foundation models into a single model. RADIO, a new vision foundation model, excels across visual domains and serves as a superior replacement for vision backbones. By integrating CLIP variants, DINOv2, and SAM through distillation, it preserves their unique features, such as text grounding and segmentation correspondence. It outperforms its teachers on ImageNet zero-shot (+6.8%), kNN (+2.39%), and linear-probing segmentation (+3.8%), and improves vision-language models (LLaVa 1.5, by up to 1.5%). It scales to any resolution and supports non-square images. We also offer an efficient variant, E-RADIO, which is 6-10x faster than CLIP and DINOv2.

Models prefixed with C-RADIO are governed by the NVIDIA Open Model License, which enables commercial use cases.
Models prefixed with E-RADIO and RADIO are governed by the NSCL LICENSE file, which is non-commercial.
The latest model version is C-RADIOv4. We will update this description once a new model is available.
The list of available versions, along with some of their attributes, can be found in common.py. The keys of the RESOURCE_MAP dictionary may be used as the version argument to torch.hub.load. Refer to the licensing section above for the use restrictions of particular models.
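One way to catch a mistyped version string before calling torch.hub.load is to check it against the dictionary keys first. The sketch below is illustrative only: `check_version` is a hypothetical helper, and `EXAMPLE_RESOURCE_MAP` is a stand-in that lists just the version names mentioned in this README with placeholder values. It does not reproduce the real contents of common.py.

```python
import difflib

# Stand-in for RESOURCE_MAP: only version names mentioned in this README,
# with placeholder values. The real dictionary lives in common.py.
EXAMPLE_RESOURCE_MAP = {
    "c-radio_v4-h": {},
    "c-radio_v4-so400m": {},
    "c-radio_v3-h": {},
}


def check_version(version, resource_map):
    """Return `version` if it is a known key, else raise with suggestions."""
    if version in resource_map:
        return version
    close = difflib.get_close_matches(version, list(resource_map), n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise KeyError(f"Unknown RADIO version '{version}'.{hint}")


print(check_version("c-radio_v4-h", EXAMPLE_RESOURCE_MAP))
```

In practice, the real RESOURCE_MAP would be passed in place of the stand-in, and the validated string would then be used as the version argument.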
The C-RADIO (short for Commercial RADIO) family of models is trained on different data that is commercially viable. This allows us to release these models under the NVIDIA Open Model License, which permits commercial use.
Along with the TorchHub usage ab
$ claude mcp add RADIO \
-- python -m otcore.mcp_server <graph>