
github.com/facebookresearch/vjepa2 @main sqlite

713 symbols 1,835 edges 87 files 153 documented · 21%

README

🆕 [2026-03-16]: 🔥 V-JEPA 2.1 is released 🔥 A new family of models trained with a novel recipe that learns high-quality and temporally consistent dense features!

[2025-06-25]: V-JEPA 2 is released. [Blog]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Meta FAIR

Mahmoud Assran∗, Adrien Bardes∗, David Fan∗, Quentin Garrido∗, Russell Howes∗, Mojtaba Komeili∗, Matthew Muckley∗, Ammar Rizvi∗, Claire Roberts∗, Koustuv Sinha∗, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas

*Core Team

[Paper] [Blog] [BibTex]

Official PyTorch codebase for V-JEPA 2, V-JEPA 2-AC, and V-JEPA 2.1.

V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

<img src="https://github.com/facebookresearch/vjepa2/raw/main/assets/flowchart.png" width=100%>

V-JEPA 2.1 Pre-training

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

[Paper] [BibTex]

V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features, as highlighted by PCA visualizations:

<img src="https://github.com/facebookresearch/vjepa2/raw/main/assets/teaser_screenshot_5dice.png" width=100%>

The V-JEPA 2.1 approach leverages: (1) Dense Predictive Loss, a masking-based self-supervision objective where all tokens (both visible/context and masked tokens) contribute to the self-supervised training loss; (2) Deep Self-Supervision, which applies the self-supervised loss at multiple intermediate representations of the encoder models; and (3) Multi-Modal Tokenizers for images and videos. We also show that our approach benefits from (4) Model and data scaling.
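To make points (1) and (2) concrete, here is a minimal sketch in plain Python. It is not the training code from this repository. The L1 distance, the equal weighting of visible and masked tokens, the uniform averaging over layers, and all function names are assumptions made for illustration only.

```python
from typing import Sequence

Token = Sequence[float]  # one feature vector per token


def l1_distance(pred: Token, target: Token) -> float:
    """Mean absolute difference between two feature vectors (assumed distance)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def masked_only_loss(preds, targets, is_masked) -> float:
    """Baseline for comparison: average the per-token loss over masked tokens only."""
    losses = [l1_distance(p, t) for p, t, m in zip(preds, targets, is_masked) if m]
    return sum(losses) / len(losses)


def dense_loss(preds, targets) -> float:
    """Dense variant: every token, visible or masked, contributes to the loss."""
    losses = [l1_distance(p, t) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


def deep_supervision_loss(per_layer_preds, per_layer_targets) -> float:
    """Apply the dense loss at several intermediate layers and average (assumed weighting)."""
    layer_losses = [dense_loss(p, t) for p, t in zip(per_layer_preds, per_layer_targets)]
    return sum(layer_losses) / len(layer_losses)


# Three tokens with two feature dimensions each; the first token is visible.
preds = [[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
targets = [[1.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
is_masked = [False, True, True]

print(masked_only_loss(preds, targets, is_masked))  # 0.75
print(dense_loss(preds, targets))                   # 0.5
```

In this toy example the visible token is predicted perfectly, so including it lowers the average. The point is only that the dense loss also sends a training signal through visible tokens.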

<img src="https://github.com/facebookresearch/vjepa2/raw/main/assets/architecture_vjepa2_1.jpg" width=100%>

V-JEPA 2.1 performance across dense and global prediction tasks:

<img src="https://github.com/facebookresearch/vjepa2/raw/main/assets/bars_teaser_tikz-1.png" width=100%>

V-JEPA 2 Pre-training

(Top) The encoder and predictor are pre-trained through self-supervised learning from video using a masked latent feature prediction objective, leveraging abundant natural videos to bootstrap physical world understanding and prediction. (Bottom) Performance of V-JEPA 2 on downstream understanding and prediction tasks.

Benchmark	V-JEPA 2	Previous Best
EK100	39.7%	27.6% (PlausiVL)
SSv2 (Probe)	77.3%	69.7% (InternVideo2-1B)
Diving48 (Probe)	90.2%	86.4% (InternVideo2-1B)
MVP (Video QA)	44.5%	39.9% (InternVL-2.5)
TempCompass (Video QA)	76.9%	75.3% (Tarsier 2)

V-JEPA 2-AC Post-training

(Top) After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. (Bottom) Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.

Method	Reach	Grasp (Cup)	Grasp (Box)	Pick-and-Place (Cup)	Pick-and-Place (Box)
Octo	100%	10%	0%	10%	10%
Cosmos	80%	0%	20%	0%	0%
V-JEPA 2-AC	100%	60%	20%	80%	50%

Models

V-JEPA 2 and V-JEPA 2.1

HuggingFace

See our HuggingFace collection for V-JEPA 2.

V-JEPA 2 Pretrained Checkpoints

Model	#Parameters	Resolution	Download Link	Pretraining Config
ViT-L/16	300M	256	checkpoint	configs
ViT-H/16	600M	256	checkpoint	configs
ViT-g/16	1B	256	checkpoint	configs
ViT-g/16₃₈₄	1B	384	checkpoint	configs

V-JEPA 2.1 Pretrained Checkpoints

Model	#Parameters	Resolution	Download Link	Pretraining Config
ViT-B/16	80M	384	checkpoint	configs
ViT-L/16	300M	384	checkpoint	configs
ViT-g/16	1B	384	checkpoint	configs
ViT-G/16	2B	384	checkpoint	configs

Pretrained backbones (via PyTorch Hub)

Please install PyTorch, timm and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.

import torch

# preprocessor
processor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')
# models
# V-JEPA 2
vjepa2_vit_large = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_large')
vjepa2_vit_huge = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_huge')
vjepa2_vit_giant = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant')
vjepa2_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant_384')
# V-JEPA 2.1
vjepa2_1_vit_base_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_base_384')
vjepa2_1_vit_large_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_large_384')
vjepa2_1_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_giant_384')
vjepa2_1_vit_gigantic_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_gigantic_384')
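If you only need one backbone, a small lookup can avoid loading all of them. The sketch below is a convenience wrapper and not part of the repository. The entry-point strings are copied from the snippet above and the resolutions from the checkpoint tables, while `HUB_ENTRY_POINTS`, `hub_entry_point` and `load_backbone` are hypothetical names introduced here.

```python
HUB_REPO = "facebookresearch/vjepa2"

# (version, size, resolution) -> PyTorch Hub entry point, copied from the README.
HUB_ENTRY_POINTS = {
    ("2", "large", 256): "vjepa2_vit_large",
    ("2", "huge", 256): "vjepa2_vit_huge",
    ("2", "giant", 256): "vjepa2_vit_giant",
    ("2", "giant", 384): "vjepa2_vit_giant_384",
    ("2.1", "base", 384): "vjepa2_1_vit_base_384",
    ("2.1", "large", 384): "vjepa2_1_vit_large_384",
    ("2.1", "giant", 384): "vjepa2_1_vit_giant_384",
    ("2.1", "gigantic", 384): "vjepa2_1_vit_gigantic_384",
}


def hub_entry_point(version: str, size: str, resolution: int) -> str:
    """Return the hub entry-point name, or raise KeyError for unlisted combinations."""
    key = (version, size, resolution)
    if key not in HUB_ENTRY_POINTS:
        raise KeyError(f"No listed checkpoint for version={version}, size={size}, resolution={resolution}")
    return HUB_ENTRY_POINTS[key]


def load_backbone(version: str, size: str, resolution: int):
    """Load one backbone; returns whatever the hub entry point returns."""
    import torch  # imported lazily so the lookup works without PyTorch installed

    return torch.hub.load(HUB_REPO, hub_entry_point(version, size, resolution))


print(hub_entry_point("2.1", "large", 384))  # vjepa2_1_vit_large_384
```

Calling `load_backbone("2", "giant", 384)` triggers the download, so it needs network access and the dependencies listed above.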

Pretrained checkpoints on HuggingFace

You can also use our pretrained checkpoints on HuggingFace for V-JEPA 2.

from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-256"
# facebook/vjepa2-vitl-fpc64-256
# facebook/vjepa2-vith-fpc64-256
# facebook/vjepa2-vitg-fpc64-256
# facebook/vjepa2-vitg-fpc64-384

model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
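The four repository names listed in the comments above follow one naming pattern. The helper below is a small sketch that builds the name from a size letter and a resolution, and rejects combinations that are not in that list. `KNOWN_HF_REPOS` and `hf_repo_id` are hypothetical names, not part of `transformers` or of this repository.

```python
# Repository names copied from the comments in the snippet above.
KNOWN_HF_REPOS = (
    "facebook/vjepa2-vitl-fpc64-256",
    "facebook/vjepa2-vith-fpc64-256",
    "facebook/vjepa2-vitg-fpc64-256",
    "facebook/vjepa2-vitg-fpc64-384",
)


def hf_repo_id(size: str, resolution: int = 256) -> str:
    """Build a repo id from a size letter ('l', 'h' or 'g') and an input resolution."""
    repo = f"facebook/vjepa2-vit{size}-fpc64-{resolution}"
    if repo not in KNOWN_HF_REPOS:
        raise ValueError(f"{repo} is not one of the listed checkpoints")
    return repo


print(hf_repo_id("g", 384))  # facebook/vjepa2-vitg-fpc64-384
```

The returned string can be passed as `hf_repo` to `AutoModel.from_pretrained` and `AutoVideoProcessor.from_pretrained` as shown above.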

Evaluation Attentive Probes

We share the trained attentive probes for two of our visual understanding evals (Something-Something v2 and Diving48) and the action anticipation eval EPIC-KITCHENS-100.

Model	SSv2 Checkpoint	SSv2 Training Config	SSv2 Inference Config	SSv2 Result	Diving48 Checkpoint	Diving48 Training Config	Diving48 Inference Config	Diving48 Result	EK100 Checkpoint	EK100 Training Config	EK100 Inference Config	EK100 Result
ViT-L/16	checkpoint	config	config	73.7%	checkpoint	config	config	89.0%	checkpoint	config	config	32.7 R@5
ViT-g/16₃₈₄	checkpoint	config	config	77.3%	checkpoint	config	config	90.2%	checkpoint	config	config	39.7 R@5
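The EK100 column reports R@5. The sketch below assumes this stands for class-mean recall at 5: a sample counts as a hit when its true class is among the five highest-scoring classes, and per-class hit rates are averaged. This is an illustration in plain Python, not the evaluation code from `evals/`, and the exact protocol is defined by the inference configs.

```python
from collections import defaultdict
from typing import Sequence


def class_mean_recall_at_k(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5) -> float:
    """Average, over the classes present in `labels`, of the per-class top-k hit rate."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        totals[label] += 1
        hits[label] += int(label in top_k)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Three samples, three classes; k is reduced to fit the toy example.
scores = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]
labels = [0, 1, 1]

print(class_mean_recall_at_k(scores, labels, k=1))  # 0.75
print(class_mean_recall_at_k(scores, labels, k=2))  # 1.0
```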

V-JEPA 2-AC

Our action-conditioned checkpoint was trained from the ViT-g encoder.

Model	Download Link	Training Config
ViT-g/16	checkpoint	config

Pretrained action-conditioned backbone (via PyTorch Hub)

Please install PyTorch, timm and einops locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.

import torch

vjepa2_encoder, vjepa2_ac_predictor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_ac_vit_giant')

See energy_landscape_example.ipynb for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected from our lab. To run this notebook, you'll need to additionally install Jupyter and SciPy in your conda environment.
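As a rough intuition for what an energy landscape is, the toy sketch below scores candidate actions by how close a predicted next state lands to a goal state. Everything here is an assumption for illustration: the energy is taken to be an L1 distance, `toy_predictor` is a hypothetical stand-in that simply adds the action to the state, and the two-dimensional vectors replace the latent features that the real encoder and predictor would produce. The notebook is the reference for the actual computation.

```python
from itertools import product
from typing import Callable, Sequence

Vector = Sequence[float]


def l1_energy(predicted: Vector, goal: Vector) -> float:
    """Mean absolute difference between a predicted state and the goal state."""
    return sum(abs(p - g) for p, g in zip(predicted, goal)) / len(goal)


def toy_predictor(state: Vector, action: Vector) -> list:
    """Hypothetical stand-in for an action-conditioned predictor: state plus action."""
    return [s + a for s, a in zip(state, action)]


def energy_landscape(state: Vector, goal: Vector, actions: Sequence[Vector],
                     predictor: Callable[[Vector, Vector], Vector]) -> list:
    """Energy of each candidate action; lower means the prediction is closer to the goal."""
    return [l1_energy(predictor(state, a), goal) for a in actions]


state, goal = [0.0, 0.0], [1.0, -1.0]
actions = list(product((-1.0, 0.0, 1.0), repeat=2))  # 3 x 3 grid of candidate actions

energies = energy_landscape(state, goal, actions, toy_predictor)
best_action = actions[energies.index(min(energies))]
print(best_action)  # (1.0, -1.0)
```

Picking the lowest-energy action from a grid is the simplest possible planner. It is used here only because it keeps the example short.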

Getting Started

Setup

conda create -n vjepa2-312 python=3.12
conda activate vjepa2-312
pip install .  # or `pip install -e .` for development mode
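After installation, a quick check can confirm that the packages named in the PyTorch Hub instructions are importable in the active environment. This is a small standard-library sketch and `missing_packages` is a hypothetical helper, not something shipped with the repository.

```python
import importlib.util

# Package names taken from the PyTorch Hub instructions in this README.
REQUIRED = ("torch", "timm", "einops")


def missing_packages(names=REQUIRED) -> list:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```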

Note to macOS users: V-JEPA 2 relies on decord, which does not support macOS (and, unfortunately, is also no longer under development). In order to run the V-JEPA 2 code on macOS, you will need a different decord implementation. We do not make specific recommendations, although some users have reported the use of eva-decord (see PR 1) or decord2 (see PR 31). We leave the selection of the decord package up to the user's

Core symbols most depended-on inside this repo

src/utils/schedulers.py

trunc_normal_

called by 14

src/utils/tensors.py

rotate_queries_or_keys

called by 14

src/models/utils/modules.py

src/utils/distributed.py

apply_masks

called by 13

src/masks/utils.py

Shape

Method 328

Function 268

Class 117

Languages

Python 100%

Modules by API surface

src/datasets/utils/video/transforms.py · 58 symbols

src/datasets/utils/video/randaugment.py · 42 symbols

src/models/utils/modules.py · 40 symbols

app/vjepa_2_1/models/utils/modules.py · 32 symbols

src/datasets/utils/dataloader.py · 26 symbols

src/models/vision_transformer.py · 24 symbols

app/vjepa_2_1/models/vision_transformer.py · 24 symbols

evals/action_anticipation_frozen/epickitchens.py · 21 symbols

src/datasets/utils/weighted_sampler.py · 15 symbols

src/hub/backbones.py · 13 symbols

evals/video_classification_frozen/eval.py · 13 symbols

tests/models/test_models.py · 12 symbols

Dependencies from manifests, versioned

black 26.3.1 · 1×

flake8 7.0.0 · 1×

isort 5.13.2 · 1×

torch 2 · 1×

For agents

$ claude mcp add vjepa2 \
  -- python -m otcore.mcp_server <graph>
