
github.com/facebookresearch/jepa @main sqlite

340 symbols · 816 edges · 38 files · 84 documented (25%)
README

V-JEPA: Video Joint Embedding Predictive Architecture

Official PyTorch codebase for the video joint-embedding predictive architecture, V-JEPA, a method for self-supervised learning of visual representations from video.

Meta AI Research, FAIR

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas

[Blog] [Paper] [Yannic Kilcher's Video]

V-JEPA models are trained by passively watching video pixels from the VideoMix2M dataset, and produce versatile visual representations that perform well on downstream video and image tasks without adaptation of the model’s parameters; e.g., using a frozen backbone and only a lightweight task-specific attentive probe.

Method

V-JEPA pretraining is based solely on an unsupervised feature prediction objective, and does not utilize pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.

     

Visualizations

As opposed to generative methods that have a pixel decoder, V-JEPA has a predictor that makes predictions in latent space. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.

The V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video.

MODEL ZOO

Pretrained models

model   patch size   resolution   iterations   batch size   data         download
ViT-L   2x16x16      224x224      90K          3072         VideoMix2M   checkpoint   configs
ViT-H   2x16x16      224x224      90K          3072         VideoMix2M   checkpoint   configs
ViT-H   2x16x16      384x384      90K          2400         VideoMix2M   checkpoint   configs
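To make the patch size and resolution columns concrete, the sketch below counts how many tokens a clip yields when it is cut into non-overlapping 2x16x16 patches (frames x height x width). The 16-frame clip length is an assumption made for illustration, since the table does not state it, and `num_tokens` is a hypothetical helper rather than a function from this repository.

```python
def num_tokens(frames, height, width, patch=(2, 16, 16)):
    """Count non-overlapping (t, h, w) patches in a clip of the given size."""
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")
    return (frames // pt) * (height // ph) * (width // pw)


# Assumed 16-frame clips; the resolutions are the two listed in the table.
print(num_tokens(16, 224, 224))  # 8 * 14 * 14 = 1568
print(num_tokens(16, 384, 384))  # 8 * 24 * 24 = 4608
```

Under that assumption, moving from 224x224 to 384x384 roughly triples the sequence length, which is one plausible reason the higher-resolution model is listed with a smaller batch size.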

K400 Attentive probes

model      resolution   accuracy (16x8x3)   download
ViT-L/16   224x224      80.8                attentive probe checkpoint   configs
ViT-H/16   224x224      82.0                attentive probe checkpoint   configs
ViT-H/16   384x384      81.9                attentive probe checkpoint   configs

SSv2 Attentive probes

model      resolution   accuracy (16x2x3)   download
ViT-L/16   224x224      69.5                attentive probe checkpoint   configs
ViT-H/16   224x224      71.4                attentive probe checkpoint   configs
ViT-H/16   384x384      72.2                attentive probe checkpoint   configs

ImageNet1K Attentive probes

model      resolution   accuracy   download
ViT-L/16   224x224      74.8       attentive probe checkpoint   configs
ViT-H/16   224x224      75.9       attentive probe checkpoint   configs
ViT-H/16   384x384      77.4       attentive probe checkpoint   configs

Places205 Attentive probes

model      resolution   accuracy   download
ViT-L/16   224x224      60.3       attentive probe checkpoint   configs
ViT-H/16   224x224      61.7       attentive probe checkpoint   configs
ViT-H/16   384x384      62.8       attentive probe checkpoint   configs

iNat21 Attentive probes

model      resolution   accuracy   download
ViT-L/16   224x224      67.8       attentive probe checkpoint   configs
ViT-H/16   224x224      67.9       attentive probe checkpoint   configs
ViT-H/16   384x384      72.6       attentive probe checkpoint   configs

Code Structure

Config files: All experiment parameters are specified in config files (as opposed to command-line arguments). See the configs/ directory for example config files. Before launching an experiment, you must update the paths in the config file to point to your own directories: where to save the logs and checkpoints, and where to find the training data.
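Because a forgotten placeholder path is an easy mistake to make, a small pre-launch check can help. The sketch below walks an already-parsed config (nested dicts and lists) and reports string values that look like absolute paths but do not exist on disk. `missing_paths` and the key names in `example` are hypothetical and chosen only for illustration; the real key names are whatever the files under configs/ define.

```python
import os


def missing_paths(config, prefix=""):
    """Return (dotted_key, value) pairs for absolute-path strings that do not exist."""
    if isinstance(config, dict):
        items = list(config.items())
    elif isinstance(config, (list, tuple)):
        items = list(enumerate(config))
    else:
        items = []
    problems = []
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list, tuple)):
            problems.extend(missing_paths(value, name))  # recurse into nested sections
        elif isinstance(value, str) and value.startswith("/") and not os.path.exists(value):
            problems.append((name, value))
    return problems


# Hypothetical config contents, standing in for a parsed config file.
example = {
    "logging": {"folder": "/path/that/does/not/exist/logs"},
    "data": {"datasets": ["/path/that/does/not/exist/index.csv"]},
}
for name, value in missing_paths(example):
    print(f"{name}: {value} not found")
```

An output folder may legitimately not exist yet if it is created at launch, so the result is best read as a list of paths to review rather than a list of errors.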

.
├── app                       # the only place where training loops are allowed
│   ├── vjepa                 #   Video JEPA pre-training
│   ├── main_distributed.py   #   entrypoint for launching app on slurm cluster
│   └── main.py               #   entrypoint for launching app locally on your machine for debugging
├── evals                     # the only place where evaluation of 'apps' is allowed
│   ├── image_classification  #   training an attentive probe for image classification with frozen backbone
│   ├── video_classification  #   training an attentive probe for video classification with frozen backbone
│   ├── main_distributed.py   #   entrypoint for launching distributed evaluations on slurm cluster
│   └── main.py               #   entrypoint for launching evaluations locally on your machine for debugging
├── src                       # the package
│   ├── datasets              #   datasets, data loaders, ...
│   ├── models                #   model definitions
│   ├── masks                 #   mask collators, masking utilities, ...
│   └── utils                 #   shared utilities
└── configs                   # the only place where config files are allowed (specify experiment params for app/eval runs)
    ├── evals                 #   configs for launching vjepa frozen evaluations
    └── pretrain              #   configs for launching vjepa pretraining

Data preparation

Video Datasets

V-JEPA pretraining and evaluations work with many standard video formats. To make a video dataset compatible with the V-JEPA codebase, you simply need to create a .csv file with the following format and then specify the path to this CSV file in your config.

/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
/absolute_file_path.[mp4, webvid, etc.] $integer_class_label
...

Since V-JEPA is entirely unsupervised, the pretraining code will disregard the $integer_class_label in the CSV file. Thus, feel free to put a random value in this column. However, if you wish to run a supervised video classification evaluation on your video dataset, you must replace $integer_class_label with the ground truth label for each video.
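As an illustration of the index format described above, here is a minimal sketch that scans a directory for videos and writes one `<absolute path> <integer label>` line per file. `write_video_index`, its parameters, and the folder-per-class labelling in the demo are assumptions for this example and are not part of the V-JEPA codebase.

```python
import tempfile
from pathlib import Path


def write_video_index(video_dir, csv_path, label_of=None, extensions=(".mp4",)):
    """Write '<absolute path> <integer label>' lines for videos under video_dir.

    label_of maps a video path to an integer label; when omitted, every row
    gets 0, which is enough for pretraining since the label is ignored there.
    """
    video_dir = Path(video_dir).resolve()
    rows = []
    for path in sorted(video_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            label = 0 if label_of is None else int(label_of(path))
            rows.append(f"{path} {label}")
    Path(csv_path).write_text("\n".join(rows) + ("\n" if rows else ""))
    return len(rows)


# Demo on a throwaway directory with one (empty) clip per class folder.
with tempfile.TemporaryDirectory() as tmp:
    for cls in ("cat", "dog"):
        (Path(tmp) / cls).mkdir()
        (Path(tmp) / cls / "clip.mp4").touch()
    classes = sorted(p.name for p in Path(tmp).iterdir() if p.is_dir())
    index = Path(tmp) / "index.csv"
    count = write_video_index(tmp, index, label_of=lambda p: classes.index(p.parent.name))
    print(count, "videos indexed")
    print(index.read_text())
```

Since the two columns are separated by a space, file paths that themselves contain spaces would likely need renaming before they are indexed this way.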

Image Datasets

We use the standard PyTorch ImageFolder class in our image classification evals. Thus, to set up an image dataset for the image classification evaluation, first create a directory, $your_directory_containing_image_datasets, to store your image datasets. Next, download your image datasets into this directory in a format compatible with PyTorch ImageFolder.
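An ImageFolder-compatible split is a directory with one sub-directory per class, each holding that class's images. The sketch below checks a split for that shape using only the standard library and builds the class-to-index mapping in sorted name order, which is how torchvision's ImageFolder assigns labels to the best of our knowledge. `class_to_idx` here is a hypothetical stand-alone helper, and the class names in the demo are made up.

```python
import os
import tempfile


def class_to_idx(split_dir):
    """Map each class sub-directory of split_dir to an index, in sorted name order."""
    classes = sorted(entry.name for entry in os.scandir(split_dir) if entry.is_dir())
    if not classes:
        raise FileNotFoundError(f"no class folders found in {split_dir}")
    return {name: index for index, name in enumerate(classes)}


# Demo on a throwaway 'train' split with two made-up classes.
with tempfile.TemporaryDirectory() as root:
    for cls in ("tench", "goldfish"):
        os.makedirs(os.path.join(root, "train", cls))
    print(class_to_idx(os.path.join(root, "train")))  # {'goldfish': 0, 'tench': 1}
```

Running the same check on the train and val splits and comparing the two mappings is a quick way to catch a class folder that is missing from one of them.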

For example, suppose we have a directory called my_image_datasets. We would then download our image datasets into this directory so that we end up with the following file tree:

.
└── /my_image_datasets/                # where we store image datasets
    ├── places205/121517/pytorch/      # Places205
    │   └── [...]
    ├── iNaturalist-2021/110421/       # iNaturalist21
    │   └── [...]
    ├── [...]                          # Other Image Datasets
    │   └── [...]
    └── imagenet_full_size/061417/     # ImageNet1k
        ├── train
        │   ├── $class_1
        │   │   ├── xxx.[png, jpeg, etc.]
        │   │   ├── [...]
        │   │   └── xxz.[png, jpeg, etc.]
        │   ├── [...]
        │   └── $class_n
        │       ├── abc.[png, jpeg, etc.]
        │       ├── [...]
        │       └── abz.[png, jpeg, etc.]
        └── val
            ├── $class_1
            │   ├── xxx.[png, jpeg, etc.]
            │   ├── [...]
            │   └── xxz

Core symbols most depended-on inside this repo

step · called by 19 · src/masks/random_tube.py
update · called by 16 · src/utils/logging.py
log · called by 9 · src/utils/logging.py
trunc_normal_ · called by 9 · src/utils/tensors.py
apply_masks · called by 7 · src/masks/utils.py
_check_args_tf · called by 7 · src/datasets/utils/video/randaugment.py
backward · called by 6 · src/utils/distributed.py
get_1d_sincos_pos_embed_from_grid · called by 6 · src/models/utils/pos_embs.py

Shape

Method 149
Function 134
Class 57

Languages

Python 100%

Modules by API surface

src/datasets/utils/video/transforms.py · 56 symbols
src/datasets/utils/video/randaugment.py · 42 symbols
src/models/vision_transformer.py · 17 symbols
src/models/utils/modules.py · 15 symbols
evals/video_classification_frozen/utils.py · 14 symbols
src/datasets/utils/weighted_sampler.py · 12 symbols
src/utils/monitoring.py · 11 symbols
src/utils/logging.py · 11 symbols
src/utils/distributed.py · 10 symbols
src/masks/multiblock3d.py · 10 symbols
src/models/predictor.py · 9 symbols
src/models/attentive_pooler.py · 9 symbols

Dependencies from manifests, versioned

torch 2 · 1×

For agents

$ claude mcp add jepa \
  -- python -m otcore.mcp_server <graph>
