github.com/facebookresearch/MetaCLIP @v0.3 sqlite

398 symbols 1,606 edges 80 files 28 documented · 7%
README

Meta CLIP

FAIR, Meta


After years of advancements in English-centric CLIP development, Meta CLIP 2 is now taking the next step: scaling CLIP to worldwide data. The effort addresses long-standing challenges:

  • large-scale non-English data curation pipelines are largely undeveloped;
  • the curse of multilinguality, where English performance often degrades in multilingual CLIP compared to English-only CLIP.

With a complete recipe for worldwide CLIP—spanning data curation, modeling, and training—we show that English and non-English worlds can mutually benefit and elevate each other, achieving SoTA multilingual performance.

Updates

Quick Start

The pre-trained MetaCLIP models are available in two forms: mini_clip (this repo) and Hugging Face.

mini_clip (this repo)

import torch
from PIL import Image
from src.mini_clip.factory import create_model_and_transforms, get_tokenizer


model, _, preprocess = create_model_and_transforms('ViT-H-14-quickgelu-worldwide@WorldWideCLIP', pretrained='metaclip2_worldwide')
tokenize = get_tokenizer("facebook/xlm-v-base")

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
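
The last step of the snippet above (normalize, take dot products, scale by 100, apply softmax) can be worked through by hand. The following is a minimal pure-Python sketch of that arithmetic on toy vectors; the helper names `l2_normalize` and `label_probs` are hypothetical and the vectors are made up for illustration, not model outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, like `x / x.norm(dim=-1, keepdim=True)`."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-dimensional "embeddings" (illustrative values only).
image = [0.9, 0.1, 0.2]
texts = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]
probs = label_probs(image, texts)
print([round(p, 4) for p in probs])
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity yields a sharply peaked distribution.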

Huggingface

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel


# Meta CLIP 1
processor = AutoProcessor.from_pretrained("facebook/metaclip-b32-400m")
model = AutoModel.from_pretrained("facebook/metaclip-b32-400m")

# Meta CLIP 2
# model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
# processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")

image = Image.open("docs/CLIP.png")
inputs = processor(text=["a diagram", "a dog", "a cat"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    text_probs = logits_per_image.softmax(dim=-1)

print("Label probs:", text_probs)
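
The printed tensor is easier to read once each probability is paired with its prompt. A minimal sketch follows, assuming the probabilities have been converted to a plain list (for example with `text_probs[0].tolist()`); the helper `rank_labels` is a hypothetical name and the numbers are placeholders, not real model output.

```python
def rank_labels(labels, probs):
    """Pair each label with its probability, sorted from most to least likely."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["a diagram", "a dog", "a cat"]
# Placeholder probabilities for illustration only; in practice these would
# come from `text_probs[0].tolist()` in the snippet above.
example_probs = [0.7, 0.1, 0.2]

ranked = rank_labels(labels, example_probs)
for label, prob in ranked:
    print(f"{label}: {prob:.1%}")
```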

Pre-trained Models

Meta CLIP closely adheres to the OpenAI CLIP training and model setup (you mostly just need to replace the weights), to promote rigorous ablation studies and advance scientific understanding, as in the old "era of ImageNet".

Meta CLIP 2

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | CVQA-LOCAL ZS Acc. |
|---|---|---|---|---|---|
| ViT-H-14-quickgelu-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 57.4 |
| ViT-H-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 58.2 |
| ViT-bigG-14-worldwide | metaclip2_worldwide | Online Curation | 29B | 224 | 60.7 |
| ViT-bigG-14-378-worldwide | metaclip2_worldwide | Online Curation | 29B | 378 | 62.0 |

(WIP): Meta CLIP 2: distilled smaller models and tokenizers.

Meta CLIP 1

| model_name | pretrained | Data Card | # of Seen Pairs | Res. | GPUs | IN ZS Acc. |
|---|---|---|---|---|---|---|
| ViT-B-32-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 65.5 |
| ViT-B-16-quickgelu | metaclip_400m | data card | 12.8B | 224 | 64 x V100 | 70.8 |
| ViT-L-14-quickgelu | metaclip_400m | data card | 12.8B | 224 | 128 x V100 | 76.2 |
| ViT-B-32-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 67.6 |
| ViT-B-16-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 64 x V100 | 72.1 |
| ViT-L-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 128 x V100 | 79.2 |
| ViT-H-14-quickgelu | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 80.5 |
| ViT-bigG-14-quickgelu (v1.1) | metaclip_2_5b | data card | 12.8B | 224 | 256 x A100 | 82.1 |
| ViT-H-14 (v1.2) | metaclip_v1_2_altogether | Online Curation | 35B | 224 | 256 x H100 | 82.0 |
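
When picking a checkpoint from a script, it can help to encode the Meta CLIP 1 table as data. The sketch below is illustrative only: `Checkpoint` and `best_checkpoint` are hypothetical names that do not exist in this repo, the rows copy a subset of the columns above, and the "(v1.1)" and "(v1.2)" annotations are dropped from `model_name`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    model_name: str
    pretrained: str
    gpus: str
    in_zs_acc: float  # value of the "IN ZS Acc." column


# Rows copied from the Meta CLIP 1 table.
CHECKPOINTS = [
    Checkpoint("ViT-B-32-quickgelu", "metaclip_400m", "64 x V100", 65.5),
    Checkpoint("ViT-B-16-quickgelu", "metaclip_400m", "64 x V100", 70.8),
    Checkpoint("ViT-L-14-quickgelu", "metaclip_400m", "128 x V100", 76.2),
    Checkpoint("ViT-B-32-quickgelu", "metaclip_2_5b", "64 x V100", 67.6),
    Checkpoint("ViT-B-16-quickgelu", "metaclip_2_5b", "64 x V100", 72.1),
    Checkpoint("ViT-L-14-quickgelu", "metaclip_2_5b", "128 x V100", 79.2),
    Checkpoint("ViT-H-14-quickgelu", "metaclip_2_5b", "256 x A100", 80.5),
    Checkpoint("ViT-bigG-14-quickgelu", "metaclip_2_5b", "256 x A100", 82.1),
    Checkpoint("ViT-H-14", "metaclip_v1_2_altogether", "256 x H100", 82.0),
]


def best_checkpoint(rows, name_prefix=""):
    """Return the row with the highest accuracy whose model_name starts with name_prefix."""
    candidates = [row for row in rows if row.model_name.startswith(name_prefix)]
    if not candidates:
        raise ValueError(f"no checkpoint matches prefix {name_prefix!r}")
    return max(candidates, key=lambda row: row.in_zs_acc)


choice = best_checkpoint(CHECKPOINTS, "ViT-B-32")
print(choice.model_name, choice.pretrained, choice.in_zs_acc)
```

The selected `model_name` and `pretrained` values correspond to the two table columns of the same names.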

Environment

This code is customized from OpenCLIP and will be maintained separately for research on MetaCLIP. The following command should install requirements for OpenCLIP and submitit=1.2.1 used by this repo:

conda create -n metaclip python=3.10 pytorch torchvision pytorch-cuda=11.7 tqdm ftfy braceexpand regex pandas submitit=1.2.1 \
    -c pytorch-nightly \
    -c nvidia \
    -c conda-forge \
    -c anaconda
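
After creating the environment, a quick check can confirm that the packages are visible to Python. This is a minimal sketch using only the standard library; `REQUIRED` and `check_requirements` are hypothetical names, and the distribution names are assumptions (in particular, conda's `pytorch` package is assumed to be registered as `torch`).

```python
from importlib import metadata

# Package names taken from the conda command above; only submitit is pinned.
# Assumption: conda's `pytorch` is registered under the distribution name `torch`.
REQUIRED = {
    "torch": None,
    "torchvision": None,
    "tqdm": None,
    "ftfy": None,
    "braceexpand": None,
    "regex": None,
    "pandas": None,
    "submitit": "1.2.1",
}


def check_requirements(required):
    """Return {name: status}, where status is 'ok', 'missing', or a version mismatch note."""
    report = {}
    for name, wanted in required.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if wanted is not None and found != wanted:
            report[name] = f"found {found}, want {wanted}"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_requirements(REQUIRED).items():
        print(f"{name:12s} {status}")
```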

Curation

See MetaCLIP 2 and MetaCLIP 1.

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Hu Xu (huxu@meta.com).

Citation

Please cite the following papers if MetaCLIP helps your work:

```bibtex
@article{chuang2025metaclip2,
   title={Meta CLIP 2: A Worldwide Scaling Recipe},
   author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
   journal={arXiv preprint arXiv:2507.22062},
   year={2025}
}

@article{xu2023metaclip,
   title={Demystifying CLIP Data},
   author={Hu Xu and Saining Xie and Xiaoqing Ellen Tan and Po-Yao Huang and Russell Howes and Vasu Sharma and Shang-Wen Li and Gargi Ghosh and Luke Zettlemoyer and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2309.16671},
   year={2023}
}

@article{xu2024altogether,
   title={Altogether: Image Captioning via Re-aligning Alt-text},
   author={Hu Xu and Po-Yao Huang and Xiaoqing Ellen Tan and Ching-Feng Yeh and Jacob Kahn and Christine Jou and Gargi Ghosh and Omer Levy and Luke Zettlemoyer and Wen-tau Yih and Shang-Wen Li and Saining Xie and Christoph Feichtenhofer},
   journal={arXiv preprint arXiv:2410.17251},
   year={2024}
}

@inproceedings{ma2024mode,
   title={Mode: Clip data experts via clustering},
   author={Jiawei Ma and Po-Yao Huang and Saining Xie and Shang-Wen Li and Luke Zettlemoyer and Shih-Fu Chang and Wen-Tau Yih and Hu Xu},
   booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
   year={2024}
}
```

Reference

The training code is developed based on OpenCLIP, modified to the vanilla CLIP training setup.

TODO

  • pip installation of metaclip package;
  • refactor mini_clip with apps for MoDE, altogether;
  • more updates for Meta CLIP 2: metadata, data loader, training code.

License

The majority of Meta CLIP is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: open_clip is licensed under the https://github.com/mlfoundations/open_clip license.

Acknowledgement

We gratefully acknowledge the OpenCLIP team for the initial CLIP codebase and integration, and NielsRogge for the integration into Hugging Face.

Core symbols most depended-on inside this repo

| Symbol | Called by | File |
|---|---|---|
| update | 19 | src/training/train.py |
| is_master | 13 | src/training/distributed.py |
| decode | 11 | src/mini_clip/tokenizer.py |
| encode_image | 8 | src/mini_clip/model.py |
| world_info_from_env | 8 | src/training/distributed.py |
| kcc_type | 7 | metaclip/metadata/lang_tokenizers/km.py |
| encode | 7 | src/mini_clip/tokenizer.py |
| search_config | 6 | configs.py |

Shape

Function 206
Method 131
Class 61

Languages

Python 100%

Modules by API surface

src/mini_clip/model.py 46 symbols
src/mini_clip/model_altogether.py 24 symbols
metaclip/metaclip1/cc_matching.py 21 symbols
mode/move2train/mode_wds.py 12 symbols
src/training/train.py 11 symbols
src/mini_clip/tokenizer.py 11 symbols
metaclip/metadata/build_ngram.py 11 symbols
submit.py 10 symbols
mode/move2root/submitit_mode.py 10 symbols
metaclip/metadata/lang_tokenizers/km.py 10 symbols
metaclip/curation/parse_wat.py 10 symbols
altogether/infer.py 9 symbols

For agents

$ claude mcp add MetaCLIP \
  -- python -m otcore.mcp_server <graph>
