github.com/mlfoundations/open_clip @v3.3.0 sqlite

476 symbols 1,525 edges 42 files 91 documented · 19%
README

OpenCLIP

[Paper] [Citations] [Clip Colab] [Coca Colab] pypi

Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

Using this codebase, we have trained several models on a variety of data sources and compute budgets, ranging from small-scale experiments to larger runs including models trained on datasets such as LAION-400M, LAION-2B and DataComp-1B. Many of our models and their scaling properties are studied in detail in the paper reproducible scaling laws for contrastive language-image learning. Some of the best models we've trained and their zero-shot ImageNet-1k accuracy are shown below, along with the ViT-L model trained by OpenAI and other state-of-the-art open source alternatives (all can be loaded via OpenCLIP). We provide more details about our full collection of pretrained models here, and zero-shot results for 38 datasets here.

Model | Training data | Resolution | # of samples seen | ImageNet zero-shot acc.
ConvNext-Base | LAION-2B | 256px | 13B | 71.5%
ConvNext-Large | LAION-2B | 320px | 29B | 76.9%
ConvNext-XXLarge | LAION-2B | 256px | 34B | 79.5%
ViT-B-32-256 | DataComp-1B | 256px | 34B | 72.8%
ViT-B-16 | DataComp-1B | 224px | 13B | 73.5%
ViT-L-14 | LAION-2B | 224px | 32B | 75.3%
ViT-H-14 | LAION-2B | 224px | 32B | 78.0%
ViT-L-14 | DataComp-1B | 224px | 13B | 79.2%
ViT-bigG-14 | LAION-2B | 224px | 34B | 80.1%
ViT-L-14-quickgelu (Original CLIP) | WIT | 224px | 13B | 75.5%
ViT-SO400M-14-SigLIP (SigLIP) | WebLI | 224px | 45B | 82.0%
ViT-L-14 (DFN) | DFN-2B | 224px | 39B | 82.2%
ViT-L-16-256 (SigLIP2) | WebLI (multi-lang) | 256px | 40B | 82.5%
ViT-SO400M-14-SigLIP-384 (SigLIP) | WebLI | 384px | 45B | 83.1%
ViT-H-14-quickgelu (DFN) | DFN-5B | 224px | 39B | 83.4%
PE-Core-L-14-336 (PE) | MetaCLIP-5.4B | 336px | 58B | 83.5%
ViT-SO400M-16-SigLIP2-384 (SigLIP2) | WebLI (multi-lang) | 384px | 40B | 84.1%
ViT-H-14-378-quickgelu (DFN) | DFN-5B | 378px | 44B | 84.4%
ViT-gopt-16-SigLIP2-384 (SigLIP2) | WebLI (multi-lang) | 384px | 40B | 85.0%
PE-Core-bigG-14-448 (PE) | MetaCLIP-5.4B | 448px | 86B | 85.4%

Model cards with additional model specific details can be found on the Hugging Face Hub under the OpenCLIP library tag: https://huggingface.co/models?library=open_clip.

If you found this repository useful, please consider citing. We welcome anyone to submit an issue or send an email if you have any other requests or suggestions.

Note that portions of src/open_clip/ modelling and tokenizer code are adaptations of OpenAI's official repository.

Approach

CLIP
Image Credit: https://github.com/openai/CLIP

Usage

pip install open_clip_torch

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()  # model in train mode by default, impacts some models with BatchNorm or stochastic depth active
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
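The final lines of the snippet above normalize both embeddings, scale their cosine similarities by 100, and apply a softmax over the candidate texts. The following is a dependency-free sketch of that arithmetic only; the three-dimensional toy vectors and the helper names (l2_normalize, softmax, zero_shot_probs) are made up for illustration and are not part of open_clip.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    img = l2_normalize(image_vec)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]
    return softmax([scale * s for s in sims])


# Toy embeddings (hypothetical): the image points mostly along the first text.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image, texts)
print([round(p, 3) for p in probs])
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity turns into a near one-hot distribution, which is why the README example prints values close to 1 and 0.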

If the model uses timm image encoders (convnext, siglip, eva, etc.), ensure the latest timm is installed. Upgrade timm if you see 'Unknown model' errors for the image encoder.

If the model uses transformers tokenizers, ensure transformers is installed.

See also this [Clip Colab].

To compute billions of embeddings efficiently, you can use clip-retrieval which has openclip support.

Pretrained models

We offer a simple model interface to instantiate both pre-trained and untrained models. To see which pretrained models are available, use the following code snippet. More details about our pretrained models are available here.

>>> import open_clip
>>> open_clip.list_pretrained()
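A minimal sketch of how that listing might be organized, assuming open_clip.list_pretrained() returns (model name, pretrained tag) pairs. The helper group_by_architecture and the three sample pairs are illustrative only; in practice you would pass the real output of list_pretrained() instead of sample_pairs.

```python
from collections import defaultdict


def group_by_architecture(pairs):
    """Map each model name to the list of pretrained tags available for it."""
    grouped = defaultdict(list)
    for model_name, tag in pairs:
        grouped[model_name].append(tag)
    return dict(grouped)


# Illustrative sample; replace with open_clip.list_pretrained() when open_clip is installed.
sample_pairs = [
    ("RN50", "openai"),
    ("ViT-B-32", "openai"),
    ("ViT-B-32", "laion2b_s34b_b79k"),
]

tags = group_by_architecture(sample_pairs)
print(tags["ViT-B-32"])  # prints: ['openai', 'laion2b_s34b_b79k']
```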

You can find more about the models we support (e.g. number of parameters, FLOPs) in this table.

NOTE: Many existing checkpoints use the QuickGELU activation from the original OpenAI models. This activation is less efficient than the native torch.nn.GELU in recent versions of PyTorch. The model defaults are now nn.GELU, so use the model definitions with the -quickgelu postfix when loading OpenCLIP pretrained weights that were trained with QuickGELU. All OpenAI pretrained weights will always default to QuickGELU. The non -quickgelu model definitions can also be used with pretrained weights that use QuickGELU, but there will be an accuracy drop; when fine-tuning, that drop will likely vanish over longer runs. Future trained models will use nn.GELU.
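To make the mismatch concrete, here is a small standard-library comparison of the two activations. It assumes the usual definitions: exact GELU via the error function, and QuickGELU as x * sigmoid(1.702 * x) as in the original CLIP code. It is a numerical illustration, not the library's implementation.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quick_gelu(x):
    """QuickGELU approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    gap = quick_gelu(x) - gelu(x)
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  quick_gelu={quick_gelu(x):+.4f}  gap={gap:+.4f}")
```

The per-activation gap is small, but it is applied in every MLP block, which is consistent with the accuracy drop described above when weights trained with one activation are run with the other.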

Loading models

Models can be loaded with open_clip.create_model_and_transforms, as shown in the example below. The model name and corresponding pretrained keys are compatible with the outputs of open_clip.list_pretrained().

The pretrained argument also accepts local paths, for example /path/to/my/b32.pt. You can also load checkpoints from Hugging Face this way. To do so, download the open_clip_pytorch_model.bin file (for example, from https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/tree/main), and use pretrained=/path/to/open_clip_pytorch_model.bin.

# pretrained also accepts local paths
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k') 

Fine-tuning on classification tasks

This repository is focused on training CLIP models. To fine-tune a trained zero-shot model on a downstream classification task such as ImageNet, please see our other repository: WiSE-FT. The WiSE-FT repository contains code for our paper on Robust Fine-tuning of Zero-shot Models, in which we introduce a technique for fine-tuning zero-shot models while preserving robustness under distribution shift.

Data

To download datasets as webdataset, we recommend img2dataset.

Conceptual Captions

See cc3m img2dataset example.

YFCC and other datasets

In addition to specifying the training data via CSV files as mentioned above, our codebase also supports webdataset, which is recommended for larger scale datasets. The expected format is a series of .tar files. Each of these .tar files should contain two files for each training example, one for the image and one for the corresponding text. Both files should have the same name but different extensions. For instance, shard_001.tar could contain files such as abc.jpg and abc.txt. You can learn more about webdataset at https://github.com/webdataset/webdataset. We use .tar files with 1,000 data points each, which we create using tarp.
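The shard layout described above can be sketched with Python's standard tarfile module. This is only an illustration of the expected file pairing; it does not use tarp or the webdataset library, the helper names write_shard and list_pairs are hypothetical, and the image payload is a placeholder rather than a real JPEG.

```python
import io
import os
import tarfile
import tempfile


def write_shard(path, samples):
    """Write (key, image_bytes, caption) samples as key.jpg / key.txt pairs."""
    with tarfile.open(path, "w") as tar:
        for key, image_bytes, caption in samples:
            members = (
                (f"{key}.jpg", image_bytes),
                (f"{key}.txt", caption.encode("utf-8")),
            )
            for name, payload in members:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def list_pairs(path):
    """Return {key: set of extensions} for every file in the shard."""
    with tarfile.open(path, "r") as tar:
        names = tar.getnames()
    by_key = {}
    for name in names:
        key, _, ext = name.rpartition(".")
        by_key.setdefault(key, set()).add(ext)
    return by_key


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard_001.tar")
    # Placeholder bytes stand in for real JPEG data.
    write_shard(shard, [("abc", b"<jpeg bytes>", "a photo of a cat")])
    pairs = list_pairs(shard)

print(pairs)
```

A quick check like list_pairs can catch shards where an image is missing its caption (or the reverse) before a long training run starts.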

You can download the YFCC dataset from Multimedia Commons. Similar to OpenAI, we used a subset of YFCC to reach the aforementioned accuracy numbers. The indices of images in this subset are in OpenAI's CLIP repository.

Training CLIP

Install

We advise you first create a virtual environment with:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

You can then install openclip for training with pip install 'open_clip_torch[training]'.

Development

If you want to make changes or contribute code, clone openclip, create a virtualenv, and then run make install in the openclip folder.

Install PyTorch with pip as per https://pytorch.org/get-started/locally/

Run make install-training to install the training dependencies.

Testing

Tests can be run with make install-test followed by make test.

To run a specific test, use python -m pytest -x -s -v tests -k "training".

Running regression tests against a specific git revision or tag:

  1. Generate testing data

     python tests/util_test.py --model RN50 RN101 --save_model_list models.txt --git_revision 9d31b2ec4df6d8228f370ff20c8267ec6ba39383

     WARNING: This will invoke git and modify your working tree, but will reset it to the current state after data has been generated! Don't modify your working tree while test data is being generated this way.

  2. Run regression tests

     OPEN_CLIP_TEST_REG_MODELS=models.txt python -m pytest -x -s -v -m regression_test

Sample single-process running code:

python -m open_clip_train.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Note: imagenet-val is the path to the validation set of ImageNet for zero-shot evaluation, not the training set! You can remove this argument if you do not want to perform zero-shot evaluation on ImageNet throughout training. Note that the val folder should contain subfolders. If it does not, please use this script.
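The --csv-img-key and --csv-caption-key values in the command above name the columns of the training CSV. A minimal sketch of producing such a file with the standard csv module follows. The tab delimiter is an assumption here (check the CSV options of your open_clip version), and the image paths and captions are placeholders.

```python
import csv
import io

# Placeholder rows: one image path and one caption per training example.
rows = [
    {"filepath": "/path/to/images/0001.jpg", "title": "a diagram"},
    {"filepath": "/path/to/images/0002.jpg", "title": "a dog"},
]

# Written to an in-memory buffer here; in practice, open /path/to/train_data.csv instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["filepath", "title"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```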

Multi-GPU and Beyond

This code has been battle tested up to 1024 A100s and offers a variety of solutions for distributed training. We include native support for SLURM clusters.

As the number of devices used to train increases, so does the space complexity of the logit matrix. Using a naïve all-gather scheme, space complexity will be O(n^2). Instead, complexity may become effectively linear if the flags --gather-with-grad and --local-loss are used. This change produces numerically identical results to the naïve method.
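The quadratic versus linear growth can be illustrated by counting logit-matrix entries per device. This sketch assumes that, with the local loss, each device compares only its own batch against the features gathered from all devices, while the naïve scheme compares all gathered features against each other. The function name and the per-device batch size of 128 are illustrative.

```python
def logit_elements_per_device(local_batch, world_size, local_loss):
    """Count entries in one image-to-text logit matrix held on a single device."""
    global_batch = local_batch * world_size
    # Local loss: (local x global). Naive all-gather: (global x global).
    rows = local_batch if local_loss else global_batch
    return rows * global_batch


for world_size in (8, 64, 1024):
    naive = logit_elements_per_device(128, world_size, local_loss=False)
    local = logit_elements_per_device(128, world_size, local_loss=True)
    print(f"devices={world_size:5d}  naive={naive:>15,}  local={local:>12,}  ratio={naive // local}")
```

Under these assumptions the saving factor equals the number of devices, so the naïve matrix grows with n^2 and the local one with n.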

Epochs

For larger datasets (e.g. LAION-2B), we recommend setting --train-num-samples to a lower value than the full epoch, for example --train-num-samples 135646078 for 1/16 of an epoch, in conjunction with --dataset-resampled to sample with replacement. This produces more frequent checkpoints, so the model can be evaluated more often.
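The arithmetic behind that example value is short. The full-pass size below is inferred by multiplying the example value by 16; it is not a figure stated elsewhere in this README.

```python
samples_per_virtual_epoch = 135_646_078  # value passed to --train-num-samples
virtual_epochs_per_pass = 16             # the "1/16 of an epoch" from the text

# Implied number of samples in one full pass over the dataset.
implied_full_pass = samples_per_virtual_epoch * virtual_epochs_per_pass
print(implied_full_pass)  # prints: 2170337248


def virtual_epochs_for(full_passes):
    """Number of virtual epochs to request for a given number of full passes."""
    return full_passes * virtual_epochs_per_pass


print(virtual_epochs_for(2))  # prints: 32
```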

Patch Dropout

Recent research has shown that one can drop half to three-quarters of the visual tokens, yielding up to 2-3x faster training without loss of accuracy.

You can set this on your visual transformer config with the key patch_dropout.

In the paper, they also finetuned without the patch dropout at the end. You can do this with the command-line argument --force-patch-dropout 0.
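The idea of patch dropout can be sketched on a plain list of tokens. This is a simplified standard-library illustration, not open_clip's implementation: it assumes the first token is a class token that is always kept, and the function name patch_dropout and the token labels are hypothetical.

```python
import random


def patch_dropout(tokens, drop_prob, rng):
    """Keep the first (class) token plus a random subset of the patch tokens."""
    if drop_prob <= 0.0:
        return list(tokens)  # dropout disabled: every token is kept
    cls_token, patches = tokens[0], tokens[1:]
    num_keep = max(1, int(len(patches) * (1.0 - drop_prob)))
    keep_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [cls_token] + [patches[i] for i in keep_idx]


rng = random.Random(0)
# 196 patches corresponds to a 14 x 14 grid, e.g. a 224px image with 16px patches.
tokens = ["cls"] + [f"patch_{i}" for i in range(196)]

kept = patch_dropout(tokens, drop_prob=0.5, rng=rng)
print(len(tokens), "->", len(kept))  # prints: 197 -> 99
```

Passing drop_prob=0 returns every token, which mirrors the effect of disabling patch dropout for a final fine-tuning phase.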

Multiple data sources

OpenCLIP supports using multiple data sources, by separating different data paths with ::. For instance, to train on CC12M and on LAION, one might use --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar". Using --dataset-resampled is recommended for these cases.

By default, the expected number of times the model sees a sample from each source is proportional to the size of the source. For instance, when training on one data source with size 400M and one with size 10M, samples from the first source are 40x more likely to be seen.
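The default proportional behaviour amounts to the following arithmetic. The function and source names are illustrative, and the sketch models only the expectation, not the actual sampler.

```python
def expected_source_shares(sizes):
    """Expected fraction of seen samples that come from each source."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


shares = expected_source_shares({"source_a": 400_000_000, "source_b": 10_000_000})
ratio = shares["source_a"] / shares["source_b"]

print({name: round(share, 4) for name, share in shares.items()})
print(round(ratio, 6))
```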

We also support different weighting of the data sources, by using the --train-data-upsampling-factors flag. For instance, using --train-data-upsampling-factors=1::1 in the above scenario is equivalent to not using the flag, and --train-data-upsampling-factors=1::2 is equivalent to upsampling the second data source twice. If you want to sample from data sources with the same frequency, the upsampling factors should be inversely proportional to the sizes of the data sources. For instance, if dataset A has 1000 samples and dataset B has 100 sample

Core symbols most depended-on inside this repo

_pcfg · called by 117 · src/open_clip/pretrained.py
_n2p · called by 47 · src/open_clip/convert.py
_slpcfg · called by 30 · src/open_clip/pretrained.py
update · called by 19 · src/open_clip_train/train.py
is_master · called by 13 · src/open_clip_train/distributed.py
download_pretrained_from_url · called by 8 · src/open_clip/pretrained.py
decode · called by 7 · src/open_clip/tokenizer.py
_apcfg · called by 7 · src/open_clip/pretrained.py

Shape

Function 215
Method 196
Class 58
Route 7

Languages

Python 100%

Modules by API surface

src/open_clip/transformer.py · 76 symbols
src/open_clip/model.py · 38 symbols
src/open_clip/transform.py · 35 symbols
src/open_clip_train/data.py · 34 symbols
src/open_clip/tokenizer.py · 34 symbols
src/open_clip/loss.py · 28 symbols
src/open_clip/hf_model.py · 20 symbols
src/open_clip/pretrained.py · 18 symbols
src/open_clip/factory.py · 17 symbols
tests/util_test.py · 16 symbols
tests/test_download_pretrained.py · 16 symbols
src/open_clip/modified_resnet.py · 15 symbols

Dependencies from manifests, versioned

pytest 7.2.0 · 1×
pytest-split 0.8.0 · 1×
requests 2.32.5 · 1×
timm 1.0.17 · 1×
torch 2.0 · 1×
webdataset 0.2.5 · 1×

For agents

$ claude mcp add open_clip \
  -- python -m otcore.mcp_server <graph>
