github.com/apple/ml-fastvlm @main sqlite

358 symbols · 1,185 edges · 39 files · 72 documented (20%)
README

FastVLM: Efficient Vision Encoding for Vision Language Models

This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025).

Accuracy vs latency figure.

Highlights

  • We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
  • Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
  • Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.
  • We provide a demo iOS app that demonstrates the performance of our model on a mobile device.
Demos: FastVLM - Counting · FastVLM - Handwriting · FastVLM - Emoji

Getting Started

We use the LLaVA codebase to train FastVLM variants. To train or finetune your own variants, please follow the instructions provided in the LLaVA codebase. Below, we provide instructions for running inference with our models.

Setup

conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .

Model Zoo

For detailed information on various evaluations, please refer to our paper.

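| Model        | Stage | PyTorch Checkpoint (url) |
|--------------|-------|--------------------------|
| FastVLM-0.5B | 2     | fastvlm_0.5b_stage2      |
| FastVLM-0.5B | 3     | fastvlm_0.5b_stage3      |
| FastVLM-1.5B | 2     | fastvlm_1.5b_stage2      |
| FastVLM-1.5B | 3     | fastvlm_1.5b_stage3      |
| FastVLM-7B   | 2     | fastvlm_7b_stage2        |
| FastVLM-7B   | 3     | fastvlm_7b_stage3        |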

To download all the pretrained checkpoints, run the command below. This may take some time depending on your connection, so it might be a good moment to grab ☕️ while you wait.

bash get_models.sh   # Files will be downloaded to `checkpoints` directory.
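After the download finishes, it can help to confirm what actually landed on disk. The following is a minimal sketch, not part of the repository: `list_checkpoints` is a hypothetical helper that only lists the subdirectories of `checkpoints`, and it makes no assumption about how the downloaded directories are named.

```python
from pathlib import Path


def list_checkpoints(root="checkpoints"):
    """Return the sorted names of the subdirectories found under `root`.

    Hypothetical helper: it only inspects the filesystem and assumes
    nothing about the names get_models.sh gives to the checkpoints.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been downloaded yet, or the script was run elsewhere.
        return []
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())


if __name__ == "__main__":
    found = list_checkpoints()
    if found:
        print("Found checkpoint directories:", ", ".join(found))
    else:
        print("No checkpoint directories found; run get_models.sh first.")
```

Any directory name printed here can then be passed as the checkpoint path when running inference.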

Usage Example

To run inference with a PyTorch checkpoint, use the command below:

python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
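To call the same command from a Python script (for example, to loop over several images), one option is to wrap it with `subprocess`. This is a sketch under stated assumptions: `build_predict_command` and `run_predict` are hypothetical helpers, the script is assumed to be run from the repository root, and `predict.py` is assumed to write the model's answer to standard output. Only the three flags shown above are used.

```python
import subprocess
import sys


def build_predict_command(model_path, image_file, prompt, script="predict.py"):
    """Build the argument list for the predict.py invocation shown above."""
    return [
        sys.executable, script,
        "--model-path", str(model_path),
        "--image-file", str(image_file),
        "--prompt", prompt,
    ]


def run_predict(model_path, image_file, prompt):
    """Run predict.py and return whatever it writes to stdout.

    Assumption: the answer is printed to stdout. A non-zero exit code
    raises subprocess.CalledProcessError because check=True is set.
    """
    cmd = build_predict_command(model_path, image_file, prompt)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or punctuation.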

Inference on Apple Silicon

To run inference on Apple Silicon, PyTorch checkpoints have to be exported to a format suitable for running on Apple Silicon. Detailed instructions and code can be found in the model_export subfolder. Please see the README there for more details.

For convenience, we provide three models in an Apple Silicon compatible format: fastvlm_0.5b_stage3, fastvlm_1.5b_stage3, fastvlm_7b_stage3. We encourage developers to export the model of their choice with the appropriate quantization levels, following the instructions in model_export.

Inference on Apple Devices

To run inference on Apple devices like iPhone, iPad or Mac, see the app subfolder for more details.

Citation

If you found this code useful, please cite the following paper:

@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}

Acknowledgements

Our codebase is built using multiple open-source contributions. Please see ACKNOWLEDGEMENTS for more details.

License

Please check out the repository LICENSE before using the provided code and LICENSE_MODEL for the released models.

Core symbols (most depended-on inside this repo)

| Symbol                | Called by | File                      |
|-----------------------|-----------|---------------------------|
| tokenizer_image_token | 35        | llava/mm_utils.py         |
| copy                  | 18        | llava/conversation.py     |
| append_message        | 17        | llava/conversation.py     |
| get_model             | 14        | llava/model/llava_arch.py |
| get_prompt            | 13        | llava/conversation.py     |
| to_gradio_chatbot     | 12        | llava/conversation.py     |
| get_vision_tower      | 9         | llava/model/llava_arch.py |
| get_vision_tower      | 7         | llava/model/llava_arch.py |

Shape

Method 176
Function 118
Class 53
Route 11

Languages

Python 100%

Modules by API surface

llava/model/multimodal_encoder/mobileclip/mci.py: 58 symbols
llava/train/train_qwen.py: 35 symbols
llava/train/train.py: 34 symbols
llava/serve/controller.py: 30 symbols
llava/model/multimodal_encoder/clip_encoder.py: 19 symbols
llava/serve/sglang_worker.py: 15 symbols
llava/mm_utils.py: 15 symbols
llava/train/llava_trainer.py: 14 symbols
llava/serve/model_worker.py: 14 symbols
llava/serve/gradio_web_server.py: 13 symbols
llava/model/multimodal_encoder/mobileclip_encoder.py: 13 symbols
llava/model/llava_arch.py: 12 symbols

Dependencies from manifests, versioned

accelerate 1.6.0 · 1×
bitsandbytes
sentencepiece 0.1.99 · 1×
shortuuid
tokenizers 0.21.0 · 1×
torch 2.6.0 · 1×
torchvision 0.21.0 · 1×
transformers 4.48.3 · 1×
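
If inference fails after setup, a version mismatch is one thing worth ruling out. The sketch below compares installed packages against the pinned versions in the dependency list above using the standard library's `importlib.metadata`. `PINNED`, `installed_version` and `find_mismatches` are hypothetical names, and unpinned entries (bitsandbytes, shortuuid) are left out because the list gives no version for them.

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
PINNED = {
    "accelerate": "1.6.0",
    "sentencepiece": "0.1.99",
    "tokenizers": "0.21.0",
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "transformers": "4.48.3",
}


def installed_version(name):
    """Return the installed version of `name`, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # Local build tags such as "2.6.0+cu124" are treated as matching "2.6.0".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the listed version; it does not guarantee that other versions would fail.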

For agents

$ claude mcp add ml-fastvlm \
  -- python -m otcore.mcp_server <graph>
