github.com/OpenGVLab/InternVL @v1.5.0 sqlite

2,160 symbols 7,016 edges 355 files 375 documented · 17%
README

InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites. A Pioneering Open-Source Alternative to GPT-4V

[Update Blog] [Paper] [InternVL 1.5 Technical Report] [Chat Demo] [HuggingFace Demo] [Quick Start] [Chinese Explainer]

News🚀🚀🚀

  • 2024/04/28: We release the INT8 version of InternVL-Chat-V1-5; see the HF link.
  • 2024/04/28: We achieve SOTA performance (75.74) on the Infographics VQA benchmark; see here.
  • 2024/04/18: InternVL-Chat-V1.5 has been released at the HF link, approaching the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista.
  • 2024/02/27: InternVL is accepted by CVPR 2024! 🎉
  • 2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
  • 2024/02/21: InternVL-Chat-V1.2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
  • 2024/02/12: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog, SFT data or try our demo. The model is now available on HuggingFace, and both training/evaluation data and scripts are open-sourced.
  • 2024/02/04: InternVL-Chat-V1.1 achieves 44.67% on MMVP, higher than GPT-4V!
  • 2024/01/27: We release a 448-resolution model that achieves 76.6 on MMBench dev; see here.
  • 2024/01/24: InternVL-Chat-V1.1 is released. It supports Chinese and has stronger OCR capability; see here or try our demo.
  • 2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

Documents

  • How to install InternVL? [link]
  • How to fine-tune InternVL? [link]
  • How to evaluate InternVL-Chat-V1-5? [link]
  • How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (Recommended) [link]
  • How to deploy a local demo? [link]
  • How to run InternVL 1.5-8bit with an NVIDIA V100 GPU? [link]

Compared with SOTA VLLMs

[three comparison images]

What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with an LLM.

Model Zoo

Vision Large Language Model

| Model | Date | Download | Note |
| --- | --- | --- | --- |
| InternVL-Chat-V1.5-Int8 | 2024.04.28 | 🤗 HF link | The INT8 version of InternVL-Chat-V1-5 |
| InternVL-Chat-V1.5 | 2024.04.18 | 🤗 HF link | support 4K image; super strong OCR; approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 HF link | more SFT data and stronger |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 HF link | scaling up LLM to 34B |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 HF link | support Chinese and stronger OCR |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |

Vision-Language Foundation Model

| Model | Date | Download | Note |
| --- | --- | --- | --- |
| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 HF link | support dynamic resolution, super strong OCR (🔥new) |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 HF link | 448 resolution |
| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 HF link | 448 resolution |
| InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model |

What can InternVL do?

Visual Perception

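  • Linear-Probe Image Classification

* ViT-22B uses the private JFT-3B dataset.

| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |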

  • Semantic Segmentation

| method | decoder | #param (train/total) | crop size | mIoU |
| --- | --- | --- | --- | --- |
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
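
The parenthesized gains in the mIoU column appear to be measured against the ViT-22B row that shares the same decoder and the same frozen or unfrozen backbone setting. The sketch below checks that reading against the numbers in the table. The pairing rule is an assumption inferred from the values, not something the table states, and the names `MIOU` and `gains` are illustrative.

```python
# mIoU values copied from the segmentation table, keyed by
# (decoder, backbone_frozen).
# Assumption: each "(+x)" gain compares InternViT-6B with the ViT-22B row
# that uses the same decoder and the same frozen/unfrozen setting.
MIOU = {
    ("Linear", True): {"ViT-22B": 34.6, "InternViT-6B": 47.2},
    ("UperNet", True): {"ViT-22B": 52.7, "InternViT-6B": 54.9},
    ("UperNet", False): {"ViT-22B": 55.3, "InternViT-6B": 58.9},
}


def gains(table):
    """Return the InternViT-6B minus ViT-22B mIoU gap for each setting."""
    return {
        setting: round(scores["InternViT-6B"] - scores["ViT-22B"], 1)
        for setting, scores in table.items()
    }


if __name__ == "__main__":
    for (decoder, frozen), gain in gains(MIOU).items():
        mode = "frozen" if frozen else "not frozen"
        print(f"{decoder:8s} {mode:10s} +{gain}")
```

Under this pairing the computed gaps are 12.6, 2.2, and 3.6, which match the three parenthesized values in the table.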
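
  • Zero-Shot Image Classification

| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
| --- | --- | --- | --- | --- | --- | --- |
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |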

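  • Multilingual Zero-Shot Image Classification

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
| --- | --- | --- | --- | --- | --- |
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
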
  • Zero-Shot Video Classification [see details]
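
| method | #frame | K400 | K600 | K700 |
| --- | --- | --- | --- | --- |
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |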

Cross-Modal Retrieval

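I2T: image-to-text, T2I: text-to-image

| model | Flickr30K I2T R@1 | Flickr30K I2T R@5 | Flickr30K I2T R@10 | Flickr30K T2I R@1 | Flickr30K T2I R@5 | Flickr30K T2I R@10 | COCO I2T R@1 | COCO I2T R@5 | COCO I2T R@10 | COCO T2I R@1 | COCO T2I R@5 | COCO T2I R@10 | avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 | |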
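
The avg column looks like the arithmetic mean of the twelve R@K values in each row. That is an assumption based on the two rows that list an average, since the table does not define the column. A minimal sketch that checks it, with illustrative names (`RECALLS`, `LISTED_AVG`, `average_recall`):

```python
# Recall values copied from the retrieval table, in column order:
# Flickr30K image-to-text, Flickr30K text-to-image,
# COCO image-to-text, COCO text-to-image (R@1, R@5, R@10 for each).
RECALLS = {
    "OpenCLIP-G": [92.9, 99.3, 99.8, 79.5, 95.0, 97.1,
                   67.3, 86.9, 92.6, 51.4, 74.9, 83.0],
    "EVA-02-CLIP-E+": [93.9, 99.4, 99.8, 78.8, 94.2, 96.8,
                       68.8, 87.8, 92.8, 51.1, 75.0, 82.7],
}

# Averages as printed in the table for the same two rows.
LISTED_AVG = {"OpenCLIP-G": 85.0, "EVA-02-CLIP-E+": 85.1}


def average_recall(values):
    """Arithmetic mean of the R@K values, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


if __name__ == "__main__":
    for model, values in RECALLS.items():
        computed = average_recall(values)
        print(f"{model}: computed={computed} listed={LISTED_AVG[model]}")
```

Both computed means agree with the listed averages, so the same function can be applied to any other row whose twelve recall values are known.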

Core symbols (most depended-on inside this repo)

| symbol | called by | file |
| --- | --- | --- |
| to | 310 | clip_benchmark/clip_benchmark/models/japanese_clip.py |
| register_conv_template | 66 | internvl_chat/internvl/conversation.py |
| update | 66 | classification/utils.py |
| from_pretrained | 58 | internvl_chat/internvl/model/internvl_chat/configuration_intern_vit.py |
| from_pretrained | 44 | internvl_chat_llava/llava/model/language_model/mpt/adapt_tokenizer.py |
| tokenizer_image_token | 36 | internvl_chat_llava/llava/mm_utils.py |
| append_message | 27 | internvl_chat/internvl/conversation.py |
| state_dict | 27 | classification/utils.py |

Shape

Method 1,142
Function 631
Class 370
Route 17

Languages

Python 100%
TypeScript 1%

Modules by API surface

internvl_chat/internvl/model/internlm2/modeling_internlm2.py · 71 symbols
internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/modeling_evaclip.py · 66 symbols
internvl_g/internvl/model/internvl_stage2_retrieval/modeling_qllama.py · 52 symbols
internvl_g/internvl/model/internvl_stage2/modeling_qllama.py · 52 symbols
internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_qllama.py · 52 symbols
clip_benchmark/clip_benchmark/models/internvl_huggingface/modeling_qllama.py · 52 symbols
internvl_chat_llava/llava/train/train_custom.py · 41 symbols
classification/models/intern_vit_6b.py · 38 symbols
internvl_g/internvl/model/internvl_stage2_retrieval/modeling_internvl.py · 37 symbols
internvl_g/internvl/model/internvl_stage2/modeling_internvl.py · 37 symbols
clip_benchmark/clip_benchmark/models/internvl_c_pytorch/internvl_c.py · 36 symbols
classification/dataset/cached_image_folder.py · 36 symbols

Dependencies from manifests, versioned

bitsandbytes 0.41.0 · 1×
open_clip_torch 0.2.1 · 1×
peft 0.4.0 · 1×
protobuf 3.20.3 · 1×
scikit-learn 1.0 · 1×
sentencepiece 0.1.99 · 1×
shortuuid
tensorflow 2.11.0 · 1×
tokenizers 0.15.1 · 1×
torch 2.0.1 · 1×
torchvision 0.15.2 · 1×
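
To see how a local environment lines up with the versions listed above, something like the following sketch could be used. It relies only on the standard-library `importlib.metadata` module. Treating the listed versions as exact pins is an assumption (the manifests may specify ranges), `shortuuid` is left out because no version is listed for it, and the names `LISTED` and `compare_installed` are illustrative.

```python
from importlib import metadata

# Versions copied from the dependency list. Treating them as exact pins
# is an assumption; the manifests may use ranges such as ">=".
LISTED = {
    "bitsandbytes": "0.41.0",
    "open_clip_torch": "0.2.1",
    "peft": "0.4.0",
    "protobuf": "3.20.3",
    "scikit-learn": "1.0",
    "sentencepiece": "0.1.99",
    "tensorflow": "2.11.0",
    "tokenizers": "0.15.1",
    "torch": "2.0.1",
    "torchvision": "0.15.2",
}


def compare_installed(listed):
    """Map each package to (listed version, installed version or None)."""
    report = {}
    for name, wanted in listed.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name, (wanted, found) in compare_installed(LISTED).items():
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "ok"
        else:
            status = "differs"
        print(f"{name:16s} listed={wanted:8s} installed={found} [{status}]")
```

The comparison is a plain string match, so a build with a local suffix (for example a CUDA-tagged `torch` wheel) would be reported as "differs" even when the base version is the same.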

For agents

$ claude mcp add InternVL \
  -- python -m otcore.mcp_server <graph>
