github.com/OpenGVLab/InternVL @v1.5.0 sqlite

2,160 symbols · 7,016 edges · 355 files · 375 documented (17%)

README

InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-4V

[Update Blog] [Paper] [InternVL 1.5 Technical Report] [Chat Demo] [HuggingFace Demo] [Quick Start] [Chinese Explanation]

News🚀🚀🚀

2024/04/28: We release the INT8 version of InternVL-Chat-V1-5, see HF link.
2024/04/28: We achieve SOTA performance (75.74) on the Infographics VQA benchmark, see here.
2024/04/18: InternVL-Chat-V1.5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
2024/02/27: InternVL is accepted by CVPR 2024! 🎉
2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
2024/02/21: InternVL-Chat-V1.2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
2024/02/12: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog, SFT data or try our demo. The model is now available on HuggingFace, and both training/evaluation data and scripts are open-sourced.
2024/02/04: InternVL-Chat-V1.1 achieves 44.67% on MMVP, higher than GPT-4V!
2024/01/27: We release a 448-resolution model, achieving 76.6 on MMBench dev, see here.
2024/01/24: InternVL-Chat-V1.1 is released. It supports Chinese and has stronger OCR capability, see here or try our demo.
2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

Documents

How to install InternVL? [link]
How to fine-tune InternVL? [link]
How to evaluate InternVL-Chat-V1-5? [link]
How to evaluate InternVL-Chat-V1-5 using VLMEvalKit? (Recommended) [link]
How to deploy a local demo? [link]
How to run InternVL 1.5-8bit with Nvidia V100 GPU? [link]

Compared with SOTA VLLMs

What is InternVL?

InternVL scales up the ViT to 6B parameters and aligns it with an LLM.

Model Zoo

Vision Large Language Model

Model	Date	Download	Note
InternVL-Chat-V1.5-Int8	2024.04.28	🤗 HF link	The INT8 version of InternVL-Chat-V1-5
InternVL-Chat-V1.5	2024.04.18	🤗 HF link	support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new)
InternVL-Chat-V1.2-Plus	2024.02.21	🤗 HF link	more SFT data and stronger
InternVL-Chat-V1.2	2024.02.11	🤗 HF link	scaling up LLM to 34B
InternVL-Chat-V1.1	2024.01.24	🤗 HF link	support Chinese and stronger OCR
InternVL-Chat-19B-448px	2024.02.03	🤗 HF link	448 resolution
InternVL-Chat-19B	2023.12.25	🤗 HF link	English multimodal dialogue
InternVL-Chat-13B	2023.12.25	🤗 HF link	English multimodal dialogue

Vision-Language Foundation Model

Model	Date	Download	Note
InternViT-6B-448px-V1.5	2024.04.20	🤗 HF link	support dynamic resolution, super strong OCR (🔥new)
InternViT-6B-448px-V1.2	2024.02.11	🤗 HF link	448 resolution
InternViT-6B-448px-V1.0	2024.01.30	🤗 HF link	448 resolution
InternViT-6B-224px	2023.12.22	🤗 HF link	vision foundation model
InternVL-14B-224px	2023.12.22	🤗 HF link	vision-language foundation model

What can InternVL do?

Visual Perception

Linear-Probe Image Classification [see details]

* ViT-22B uses the private JFT-3B dataset.

method	#param	IN-1K	IN-ReaL	IN-V2	IN-A	IN-R	IN-Sketch
OpenCLIP-G	1.8B	86.2	89.4	77.2	63.8	87.8	66.4
DINOv2-g	1.1B	86.5	89.6	78.4	75.9	78.8	62.5
EVA-01-CLIP-g	1.1B	86.5	89.3	77.4	70.5	87.7	63.1
MAWS-ViT-6.5B	6.5B	87.8	-	-	-	-	-
ViT-22B*	21.7B	89.5	90.9	83.2	83.8	87.4	-
InternViT-6B (ours)	5.9B	88.2	90.4	79.9	77.5	89.8	69.1

Semantic Segmentation [see details]

method	decoder	#param (train/total)	crop size	mIoU
OpenCLIP-G (frozen)	Linear	0.3M / 1.8B	512	39.3
ViT-22B (frozen)	Linear	0.9M / 21.7B	504	34.6
InternViT-6B (frozen)	Linear	0.5M / 5.9B	504	47.2 (+12.6)
ViT-22B (frozen)	UperNet	0.8B / 22.5B	504	52.7
InternViT-6B (frozen)	UperNet	0.4B / 6.3B	504	54.9 (+2.2)
ViT-22B	UperNet	22.5B / 22.5B	504	55.3
InternViT-6B	UperNet	6.3B / 6.3B	504	58.9 (+3.6)
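
The parenthesised gains in the mIoU column appear to be measured against the ViT-22B row that uses the same decoder and training setting. That reading is inferred from the numbers and is not stated in the table. A minimal sketch that recomputes the gains under this assumption (the dictionary keys are labels chosen here for illustration):

```python
# mIoU values copied from the semantic segmentation table above.
# Keys are illustrative labels for each decoder / training setting.
vit_22b_miou = {
    "Linear (frozen)": 34.6,
    "UperNet (frozen)": 52.7,
    "UperNet (full)": 55.3,
}
internvit_6b_miou = {
    "Linear (frozen)": 47.2,
    "UperNet (frozen)": 54.9,
    "UperNet (full)": 58.9,
}


def miou_gains(ours, baseline):
    """Per-setting mIoU difference, rounded to one decimal place."""
    return {setting: round(ours[setting] - baseline[setting], 1) for setting in ours}


gains = miou_gains(internvit_6b_miou, vit_22b_miou)
for setting, gain in gains.items():
    print(f"{setting}: +{gain}")
```

The recomputed differences agree with the (+12.6), (+2.2) and (+3.6) annotations in the table.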

Zero-Shot Image Classification [see details]

method	IN-1K	IN-A	IN-R	IN-V2	IN-Sketch	ObjectNet
OpenCLIP-G	80.1	69.3	92.1	73.6	68.9	73.0
EVA-02-CLIP-E+	82.0	82.1	94.5	75.7	71.6	79.6
ViT-22B*	85.9	90.1	96.0	80.9	-	87.6
InternVL-C (ours)	83.2	83.8	95.5	77.3	73.9	80.6

Multilingual Zero-Shot Image Classification [see details]

EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian

method	IN-1K (EN)	IN-1K (ZH)	IN-1K (JP)	IN-1K (AR)	IN-1K (IT)
Taiyi-CLIP-ViT-H	-	54.4	-	-	-
WuKong-ViT-L-G	-	57.5	-	-	-
CN-CLIP-ViT-H	-	59.6	-	-	-
AltCLIP-ViT-L	74.5	59.6	-	-	-
EVA-02-CLIP-E+	82.0	-	-	-	41.2
OpenCLIP-XLM-R-H	77.0	55.7	53.1	37.0	56.8
InternVL-C (ours)	83.2	64.5	61.5	44.9	65.7

Zero-Shot Video Classification [see details]

method	#frame	K400	K600	K700
OpenCLIP-G	1	65.9	66.1	59.2
EVA-02-CLIP-E+	1	69.8	69.3	63.4
InternVL-C (ours)	1	71.0	71.3	65.7
ViCLIP	8	75.7	73.5	66.4
InternVL-C (ours)	8	79.4	78.8	71.5

Cross-Modal Retrieval

English Zero-Shot Image-Text Retrieval [see details]

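model	Flickr30K image-to-text R@1	Flickr30K image-to-text R@5	Flickr30K image-to-text R@10	Flickr30K text-to-image R@1	Flickr30K text-to-image R@5	Flickr30K text-to-image R@10	COCO image-to-text R@1	COCO image-to-text R@5	COCO image-to-text R@10	COCO text-to-image R@1	COCO text-to-image R@5	COCO text-to-image R@10	avg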
OpenCLIP-G	92.9	99.3	99.8	79.5	95.0	97.1	67.3	86.9	92.6	51.4	74.9	83.0	85.0
EVA-02-CLIP-E+	93.9	99.4	99.8	78.8	94.2	96.8	68.8	87.8	92.8	51.1	75.0	82.7	85.1
EVA-CLIP-8B	95.6	99.6	99.9	80.8	95.5	97.6	70.3	89.3	93.9	53.0	76.0	83.4
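
The avg column in the retrieval table appears to be the unweighted mean of the twelve recall values in each row. This is an inference from the numbers, not something the table states. A small sketch that checks it against the two rows that list an average (the EVA-CLIP-8B row has no avg value in this copy, so it is left out):

```python
from statistics import mean

# Recall values copied from the table above, in column order:
# Flickr30K image-to-text (R@1, R@5, R@10), Flickr30K text-to-image,
# COCO image-to-text, COCO text-to-image.
recalls_by_model = {
    "OpenCLIP-G": [92.9, 99.3, 99.8, 79.5, 95.0, 97.1,
                   67.3, 86.9, 92.6, 51.4, 74.9, 83.0],
    "EVA-02-CLIP-E+": [93.9, 99.4, 99.8, 78.8, 94.2, 96.8,
                       68.8, 87.8, 92.8, 51.1, 75.0, 82.7],
}

# Averages as printed in the table.
reported_avg = {"OpenCLIP-G": 85.0, "EVA-02-CLIP-E+": 85.1}


def row_average(recalls):
    """Unweighted mean of the recall values, rounded to one decimal place."""
    return round(mean(recalls), 1)


for model, recalls in recalls_by_model.items():
    print(model, row_average(recalls), reported_avg[model])
```

Both recomputed means match the reported averages, which supports the reading that avg summarises all twelve recall columns equally.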

Core symbols most depended-on inside this repo

called by 310

clip_benchmark/clip_benchmark/models/japanese_clip.py

register_conv_template

called by 66

internvl_chat/internvl/conversation.py

update

called by 66

classification/utils.py

from_pretrained

called by 58

internvl_chat/internvl/model/internvl_chat/configuration_intern_vit.py

from_pretrained

called by 44

internvl_chat_llava/llava/model/language_model/mpt/adapt_tokenizer.py

tokenizer_image_token

called by 36

internvl_chat_llava/llava/mm_utils.py

append_message

called by 27

internvl_chat/internvl/conversation.py

state_dict

called by 27

classification/utils.py

Shape

Method 1,142

Function 631

Class 370

Route 17

Languages

Python 100%

TypeScript 1%

Modules by API surface

internvl_chat/internvl/model/internlm2/modeling_internlm2.py · 71 symbols

internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/modeling_evaclip.py · 66 symbols

internvl_g/internvl/model/internvl_stage2_retrieval/modeling_qllama.py · 52 symbols

internvl_g/internvl/model/internvl_stage2/modeling_qllama.py · 52 symbols

internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_qllama.py · 52 symbols

clip_benchmark/clip_benchmark/models/internvl_huggingface/modeling_qllama.py · 52 symbols

internvl_chat_llava/llava/train/train_custom.py · 41 symbols

classification/models/intern_vit_6b.py · 38 symbols

internvl_g/internvl/model/internvl_stage2_retrieval/modeling_internvl.py · 37 symbols

internvl_g/internvl/model/internvl_stage2/modeling_internvl.py · 37 symbols

clip_benchmark/clip_benchmark/models/internvl_c_pytorch/internvl_c.py · 36 symbols

classification/dataset/cached_image_folder.py · 36 symbols

Dependencies from manifests, versioned

accelerate · 1×

bitsandbytes 0.41.0 · 1×

open_clip_torch 0.2.1 · 1×

peft 0.4.0 · 1×

protobuf 3.20.3 · 1×

scikit-learn 1.0 · 1×

sentencepiece 0.1.99 · 1×

shortuuid · 1×

tensorflow 2.11.0 · 1×

tokenizers 0.15.1 · 1×

torch 2.0.1 · 1×

torchvision 0.15.2 · 1×

For agents

$ claude mcp add InternVL \
  -- python -m otcore.mcp_server <graph>