[Update Blog] [Paper] [InternVL 1.5 Technical Report] [Chat Demo] [HuggingFace Demo] [Quick Start] [中文解读]
- 2024/04/28: We release the INT8 version of InternVL-Chat-V1-5, see HF link.
- 2024/04/28: We achieve the SOTA performance (75.74) on the Infographics VQA benchmark, see here.
- 2024/04/18: InternVL-Chat-V1.5 has been released at HF link, approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.
- 2024/02/27: InternVL is accepted by CVPR 2024! 🎉
- 2024/02/24: InternVL-Chat models have been included in the VLMEvalKit.
- 2024/02/21: InternVL-Chat-V1.2-Plus achieves SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our blog for more details.
- 2024/02/12: InternVL-Chat-V1.2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our blog, SFT data or try our demo. The model is now available on HuggingFace, and both training/evaluation data and scripts are open-sourced.
- 2024/02/04: InternVL-Chat-V1.1 achieves 44.67% on MMVP, higher than GPT-4V!
- 2024/01/27: We release 448 resolution model, achieving 76.6 on MMBench dev, see here.
- 2024/01/24: InternVL-Chat-V1.1 is released, it supports Chinese and has stronger OCR capability, see here or try our demo.
- 2024/01/16: We release our customized mmcv/mmsegmentation/mmdetection code, integrated with DeepSpeed, which can be used for training large-scale object detection and semantic segmentation models.

InternVL scales up the ViT to 6B parameters and aligns it with LLM.
Vision Large Language Model
| Model | Date | Download | Note |
|---|---|---|---|
| InternVL-Chat-V1.5-Int8 | 2024.04.28 | 🤗 HF link | The INT8 version of InternVL-Chat-V1-5 |
| InternVL-Chat-V1.5 | 2024.04.18 | 🤗 HF link | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 HF link | more SFT data and stronger |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 HF link | scaling up LLM to 34B |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 HF link | support Chinese and stronger OCR |
| InternVL-Chat-19B-448px | 2024.02.03 | 🤗 HF link | 448 resolution |
| InternVL-Chat-19B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
| InternVL-Chat-13B | 2023.12.25 | 🤗 HF link | English multimodal dialogue |
Vision-Language Foundation Model
| Model | Date | Download | Note |
|---|---|---|---|
| InternViT-6B-448px-V1.5 | 2024.04.20 | 🤗 HF link | support dynamic resolution, super strong OCR (🔥new) |
| InternViT-6B-448px-V1.2 | 2024.02.11 | 🤗 HF link | 448 resolution |
| InternViT-6B-448px-V1.0 | 2024.01.30 | 🤗 HF link | 448 resolution |
| InternViT-6B-224px | 2023.12.22 | 🤗 HF link | vision foundation model |
| InternVL-14B-224px | 2023.12.22 | 🤗 HF link | vision-language foundation model |
Visual Perception
Note: ViT-22B (marked with * in the tables below) uses the private JFT-3B dataset.
| method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
|---|---|---|---|---|---|---|---|
| OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
| DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
| EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
| MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
| ViT-22B* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
| InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
| method | decoder | #param (train/total) | crop size | mIoU |
|---|---|---|---|---|
| OpenCLIP-G (frozen) | Linear | 0.3M / 1.8B | 512 | 39.3 |
| ViT-22B (frozen) | Linear | 0.9M / 21.7B | 504 | 34.6 |
| InternViT-6B (frozen) | Linear | 0.5M / 5.9B | 504 | 47.2 (+12.6) |
| ViT-22B (frozen) | UperNet | 0.8B / 22.5B | 504 | 52.7 |
| InternViT-6B (frozen) | UperNet | 0.4B / 6.3B | 504 | 54.9 (+2.2) |
| ViT-22B | UperNet | 22.5B / 22.5B | 504 | 55.3 |
| InternViT-6B | UperNet | 6.3B / 6.3B | 504 | 58.9 (+3.6) |
| method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
|---|---|---|---|---|---|---|
| OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
| EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
| ViT-22B* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
| InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian
| method | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |
|---|---|---|---|---|---|
| Taiyi-CLIP-ViT-H | - | 54.4 | - | - | - |
| WuKong-ViT-L-G | - | 57.5 | - | - | - |
| CN-CLIP-ViT-H | - | 59.6 | - | - | - |
| AltCLIP-ViT-L | 74.5 | 59.6 | - | - | - |
| EVA-02-CLIP-E+ | 82.0 | - | - | - | 41.2 |
| OpenCLIP-XLM-R-H | 77.0 | 55.7 | 53.1 | 37.0 | 56.8 |
| InternVL-C (ours) | 83.2 | 64.5 | 61.5 | 44.9 | 65.7 |
| method | #frame | K400 | K600 | K700 |
|---|---|---|---|---|
| OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
| EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
| InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
| ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
| InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
Cross-Modal Retrieval
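I2T: image-to-text, T2I: text-to-image

| model | Flickr30K I2T R@1 | Flickr30K I2T R@5 | Flickr30K I2T R@10 | Flickr30K T2I R@1 | Flickr30K T2I R@5 | Flickr30K T2I R@10 | COCO I2T R@1 | COCO I2T R@5 | COCO I2T R@10 | COCO T2I R@1 | COCO T2I R@5 | COCO T2I R@10 | avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|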
| OpenCLIP-G | 92.9 | 99.3 | 99.8 | 79.5 | 95.0 | 97.1 | 67.3 | 86.9 | 92.6 | 51.4 | 74.9 | 83.0 | 85.0 |
| EVA-02-CLIP-E+ | 93.9 | 99.4 | 99.8 | 78.8 | 94.2 | 96.8 | 68.8 | 87.8 | 92.8 | 51.1 | 75.0 | 82.7 | 85.1 |
| EVA-CLIP-8B | 95.6 | 99.6 | 99.9 | 80.8 | 95.5 | 97.6 | 70.3 | 89.3 | 93.9 | 53.0 | 76.0 | 83.4 |
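
The avg column in the retrieval table is not defined in the text. A minimal sketch, assuming it is the arithmetic mean of the twelve recall values in each row (an assumption, though it reproduces the reported averages for the two rows that list one), could look like this. The function name `mean_recall` is hypothetical and used only for illustration.

```python
# Recall values copied from the retrieval table, in column order:
# Flickr30K image-to-text R@1/R@5/R@10, Flickr30K text-to-image R@1/R@5/R@10,
# COCO image-to-text R@1/R@5/R@10, COCO text-to-image R@1/R@5/R@10.
recalls = {
    "OpenCLIP-G": [92.9, 99.3, 99.8, 79.5, 95.0, 97.1,
                   67.3, 86.9, 92.6, 51.4, 74.9, 83.0],
    "EVA-02-CLIP-E+": [93.9, 99.4, 99.8, 78.8, 94.2, 96.8,
                       68.8, 87.8, 92.8, 51.1, 75.0, 82.7],
}

# Averages as printed in the table's avg column.
reported_avg = {"OpenCLIP-G": 85.0, "EVA-02-CLIP-E+": 85.1}


def mean_recall(values):
    """Arithmetic mean of the recall values (assumed definition of avg)."""
    return sum(values) / len(values)


for name, values in recalls.items():
    computed = mean_recall(values)
    print(f"{name}: computed {computed:.1f}, reported {reported_avg[name]}")
```

Under this assumption, a row's average rewards balanced performance across both datasets and both retrieval directions rather than a single strong R@1 score.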
$ claude mcp add InternVL \
-- python -m otcore.mcp_server <graph>