github.com/MiniMax-AI/MiniMax-01 @main sqlite

12 symbols · 49 edges · 3 files · 0 documented (0%)

README

MiniMax-01

1. Introduction

We are delighted to introduce two models, MiniMax-Text-01 and MiniMax-VL-01. MiniMax-Text-01 is a language model with 456 billion total parameters, of which 45.9 billion are activated per token. To unlock its long-context capabilities, it adopts a hybrid architecture integrating Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), its training context length extends to 1 million tokens, and it can handle up to 4 million tokens during inference. Consequently, MiniMax-Text-01 shows top-tier performance on various academic benchmarks.

Building on MiniMax-Text-01, we developed MiniMax-VL-01 for enhanced visual capabilities. It uses the "ViT-MLP-LLM" framework common in multimodal LLMs and is initialized and trained using three key components: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM. This model features a dynamic resolution mechanism: input images are resized according to a pre-set grid, with resolutions ranging from 336×336 to 2016×2016, while a 336×336 thumbnail is maintained. The resized images are split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined to form a full image representation. As a result, MiniMax-VL-01 has achieved top-level performance on multimodal leaderboards, demonstrating its edge in complex multimodal tasks.
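The dynamic resolution step described above can be illustrated with a small sketch. The grid range (1×1 to 6×6 cells of 336 pixels, giving 336×336 up to 2016×2016) follows from the resolutions in the text. The patch size of 336×336 and the grid selection rule (closest aspect ratio, then closest area) are assumptions made for illustration; the README does not state how the grid is chosen, and the names `pick_grid` and `patch_boxes` are hypothetical.

```python
from typing import List, Tuple

TILE = 336      # thumbnail side from the text; ASSUMED to also be the patch side
MAX_CELLS = 6   # 2016 / 336, the largest grid side implied by the text


def candidate_grids() -> List[Tuple[int, int]]:
    """All (cols, rows) grids from 1x1 to 6x6, i.e. 336x336 up to 2016x2016."""
    return [(c, r) for c in range(1, MAX_CELLS + 1) for r in range(1, MAX_CELLS + 1)]


def pick_grid(width: int, height: int) -> Tuple[int, int]:
    """ASSUMPTION: choose the grid whose aspect ratio is closest to the image's,
    breaking ties by the resized area closest to the original area."""
    aspect = width / height
    area = width * height

    def cost(grid: Tuple[int, int]) -> Tuple[float, int]:
        cols, rows = grid
        return (abs(cols / rows - aspect), abs(cols * rows * TILE * TILE - area))

    return min(candidate_grids(), key=cost)


def patch_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """(left, top, right, bottom) boxes of the non-overlapping patches, row by row."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = pick_grid(1280, 720)
    print(f"resize to {cols * TILE}x{rows * TILE}, {cols * rows} patches + 1 thumbnail")
```

Under these assumptions, each patch and the 336×336 thumbnail would be passed to the vision encoder separately, and their encodings combined afterwards.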

2. Model Architecture

The architecture of MiniMax-Text-01 is briefly described as follows:
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of layers: 80
- Hybrid Attention: one softmax attention layer is positioned after every 7 lightning attention layers.
  - Number of attention heads: 64
  - Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
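To make the hybrid attention pattern concrete, the following sketch derives a per-layer schedule from the numbers listed above. It is only an illustration: the zero-based layer indexing and the name `layer_types` are assumptions, and the repository's actual configuration format is not shown in this section.

```python
from typing import List

NUM_LAYERS = 80            # from the list above
LIGHTNING_PER_SOFTMAX = 7  # one softmax layer after every 7 lightning layers
HEAD_DIM = 128             # attention head dimension
NUM_EXPERTS = 32
TOP_K = 2                  # top-2 routing


def layer_types(num_layers: int = NUM_LAYERS,
                lightning_run: int = LIGHTNING_PER_SOFTMAX) -> List[str]:
    """Return the attention type of each layer, using zero-based indices.

    Every block of (lightning_run + 1) layers ends with a softmax layer.
    """
    period = lightning_run + 1
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]


types = layer_types()
# RoPE is applied to half of each head's dimensions.
rotary_dim = HEAD_DIM // 2
# Fraction of experts that process any single token under top-2 routing.
active_expert_fraction = TOP_K / NUM_EXPERTS

if __name__ == "__main__":
    print(types.count("lightning"), "lightning layers,",
          types.count("softmax"), "softmax layers")
    print("rotary dims per head:", rotary_dim)
    print("experts active per token:", active_expert_fraction)
```

With 80 layers this gives 70 lightning attention layers and 10 softmax attention layers, with rotary embeddings covering 64 of the 128 dimensions in each head.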

For MiniMax-VL-01, the additional ViT architecture details are as follows:
- Total Parameters: 303M
- Number of layers: 24
- Patch size: 14
- Hidden size: 1024
- FFN hidden size: 4096
- Number of heads: 16
- Attention head dimension: 64
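A short arithmetic check shows how the ViT numbers fit together. It assumes the encoder receives 336×336 inputs (the thumbnail size given in the introduction) and counts only patch tokens; whether a class token is added or tokens are pooled afterwards is not stated here. The helper names are hypothetical.

```python
PATCH_SIZE = 14   # from the list above
HIDDEN_SIZE = 1024
NUM_HEADS = 16
HEAD_DIM = 64


def patches_per_side(image_side: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patches along one side of a square image."""
    if image_side % patch_size != 0:
        raise ValueError(f"{image_side} is not divisible by patch size {patch_size}")
    return image_side // patch_size


def patch_tokens(image_side: int) -> int:
    """Total patch tokens for a square image (no class token, no pooling)."""
    n = patches_per_side(image_side)
    return n * n


if __name__ == "__main__":
    # ASSUMPTION: the ViT sees 336x336 inputs.
    print("patch tokens per 336x336 input:", patch_tokens(336))
    print("heads x head dim == hidden size:", NUM_HEADS * HEAD_DIM == HIDDEN_SIZE)
```

For the ViT, the number of heads times the head dimension equals the hidden size (16 × 64 = 1024). The MiniMax-Text-01 figures listed earlier do not follow the same identity (64 × 128 = 8192 against a hidden size of 6144).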

3. Evaluation

Text Benchmarks

Core Academic Benchmarks

Tasks	GPT-4o (11-20)	Claude-3.5-Sonnet (10-22)	Gemini-1.5-Pro (002)	Gemini-2.0-Flash (exp)	Qwen2.5-72B-Inst.	DeepSeek-V3	Llama-3.1-405B-Inst.	MiniMax-Text-01
General
MMLU^*	85.7	88.3	86.8	86.5	86.1	88.5	88.6	88.5
MMLU-Pro^*	74.4	78.0	75.8	76.4	71.1	75.9	73.3	75.7
SimpleQA	39.0	28.1	23.4	26.6	10.3	24.9	23.2	23.7
C-SimpleQA	64.6	56.8	59.4	63.3	52.2	64.8	54.7	67.4
IFEval (avg)	84.1	90.1	89.4	88.4	87.2	87.3	86.4	89.1
Arena-Hard	92.4	87.6	85.3	72.7	81.2	91.4	63.5	89.1
Reasoning
GPQA^* (diamond)	46.0	65.0	59.1	62.1	49.0	59.1	50.7	54.4
DROP^* (F1)	89.2	88.8	89.2	89.3	85.0	91.0	92.5	87.8
Mathematics
GSM8k^*	95.6	96.9	95.2	95.4	95.8	96.7	96.7	94.8
MATH^*	76.6	74.1	84.6	83.9	81.8	84.6	73.8	77.4
Coding
MBPP+	76.2	75.1	75.4	75.9	77.0	78.8	73.0	71.7
HumanEval	90.2	93.7	86.6	89.6	86.6	92.1	89.0	86.9

^* Evaluated following a 0-shot CoT setting.

Long Benchmarks

4M Needle In A Haystack Test

Ruler

| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|-------|----|----|-----|-----|-----|------|------|------|----|
| GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
| Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| MiniMax-Text-01 | 0.963 | 0.961 | 0.953 | 0.954 | 0.943 | 0.947 | 0.945 | 0.928 | 0.910 |

LongBench v2

| Model | overall | easy | hard | short | medium | long |
|----------------------------|-------------|----------|----------|------------|------------|----------|
| Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
| w/ CoT | | | | | | |
| GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
| Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
| Deepseek-V3 | - | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 |

Core symbols (most depended-on inside this repo)

generate_quanto_config

called by 1

inference/minimax-vl-01.py

parse_args

called by 1

inference/minimax-vl-01.py

check_params

called by 1

inference/minimax-vl-01.py

main

called by 1

inference/minimax-vl-01.py

generate_quanto_config

called by 1

inference/minimax-text-01.py

parse_args

called by 1

inference/minimax-text-01.py

check_params

called by 1

inference/minimax-text-01.py

main

called by 1

inference/minimax-text-01.py

Shape

Function 12

Languages

Python 100%

Modules by API surface

inference/minimax-vl-01.py · 5 symbols

inference/minimax-text-01.py · 4 symbols

evaluation/MR-NIAH/score.py · 3 symbols

Dependencies (from manifests, versioned)

accelerate 1.2.1 · 1×

optimum-quanto 0.2.1 · 1×

quanto 0.2.0 · 1×

transformers 4.47.1 · 1×
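Before running the inference scripts, it can help to compare the local environment against the versions listed above. The sketch below uses only the standard library (`importlib.metadata`). Treating the listed versions as exact pins is an assumption, since the manifest's version operators are not shown here, and the helper names `installed_version` and `check_pins` are hypothetical.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions as listed above; ASSUMED to be exact pins.
PINNED: Dict[str, str] = {
    "accelerate": "1.2.1",
    "optimum-quanto": "0.2.1",
    "quanto": "0.2.0",
    "transformers": "4.47.1",
}


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: Dict[str, str] = PINNED,
               lookup: Callable[[str], Optional[str]] = installed_version) -> Dict[str, str]:
    """Map each package name to 'ok', 'missing', or a mismatch description."""
    report: Dict[str, str] = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {found}, listed {wanted})"
    return report


if __name__ == "__main__":
    for name, status in check_pins().items():
        print(f"{name}: {status}")
```

The `lookup` parameter exists so the check can be exercised without the packages installed, for example by passing a function that returns fixed version strings.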

For agents

$ claude mcp add MiniMax-01 \
  -- python -m otcore.mcp_server <graph>
