github.com/2U1/Qwen-VL-Series-Finetune @main sqlite

239 symbols 979 edges 34 files 40 documented · 17%

README

Fine-tuning Qwen-VL Series

This repository contains a script for training Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3.5 using only HuggingFace and Liger-Kernel.

Other projects

[Phi3-Vision Finetuning]

[Llama3.2-Vision Finetuning]

[Molmo Finetune]

[Pixtral Finetune]

[SmolVLM Finetune]

[Gemma3 Finetune]

Update

[2026/05/18] 🔥Upgrade to liger_kernel==0.8.0. Liger 0.8.0 adds official patches for qwen3_5 / qwen3_5_moe and ships LigerExperts, a fused MoE expert kernel that auto-accelerates qwen3_vl_moe and qwen3_5_moe under --use_liger_kernel True. The 0.7-era hardcoded fallback that force-disabled Liger for Qwen3.5 in SFT/DPO/GRPO has been removed, and the mm_token_type_ids GRPO wrapper is now skipped automatically on Liger ≥ 0.8.0 (kept as a no-op shim for older installs).
[2026/03/07] 🔥Supports reasoning mode training for Qwen3-VL and Qwen3.5
[2026/03/07] 🔥Supports Qwen3.5 Series.
[2026/03/07] Supports Qwen3-VL classification
[2026/03/07] Update codebase to transformers==5.3.0
[2025/11/28] 🔥Supports video training with DPO and GRPO.
[2025/11/27] 🔥Supports Qwen3-VL-MoE
[2025/11/26] Update support for liger-kernel in Qwen3-VL.
[2025/10/16] 🔥Supports Qwen3-VL(non-moe)
[2025/08/21] Add option to use a 2-layer MLP for classification.
[2025/08/21] Add option to unfreeze only a few layers of the LLM and vision tower.
[2025/08/08] 🔥Monkey patch Qwen2.5-VL's window attention and forward to use less memory and run faster.
[2025/07/25] Updated Classification training script.
[2025/05/29] 🔥Supports GRPO training.
[2025/04/16] 🔥Supports DPO training.
[2025/03/04] Add Option for using liger kernel.
[2025/02/18] 🔥Supports mixed-modality dataset with zero3.
[2025/02/05] Fixed the code to use images properly.
[2025/02/03] Support Liger-kernel for Qwen2.5-VL.
[2025/02/03] 🔥Supports Qwen2.5-VL.
[2025/01/24] Add option for using DoRA.
[2025/01/24] Fix error in LoRA training.
[2025/01/18] 🔥Supports mixed-modality data.
[2024/09/12] 🔥Now the model is trained using Liger-Kernel.
[2024/09/11] Supports setting different learning rates to projector and vision model.
[2024/09/11] 🔥Supports multi-image and video training.

Fine-tuning Qwen-VL Series
Other projects
Update
Table of Contents
Supported Features
Docker
Installation
- Environments
- Using requirements.txt
- Using environment.yaml
Training Notes
Dataset Preparation
- Reasoning Format
Supervised Fine Tuning
- Full Finetuning
- Finetune with LoRA
- Train with video dataset
- Image Resolution for vram usage
- Merge LoRA Weights
- Evaluation during Training
- Step 1: Prepare Evaluation Dataset
- Step 2: Define compute_metrics Function
- Step 3: Modify Training Script
- Step 4: Add Evaluation Arguments
DPO Finetuning
GRPO Finetuning
- Prerequisites
Classification Finetuning
- Experimental Features
Inference
- Gradio Inference (WebUI)
Issue for libcudnn error
TODO
Known Issues
License
Citation
Acknowledgement

[!WARNING] Read Training Notes before running any training script. It contains required settings and compatibility notes for Qwen3.5, QLoRA + vision, QLoRA + liger, DeepSpeed, and video training.

Supported Features

Deepspeed
LoRA/QLoRA
Full-finetuning
Enable finetuning vision_model while using LoRA
Unfreeze only the top-k layers
Disable/enable Flash Attention 2
Multi-image and video training
Training optimized with liger kernel
Mixed-modality dataset
Direct Preference Optimization (DPO)
Group Relative Policy Optimization (GRPO)

Docker

To simplify the setup process for training, you can use the provided pre-built environment.

The setup is done in the conda env named train.

You can find more information about the image here.

docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash

Installation

Environments

Ubuntu 22.04
Nvidia-Driver 550.120
Cuda version 12.8

Install the required packages using either requirements.txt or environment.yaml.

Using `requirements.txt`

pip install -r requirements.txt -f https://download.pytorch.org/whl/cu128
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

Using `environment.yaml`

conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

Note: You should install flash-attn after installing the other packages.
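
After installing, a quick check of what actually landed in the environment can save a failed training launch. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the package list is an assumption based on the names mentioned in this README (adjust it to match your `requirements.txt`).

```python
from importlib import metadata

# Distribution names assumed from this README; edit to match your setup.
PACKAGES = ["torch", "transformers", "liger_kernel", "qwen-vl-utils", "flash-attn"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A missing `flash-attn` entry is not necessarily a problem if you plan to train with `--disable_flash_attn2 True`.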

Training Notes

Qwen3.5 series: use --disable_flash_attn2 True for now. In local testing, Flash Attention 2 raised CUDA errors while sdpa was stable. This applies to SFT, CLS, DPO, and GRPO.
QLoRA + vision: do not combine quantization (--bits 4 / --bits 8) with vision training (--vision_lora True, --freeze_vision_tower False, or --unfreeze_topk_vision > 0). Use --bits 16 if you want to train vision-related modules.
QLoRA + liger: disable liger when using QLoRA.
DeepSpeed: zero2 is usually faster and often more stable than zero3, but it uses more memory.
Video: do not set fps and nframes at the same time.
Top-k unfreeze: if you use --unfreeze_topk_llm or --unfreeze_topk_vision, keep the corresponding base module frozen first with --freeze_llm True or --freeze_vision_tower True.
Learning rates: vision_model usually works better with a learning rate about 5x to 10x smaller than language_model.
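
The compatibility rules above can be checked before launching a run. The sketch below is illustrative only: the `TrainFlags` dataclass, the `check_flags` helper, the `model_id` field, the default values, and the substring test used to detect Qwen3.5 are all assumptions made for this example. Only the flag names themselves come from the notes above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainFlags:
    # Hypothetical container; field names mirror the CLI flags in the notes.
    model_id: str
    bits: int = 16
    disable_flash_attn2: bool = False
    use_liger_kernel: bool = True
    vision_lora: bool = False
    freeze_vision_tower: bool = True
    freeze_llm: bool = True
    unfreeze_topk_llm: int = 0
    unfreeze_topk_vision: int = 0
    fps: Optional[float] = None
    nframes: Optional[int] = None


def check_flags(f: TrainFlags) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    quantized = f.bits in (4, 8)
    trains_vision = (
        f.vision_lora or not f.freeze_vision_tower or f.unfreeze_topk_vision > 0
    )
    # Assumption: the model family is recognizable from the model id string.
    if "qwen3.5" in f.model_id.lower() and not f.disable_flash_attn2:
        problems.append("Qwen3.5: set --disable_flash_attn2 True")
    if quantized and trains_vision:
        problems.append("QLoRA + vision: use --bits 16 to train vision modules")
    if quantized and f.use_liger_kernel:
        problems.append("QLoRA + liger: set --use_liger_kernel False")
    if f.fps is not None and f.nframes is not None:
        problems.append("Video: set only one of fps and nframes")
    if f.unfreeze_topk_llm > 0 and not f.freeze_llm:
        problems.append("Top-k unfreeze: set --freeze_llm True")
    if f.unfreeze_topk_vision > 0 and not f.freeze_vision_tower:
        problems.append("Top-k unfreeze: set --freeze_vision_tower True")
    return problems
```

For example, `check_flags(TrainFlags(model_id="Qwen/Qwen3.5-example"))` reports the Flash Attention 2 rule, where `Qwen/Qwen3.5-example` is a placeholder model id.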

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, every image token should be <image>, and the image file names should be given in a list.

Please see the examples below and format your data accordingly.

Example for single image dataset

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]

Example for multi image dataset

[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]

Example for video dataset

[
  {
    "id": "sample1",
    "video": "sample1.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nWhat is going on in this video?"
      },
      {
        "from": "gpt",
        "value": "A man is walking down the road."
      }
    ]
  }
  ...
]

Example for DPO dataset

[
  {
    "id": "e4cd017f-3854-4019-9cc1-9dacacb5d424",
    "image": "example_0_0.png",
    "prompt": "<image>\nDescribe the key features of the image in great detail.",
    "chosen": "The image features a modern, multi-colored train parked on a train track, occupying a significant portion of the scene. The train has a unique and striking design, with red, white, and blue colors, as well as yellow accents. It appears to be a high-speed train, indicating its advanced engineering and potential for fast travel.\n\nThe train is situated next to a platform, suggesting that it may be at a train station. The overall scene conveys a sense of transportation and modern technology, with the vibrant and sleek train as the main subject.",
    "rejected": "The image features a modern, colorful train parked on a set of train tracks. The train has a striking design, with red, white, and blue colors as well as yellow accents. It appears to be a high-speed train, ready for departure.\n\nThe train spans a significant portion of the image, from the left to the right side. The train tracks can be seen clearly beneath the train, emphasizing its position on the railway line. The overall scene gives off an impression of a contemporary and efficient mode of transportation."
  },
  {
    "id": "5e19e647-e5d3-4bcf-82e9-d262570743ae",
    "image": "example_1_0.png",
    "prompt": "<image>\nIs this bus in the USA?",
    "chosen": "Yes, based on the image, it can be assumed that this bus is in the USA. The location of the bus cannot be accurately determined.",
    "rejected": "No, it's not in the USA. The image does not provide specific information on where the bus is located. However, we can say that it's not in the United States."
  }
  ...
]

Example for GRPO dataset

[
  {
    "id": "06bc8a17-bb1c-4007-8c08-92c41e2628b2",
    "image": "image_2.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nBased on the image, which geometric method is used to determine the bearing angle, and why is it the most appropriate choice?"
      },
      {
        "from": "gpt",
        "reasoning": "Let's analyze the image step-by-step. The image shows a right-angled triangle with points B, C, and A. The angle at point B is a right angle, indicating that trigonometric functions can be applied. To find the bearing angle, we need to relate the sides of the triangle. The tangent function is suitable here because it relates the opposite side (BC) to the adjacent side (AB) in a right-angled triangle. By using the tangent function, we can calculate the angle at point A, which is the bearing angle. Therefore, the most appropriate geometric method is the use of trigonometric functions.",
        "value": "<answer>A</answer>"
      }
    ]
  }
  ...
]
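
The `<answer>...</answer>` wrapper in the example above makes the final answer easy to extract for rule-based scoring. The repository's own reward functions are not shown in this section, so the following is only an illustrative sketch: `extract_answer` and `exact_match_reward` are hypothetical names, and exact string match is an assumed scoring rule.

```python
import re
from typing import Optional

# Non-greedy match so multiple <answer> blocks are captured separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer> block, or None if there is none."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


def exact_match_reward(completion: str, reference: str) -> float:
    """Score 1.0 when the extracted answers are identical, else 0.0."""
    predicted = extract_answer(completion)
    expected = extract_answer(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if predicted == expected else 0.0
```

Here `reference` would be the `value` field of the `gpt` turn, and `completion` a generated response.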

Reasoning Format

You can keep using the normal dataset format, but if you want to train with an explicit reasoning trace you should add a separate reasoning field instead of manually concatenating <think>...</think> into value.

Use --enable_reasoning True only for the following model families:

Qwen/Qwen3-VL-*-Thinking
Qwen/Qwen3.5-*

When --enable_reasoning True is set, the dataset pipeline follows the official chat template behavior for supported models:

The assistant prompt scaffold is treated as prompt-only and masked out from the loss.
If a reasoning field is present, the prompt is prefixed with the model's reasoning prefill, such as <|im_start|>assistant\n<think>\n, and the label starts from the reasoning body.
The reasoning field is inserted into the reasoning block.
The value field is treated as the final answer body after the reasoning block.
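
The split between masked prompt and trained label described above can be sketched as follows. This is not the repository's implementation: `split_assistant_turn` is a hypothetical helper, the closing string `\n</think>\n\n` is an assumption modeled on the Qwen chat template, and end-of-turn tokens and tokenization are omitted. The sketch covers only turns that carry a `reasoning` field.

```python
from typing import Dict, Tuple

ASSISTANT_PREFIX = "<|im_start|>assistant\n"
THINK_OPEN = "<think>\n"
THINK_CLOSE = "\n</think>\n\n"  # assumption: closing format of the reasoning block


def split_assistant_turn(turn: Dict[str, str]) -> Tuple[str, str]:
    """Return (prompt_part, label_part) for an assistant turn with reasoning.

    prompt_part is masked out of the loss; label_part is what the model learns.
    """
    reasoning = turn.get("reasoning")
    if reasoning is None:
        # Turns without a reasoning field are outside the scope of this sketch.
        raise ValueError("turn has no 'reasoning' field")
    prompt_part = ASSISTANT_PREFIX + THINK_OPEN
    label_part = reasoning + THINK_CLOSE + turn["value"]
    return prompt_part, label_part
```

With the GRPO-style turn shown earlier, the prompt part ends at `<think>\n` and the label begins with the first word of the reasoning text, followed by the closed reasoning block and the `<answer>` body.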

This is intended to make training-time formatting match the model's default inference-time chat template as closely as possible for supported reasoning models.

For unsupported models such as Qwen2-VL, Qwen2.5-VL, and non-thinking Qwen3-VL-Instruct, --enable_reasoning True raises an error on purpose.

Qwen3.5 special case

Qwen3.5 is the only supported family where samples may mix reasoning and non-reasoning data under --enable_reasoning True.
If a Qwen3.5 sample has a reasoning field, the prompt uses the open thinking scaffold and the label starts from the reasoning body.
If a Qwen3.5 sample does not have a reasoning field, the dataset uses the official non-thinking scaffold <think>\n\n</think>\n\n as prompt-only and trains only on the

Core symbols most depended-on inside this repo

_flatten_vision_features (src/train/monkey_patch_forward.py): called by 15
get_image_features (src/model/modeling_cls.py): called by 14
get_input_embeddings (src/model/modeling_cls.py): called by 10
pad_sequence (src/dataset/data_utils.py): called by 9
get_mm_token_type_ids (src/dataset/data_utils.py): called by 9
get_video_features (src/model/modeling_cls.py): called by 9
get_qwen_vl_generation_backbone (src/model/load_model.py): called by 9
get_peft_state_non_lora_maybe_zero_3 (src/train/train_utils.py): called by 8

Shape

Method 111

Function 100

Class 28

Languages

Python 100%

Modules by API surface

src/model/modeling_cls.py: 57 symbols
src/trainer/grpo_trainer.py: 19 symbols
src/train/monkey_patch_forward.py: 17 symbols
src/dataset/data_utils.py: 13 symbols
src/trainer/sft_trainer.py: 11 symbols
src/dataset/cls_dataset.py: 10 symbols
src/trainer/dpo_trainer.py: 9 symbols
src/train/train_cls.py: 9 symbols
src/trainer/cls_trainer.py: 8 symbols
src/dataset/sft_dataset.py: 8 symbols
src/dataset/dpo_dataset.py: 8 symbols
src/train/train_sft.py: 7 symbols

Dependencies from manifests, versioned

GitPython 3.1.44 · 1×
Jinja2 3.1.4 · 1×
MarkupSafe 2.1.5 · 1×
PyYAML 6.0.2 · 1×
Pygments 2.19.1 · 1×
accelerate 1.10.1 · 1×
aiohappyeyeballs 2.6.1 · 1×
aiohttp 3.11.18 · 1×
aiosignal 1.3.2 · 1×
annotated-doc 0.0.4 · 1×
annotated-types 0.7.0 · 1×
asttokens 3.0.0 · 1×

For agents

$ claude mcp add Qwen-VL-Series-Finetune \
  -- python -m otcore.mcp_server <graph>
