
github.com/linkedin/Liger-Kernel @v0.8.0 sqlite


2,101 symbols · 9,646 edges · 267 files · 709 documented (34%)

README

Liger Kernel: Efficient Triton Kernels for LLM Training


Supercharge Your Model with Liger Kernel

With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.

Figures: Speed Up | Memory Reduction

Note:
- Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
- Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.
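To illustrate why large vocabularies put pressure on memory, here is a back-of-the-envelope calculation of the size of a fully materialized logits tensor. It is a sketch under stated assumptions: it treats the benchmark's batch size of 8 and the 4K context as per-device values and uses the LLaMA 3 vocabulary size of 128,256. The helper name `logits_bytes` is illustrative and does not come from the benchmark scripts.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_elem: int) -> int:
    """Size in bytes of a materialized (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


# Assumed values: batch 8, 4K context, LLaMA 3 vocabulary (128,256 tokens), bf16 (2 bytes).
size = logits_bytes(batch_size=8, seq_len=4096, vocab_size=128_256, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB")  # 7.83 GiB for the logits alone
```

Under these assumptions the logits alone take several GiB before any gradient for them is stored, which is the kind of allocation that fusing and chunking the final projection and loss is meant to avoid.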

Optimize Post Training with Liger Kernel


We provide optimized post-training kernels such as DPO, ORPO, and SimPO, which can reduce memory usage by up to 80%. You can use them directly as Python modules.

from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
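For readers unfamiliar with ORPO, the following dependency-free sketch shows the shape of the objective as formulated in the ORPO paper: a negative log-likelihood term on the chosen response plus a weighted odds-ratio penalty. It is not a transcription of the Liger kernel. The function names, the scalar inputs (length-averaged log-probabilities), and the default weight `beta=0.1` are all assumptions made for illustration.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_objective(chosen_logp: float, rejected_logp: float, nll: float, beta: float = 0.1) -> float:
    """NLL on the chosen response plus a weighted odds-ratio penalty."""
    diff = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(diff)) written as log(1 + exp(-diff))
    odds_ratio_term = math.log1p(math.exp(-diff))
    return nll + beta * odds_ratio_term


# The penalty shrinks as the chosen response becomes more likely than the rejected one.
print(orpo_objective(chosen_logp=-0.2, rejected_logp=-1.5, nll=1.0))
print(orpo_objective(chosen_logp=-1.5, rejected_logp=-0.2, nll=1.0))
```

The fused module in the snippet above takes the LM head weight and hidden states rather than precomputed log-probabilities, so that the full logits never need to be materialized at once.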

Examples

Use Case	Description
Hugging Face Trainer	Train LLaMA 3-8B ~20% faster with over 40% memory reduction on Alpaca dataset using 4 A100s with FSDP
Lightning Trainer	Increase throughput by 15% and reduce memory usage by 40% with LLaMA3-8B on MMLU dataset using 8 A100s with DeepSpeed ZeRO3
Medusa Multi-head LLM (Retraining Phase)	Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP
Vision-Language Model SFT	Finetune Qwen2-VL on image-text data using 4 A100s with FSDP
Liger ORPO Trainer	Align Llama 3.2 using the Liger ORPO Trainer with FSDP, with a 50% memory reduction

Key Features

Ease of use: Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
Time and memory efficient: In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
Exact: Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
Lightweight: Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
Multi-GPU supported: Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
Trainer Framework Integration: Axolotl, LLaMa-Factory, SFTTrainer, Hugging Face Trainer, SWIFT, oumi
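The exactness claim in the list above is the kind of property usually checked by comparing a kernel's output with a reference implementation within a small tolerance. The sketch below shows such a check without any dependencies. `assert_close` and its default tolerances are hypothetical and are not the repository's own test helper.

```python
def assert_close(actual, expected, rtol=1e-5, atol=1e-8, max_print=5):
    """Raise AssertionError listing the first mismatches between two flat sequences."""
    if len(actual) != len(expected):
        raise AssertionError(f"length mismatch: {len(actual)} vs {len(expected)}")
    mismatches = [
        (i, a, e)
        for i, (a, e) in enumerate(zip(actual, expected))
        if abs(a - e) > atol + rtol * abs(e)
    ]
    if mismatches:
        shown = "\n".join(
            f"  index {i}: actual={a!r}, expected={e!r}" for i, a, e in mismatches[:max_print]
        )
        raise AssertionError(f"{len(mismatches)} of {len(actual)} elements differ:\n{shown}")


# Passes: the difference is within tolerance.
assert_close([1.0, 2.0], [1.0, 2.0 + 1e-9])
```

Reporting the indices and values of the first few mismatches, rather than a bare failure, makes it easier to tell a systematic error from floating-point noise.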

Installation

Dependencies

CUDA

torch >= 2.1.2
triton >= 2.3.0

ROCm

torch >= 2.5.0: Install according to the instructions on the official PyTorch website.
triton >= 3.0.0: Install from PyPI (e.g., pip install triton==3.0.0).

pip install -e ".[dev]"
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/

Optional Dependencies

transformers >= 4.x: Required if you plan to use the transformers model patching APIs. The specific model you are working with dictates the minimum version of transformers.

Note: Our kernels inherit the full spectrum of hardware compatibility offered by Triton.
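Before installing, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch using only the standard library. The function names are illustrative, the minimums shown are the CUDA ones, and pre-release or local suffixes (such as `+rocm6.3`) are ignored in the comparison. It also assumes the packages are registered under the distribution names `torch` and `triton`, which may not hold for every build.

```python
import re
from importlib import metadata

# Minimum versions for CUDA, taken from the dependency list above.
MINIMUMS = {"torch": "2.1.2", "triton": "2.3.0"}


def version_tuple(version: str) -> tuple:
    """Leading numeric components of a version string, e.g. '2.5.0+rocm6.3' -> (2, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric components only; suffixes such as '.dev' or '+cu121' are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUMS) -> dict:
    """Map each package to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return report


if __name__ == "__main__":
    print(check_environment())
```

For ROCm, the same check can be run with `check_environment({"torch": "2.5.0", "triton": "3.0.0"})`.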

To install the stable version:

$ pip install liger-kernel

To install the nightly version:

$ pip install liger-kernel-nightly

To install from source:

git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Setup Development Dependencies
pip install -e ".[dev]"

# NOTE -> For AMD users only
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/

Getting Started

There are a couple of ways to apply Liger kernels, depending on the level of customization required.

1. Use AutoLigerKernelForCausalLM

Using AutoLigerKernelForCausalLM is the simplest approach, as you don't have to import a model-specific patching API. If the model type is supported, the modeling code is automatically patched using the default settings.

from liger_kernel.transformers import AutoLigerKernelForCausalLM

# This AutoModel wrapper class automatically monkey-patches the
# model with the optimized Liger kernels if the model is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
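The "patched if supported" behavior described above can be pictured as a lookup from model type to a patch function. The sketch below illustrates that idea only. Every name in it (`PATCHERS`, `apply_patch_if_supported`, the two stand-in patch functions) is hypothetical, and the fallback of loading an unsupported model unpatched is an assumption rather than documented behavior.

```python
# Hypothetical registry: model type -> patch function. All names are illustrative.
applied = []


def patch_llama(**kwargs):
    applied.append(("llama", kwargs))


def patch_mistral(**kwargs):
    applied.append(("mistral", kwargs))


PATCHERS = {"llama": patch_llama, "mistral": patch_mistral}


def apply_patch_if_supported(model_type: str, **kwargs) -> bool:
    """Run the registered patch for model_type; return False if none is registered."""
    patcher = PATCHERS.get(model_type)
    if patcher is None:
        return False  # assumed fallback: the model would load unpatched
    patcher(**kwargs)
    return True


print(apply_patch_if_supported("llama"))       # True
print(apply_patch_if_supported("unknown-lm"))  # False
```

A registry like this keeps the caller free of model-specific imports, which is the convenience the auto class offers over the per-model patching functions.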

2. Apply Model-Specific Patching APIs

Using the patching APIs, you can swap the layers of Hugging Face models for optimized Liger kernels.

import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# 1a. Adding this line automatically monkey-patches the model with the optimized Liger kernels
apply_liger_kernel_to_llama()

# 1b. You could alternatively specify exactly which kernels are applied
apply_liger_kernel_to_llama(
  rope=True,
  swiglu=True,
  cross_entropy=True,
  fused_linear_cross_entropy=False,
  rms_norm=False
)

# 2. Instantiate patched model
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")
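The snippet above calls the patch function before instantiating the model. The toy example below sketches why that order matters for class-level monkey-patching: objects built before the patch keep the original class, while objects built afterwards pick up the replacement. The `modeling` namespace and the two classes are stand-ins invented for this illustration and are not part of transformers or Liger Kernel.

```python
import types

# Stand-in for a library's modeling module (hypothetical; not transformers itself).
modeling = types.SimpleNamespace()


class SlowNorm:
    name = "reference"


class FastNorm:
    name = "optimized"


modeling.Norm = SlowNorm


def build_model():
    """Looks up modeling.Norm at call time, as a constructor does with its layer classes."""
    return modeling.Norm()


def apply_patch():
    """Monkey-patch: rebind the module attribute to the optimized class."""
    modeling.Norm = FastNorm


before = build_model()  # built before patching: keeps the reference layer
apply_patch()
after = build_model()   # built after patching: gets the optimized layer
print(before.name, after.name)  # reference optimized
```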

3. Compose Your Own Model

You can use individual kernels to compose your own models.

from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss
import torch.nn as nn
import torch

model = nn.Linear(128, 256).cuda()

# fuses linear + cross entropy layers together and performs chunk-by-chunk computation to reduce memory
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4, ), device="cuda")

loss = loss_fn(model.weight, input, target)
loss.backward()
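The comment in the example above mentions chunk-by-chunk computation. The following pure-Python sketch illustrates the idea for the forward pass only: applying the linear projection and cross entropy one chunk of rows at a time gives the same mean loss as materializing all logits first, while holding only one chunk of logits at a time. It is a conceptual illustration with made-up helper names and toy data, not the Triton implementation.

```python
import math


def linear(weight, x):
    """Logits for one input row: weight is (vocab, hidden), x is (hidden,)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def full_loss(weight, inputs, targets):
    """Reference: materialize every row of logits, then average the losses."""
    all_logits = [linear(weight, x) for x in inputs]
    return sum(cross_entropy(l, t) for l, t in zip(all_logits, targets)) / len(inputs)


def chunked_loss(weight, inputs, targets, chunk_size=2):
    """Same loss, but only chunk_size rows of logits exist at any time."""
    total = 0.0
    for start in range(0, len(inputs), chunk_size):
        chunk = inputs[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        logits = [linear(weight, x) for x in chunk]  # discarded after this iteration
        total += sum(cross_entropy(l, t) for l, t in zip(logits, chunk_targets))
    return total / len(inputs)


weight = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]  # vocab=3, hidden=2
inputs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0], [0.3, 0.3]]
targets = [0, 2, 1, 1, 0]
assert math.isclose(full_loss(weight, inputs, targets), chunked_loss(weight, inputs, targets))
```

Peak memory for the logits then scales with the chunk size instead of the full batch, at the cost of looping over chunks.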

High-level APIs

AutoModel

AutoModel Variant	API
AutoModelForCausalLM	`liger_kernel.transformers.AutoLigerKernelForCausalLM`

Patching

Model	API	Supported Operations
Llama4 (Text) & (Multimodal)	`liger_kernel.transformers.apply_liger_kernel_to_llama4`	RMSNorm, LayerNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy
LLaMA 2 & 3	`liger_kernel.transformers.apply_liger_kernel_to_llama`	RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy
LLaMA 3.2-Vision	`liger_kernel.transformers.apply_liger_kernel_to_mllama`	RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy
Ministral	`liger_kernel.transformers.apply_liger_kernel_to_ministral`	RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy
Mistral	`liger_kernel.transformers.apply_liger_kernel_to_mistral`	RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy
Mixtral	`liger_kernel.transformers.apply_liger_kernel_to_mixtral`	RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy
Nemotron	`liger_kernel.transformer

Core symbols (most depended-on inside this repo)

assert_verbose_allclose

src/liger_kernel/ops/dyt.py

backward

called by 162

benchmark/scripts/benchmark_tiled_mlp.py

run_benchmarks

called by 146

benchmark/scripts/utils.py

infer_device

called by 118

src/liger_kernel/utils.py

_patch_rms_norm_module

called by 91

src/liger_kernel/transformers/monkey_patch.py

fwd_fn

called by 75

benchmark/scripts/benchmark_fused_moe.py

fwd

called by 67

benchmark/scripts/benchmark_tvd.py

Shape

Function 1,307

Method 507

Class 260

Route 27

Languages

Python 100%

Modules by API surface

test/transformers/test_monkey_patch.py: 87 symbols

test/utils.py: 72 symbols

src/liger_kernel/transformers/monkey_patch.py: 50 symbols

test/chunked_loss/test_dpo_loss.py: 42 symbols

benchmark/scripts/benchmark_mhc_lm.py: 37 symbols

test/transformers/test_cross_entropy.py: 34 symbols

src/liger_kernel/ops/backends/_ascend/ops/mhc.py: 32 symbols

src/liger_kernel/ops/mhc.py: 29 symbols

src/liger_kernel/transformers/functional.py: 28 symbols

benchmark/scripts/utils.py: 22 symbols

src/liger_kernel/transformers/swiglu.py: 21 symbols

test/chunked_loss/test_grpo_loss.py: 20 symbols

Used by 1 indexed graph (manifest dependencies, hub-wide)

github.com/PrimeIntellect-ai/verifiers

Dependencies (from manifests, versioned)

accelerate 1.6.0 · 1×

transformers 4.51.3 · 1×

For agents

$ claude mcp add Liger-Kernel \
  -- python -m otcore.mcp_server <graph>
