
github.com/mit-han-lab/llm-awq @main sqlite

597 symbols 2,270 edges 74 files 47 documented · 8%

README

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.


The current release supports:

AWQ search for accurate quantization.
Pre-computed AWQ model zoo for LLMs (Llama-1/2/3, OPT, CodeLlama, StarCoder, Vicuna, VILA, LLaVA; load to generate quantized weights).
Memory-efficient 4-bit Linear in PyTorch.
Efficient CUDA kernel implementation for fast inference (support context and decoding stage).
Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (VILA).
Chunk prefilling for faster prefilling in multi-round Q&A setting.
State-of-the-art prefilling speed of LLMs/VLMs on edge devices: TinyChat 2.0.

Thanks to AWQ, TinyChat can deliver more efficient responses with LLM/VLM chatbots through 4-bit inference.

TinyChat with LLaMA-3-8b on RTX 4090 (W4A16 is 2.7x faster than FP16):

TinyChat with LLaMA-3-8b on Jetson Orin (W4A16 is 2.9x faster than FP16):

TinyChat also supports inference with vision language models (e.g., VILA, LLaVA). In the following examples, W4A16 quantized models from VILA family are launched with TinyChat.

TinyChat with NVILA-8B on RTX 4090 (single-image inputs):


TinyChat with NVILA-8B on RTX 4090 (multi-image inputs):


TinyChat with video reasoning:

https://github.com/user-attachments/assets/b68a7a0d-5175-4030-985b-5ae0ae94f874

Prompt: What might be the next step according to the video?

Answer: The next step in the video could be to place the shaped dough onto a baking sheet and let it rise before baking.

Online demo: https://vila.hanlab.ai

Check out TinyChat, which offers a turn-key solution for on-device inference of LLMs and VLMs on resource-constrained edge platforms. With TinyChat, it is now possible to efficiently run large models on small and low-power devices even without Internet connection!

News

[2025/04] 🔥 AWQ now supports DeepSeek-R1-Distilled models. Try our example here!
[2025/02] AWQ now supports BF16 precision. See example here.
[2024/10] 🔥⚡ Explore advancements in TinyChat 2.0, the latest version with significant advancements in prefilling speed of Edge LLMs and VLMs, 1.5-1.7x faster than the previous version of TinyChat. Please refer to the README and blog for more details.
[2024/05] 🏆 AWQ receives the Best Paper Award at MLSys 2024. 🎉
[2024/05] 🔥 The VILA-1.5 model family which features video understanding is now supported in AWQ and TinyChat. Check out our online demo powered by TinyChat here. Example is here.
[2024/05] 🔥 AMD adopts AWQ to improve LLM serving efficiency.
[2024/04] 🔥 We released AWQ and TinyChat support for the Llama-3 model family! Check out our example here.
[2024/02] 🔥 AWQ has been accepted to MLSys 2024!
[2024/02] 🔥 We supported VILA Vision Language Models in AWQ & TinyChat! Check our latest demos with multi-image inputs!
[2024/02] 🔥 We released a new version of quantized GEMM/GEMV kernels in TinyChat, leading to 38 tokens/second inference speed on NVIDIA Jetson Orin!
[2024/01] 🔥 AWQ has been integrated by Google Vertex AI!
[2023/11] 🔥 AWQ has been integrated by Amazon Sagemaker Containers!
[2023/11] 🔥 We added AWQ support and pre-computed search results for CodeLlama, StarCoder, StableCode models. Check out our model zoo here!
[2023/11] 🔥 AWQ is now integrated natively in Hugging Face transformers through from_pretrained. You can either load quantized models from the Hub or your own HF quantized models.
[2023/10] AWQ is integrated into NVIDIA TensorRT-LLM
[2023/09] AWQ is integrated into Intel Neural Compressor, FastChat, vLLM, HuggingFace TGI, and LMDeploy.
[2023/09] ⚡ Check out our latest TinyChat, which is ~2x faster than the first release on Orin!
[2023/09] ⚡ Check out AutoAWQ, a third-party implementation to make AWQ easier to expand to new models, improve inference speed, and integrate into Huggingface.
[2023/07] 🔥 We released TinyChat, an efficient and lightweight chatbot interface based on AWQ. TinyChat enables efficient LLM inference on both cloud and edge GPUs. Llama-2-chat models are supported! Check out our implementation here.
[2023/07] 🔥 We added AWQ support and pre-computed search results for Llama-2 models (7B & 13B). Check out our model zoo here!
[2023/07] We extended the support for more LLM models including MPT, Falcon, and BLOOM.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
News
Contents
Helpful Links
Install
AWQ Model Zoo
Examples
Usage
Results on Visual Language Models
Reference
Related Projects

Helpful Links

VILA online demo: Visual Language Models efficiently supported by AWQ & TinyChat.
LLM on the Edge: AWQ and TinyChat support edge GPUs such as NVIDIA Jetson Orin.
VLMs on Laptop: Follow the instructions to deploy VLMs on NVIDIA Laptops with TinyChat.
Gradio Server: Try to build your own VLM online demo with AWQ and TinyChat!
QServe: 🔥 [New] Efficient and accurate serving system for large-scale LLM inference.

Install

Clone this repository and navigate to the AWQ folder

git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq

Install Package

conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

For edge devices like Orin, before running the commands above, please:
1. Modify pyproject.toml by commenting out this line.
2. Manually install precompiled PyTorch binaries (>=2.0.0) from NVIDIA. You also need to install torchvision from this website when running NVILA.
3. Set the appropriate Python version for the conda environment (e.g., conda create -n awq python=3.8 -y for JetPack 5).

Install the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel and optimized FP16 kernels (e.g. layernorm, positional encodings).

cd awq/kernels
python setup.py install

Install Flash Attention

pip install flash-attn --no-build-isolation

We recommend starting an interactive Python session and running import flash_attn to check whether FlashAttention-2 was installed successfully. If not, we recommend downloading pre-built wheels from here. Please note:

The PyTorch version must exactly match the version specified in the .whl name;
Try both the cxx11abiTRUE and cxx11abiFALSE wheels if one of them does not work;
It's recommended to match CUDA version specified in the .whl filename, but minor mismatches (e.g. 12.1 vs 12.2, or even 11.8 vs 12.2) usually do not matter.
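The wheel-matching notes above depend on a few values from the local environment. The following is a minimal sketch, assuming a standard PyTorch install, that prints those values and reports whether flash_attn imports. The function name describe_wheel_environment is hypothetical and not part of this repository.

```python
import importlib.util
import platform


def describe_wheel_environment():
    """Collect the values that a pre-built flash-attn wheel name must match."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cxx11abi": None,
        "flash_attn": None,
    }
    # Only import torch if it is installed, so the script never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None on CPU-only builds
        info["cxx11abi"] = torch.compiled_with_cxx11_abi()
    # A failed import here is the signal to fall back to a pre-built wheel.
    try:
        import flash_attn

        info["flash_attn"] = getattr(flash_attn, "__version__", "unknown")
    except Exception as exc:  # ImportError, or a binary mismatch at load time
        info["flash_attn"] = f"not importable: {exc.__class__.__name__}"
    return info


if __name__ == "__main__":
    for key, value in describe_wheel_environment().items():
        print(f"{key:>10}: {value}")
```

If flash_attn is reported as not importable, the printed torch, cuda, and cxx11abi values are the ones to compare against the .whl filename.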

[Optional] To run AWQ and TinyChat with the NVILA model family, please install VILA:

git clone https://github.com/NVlabs/VILA.git
cd VILA
pip install -e .

AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

The detailed support list:

Models	Sizes	INT4-g128	INT3-g128
DeepSeek-R1-Distill	1.5B/7B/8B	✅
Qwen-2.5	7B/72B	✅
NVILA	3B/8B	✅
VILA-1.5	3B/8B/13B/40B	✅	✅
Llama3	8B/70B	✅	✅
VILA	7B/13B	✅
Llama2	7B/13B/70B	✅	✅
LLaMA	7B/13B/30B/65B	✅	✅
OPT	125m/1.3B/2.7B/6.7B/13B/30B	✅	✅
CodeLlama	7B/13B/34B	✅	✅
StarCoder	15.5B	✅	✅
Vicuna-v1.1	7B/13B	✅
LLaVA-v0	13B	✅

Note: The table above lists only the models for which we have prepared AWQ search results. AWQ also supports models such as LLaVA-v1.5 7B, but you may need to run the AWQ search on your own to quantize these models. For our latest VLM NVILA, quantized weights are available here.

Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of AWQ applications under the ./examples directory: Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning). AWQ reduces the GPU memory needed for model serving and speeds up token generation. Its quantization is accurate enough that the models still produce well-reasoned outputs. You should be able to observe memory savings when running the models with 4-bit weights.

Note that we perform AWQ using only textual calibration data, despite running on multi-modal input. Please refer to ./examples for details.


Usage

We provide several sample scripts to run AWQ (please refer to ./scripts). We use Llama3-8B as an example.

Perform AWQ search and save search results (we already did it for you):

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
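This README does not spell out what the search step computes. As a rough illustration of the idea behind activation-aware scaling, the sketch below scales each input channel of a weight matrix by a power of its mean activation magnitude, quantizes, and keeps the exponent that gives the smallest output error. It is a simplified toy and not the code in awq/quantize/auto_scale.py. The names fake_quant, output_error, and search_scales and the random data are hypothetical. It treats each weight row as a single quantization group and omits the scale normalization and weight clipping a full implementation would need.

```python
import random


def fake_quant(row, w_bit=4):
    """Min-max quantize-dequantize one weight row (treated as a single group)."""
    max_int = 2 ** w_bit - 1
    lo, hi = min(row), max(row)
    scale = max(hi - lo, 1e-5) / max_int
    zero = min(max(-round(lo / scale), 0), max_int)
    return [(min(max(round(w / scale) + zero, 0), max_int) - zero) * scale for w in row]


def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def output_error(weight, samples, scales, w_bit=4):
    """Mean squared error between full-precision and scaled-then-quantized outputs."""
    q_weight = []
    for row in weight:
        # Scale each input channel up, quantize, then fold the scale back out.
        q_row = fake_quant([w * s for w, s in zip(row, scales)], w_bit)
        q_weight.append([q / s for q, s in zip(q_row, scales)])
    err, count = 0.0, 0
    for x in samples:
        for ref, got in zip(matvec(weight, x), matvec(q_weight, x)):
            err += (ref - got) ** 2
            count += 1
    return err / count


def search_scales(weight, samples, w_bit=4, n_grid=20):
    """Grid-search alpha in [0, 1) for per-channel scales = mean(|x|) ** alpha."""
    n_in = len(weight[0])
    act = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n_in)]
    best = None
    for step in range(n_grid):
        alpha = step / n_grid
        scales = [max(a ** alpha, 1e-4) for a in act]
        err = output_error(weight, samples, scales, w_bit)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


if __name__ == "__main__":
    random.seed(0)
    n_out, n_in = 8, 16
    weight = [[random.gauss(0, 0.02) for _ in range(n_in)] for _ in range(n_out)]
    # Two input channels carry much larger activations than the rest.
    samples = [[random.gauss(0, 1) * (20.0 if j < 2 else 1.0) for j in range(n_in)]
               for _ in range(32)]
    err, alpha, _ = search_scales(weight, samples)
    baseline = output_error(weight, samples, [1.0] * n_in)
    print(f"baseline error {baseline:.3e}, best error {err:.3e} at alpha={alpha}")
```

Because alpha = 0 leaves every scale at 1, the grid always contains the unscaled baseline, so the selected error can never be worse than plain round-to-nearest in this sketch.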

Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend fake
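"Simulated pseudo quantization" means the weights are rounded to the low-bit grid and immediately converted back to floating point, so accuracy can be measured without any INT4 kernels. The following is a minimal sketch of group-wise min-max quantize-dequantize, assuming an asymmetric scheme with one scale and zero point per group. The function name pseudo_quantize is hypothetical, and the repository's own implementation may differ in details.

```python
import random


def pseudo_quantize(weights, w_bit=4, group_size=128):
    """Quantize then immediately dequantize, one group of weights at a time."""
    max_int = 2 ** w_bit - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # One scale and one integer zero point per group.
        scale = max(hi - lo, 1e-5) / max_int
        zero = min(max(-round(lo / scale), 0), max_int)
        for w in group:
            q = min(max(round(w / scale) + zero, 0), max_int)  # integer in [0, max_int]
            out.append((q - zero) * scale)  # back to floating point
    return out


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(256)]
    restored = pseudo_quantize(weights, w_bit=4, group_size=128)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs rounding error: {max_err:.5f}")
    print(f"distinct values in first group: {len(set(restored[:128]))}")
```

The tensor keeps its original size in memory. Only the values are restricted to at most 2^w_bit levels per group, which is why this mode is suited to accuracy evaluation rather than memory savings.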

Generate real quantized weights (INT4)

mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/llama3-8b-w4-g128-awq.pt
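Real quantized weights save memory because several 4-bit integers share one machine word. The sketch below packs eight 4-bit values into a 32-bit integer and unpacks them again. The helper names pack_int4 and unpack_int4 are hypothetical, and the simple low-nibble-first order is an assumption for illustration: the layout expected by the CUDA kernels in this repository may differ.

```python
def pack_int4(values):
    """Pack 4-bit integers (0..15) into 32-bit words, eight values per word."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    packed = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * j)  # value j occupies bits 4j .. 4j+3
        packed.append(word)
    return packed


def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit integers in their original order."""
    return [(word >> (4 * j)) & 0xF for word in packed for j in range(8)]


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5, 6, 7, 8, 15, 0, 15, 0, 9, 10, 11, 12]
    packed = pack_int4(values)
    assert unpack_int4(packed) == values
    print(f"{len(values)} values stored in {len(packed)} 32-bit words")
```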

Load and evaluate the real quantized model (now you can see smaller GPU memory usage)

python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/llama3-8b-w4-g128-awq.pt
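A back-of-the-envelope estimate shows where the smaller memory footprint comes from. The sketch below counts only the quantized weights and assumes one 16-bit scale and one 16-bit zero point per group. Both storage sizes, the helper names, and the round parameter count are assumptions for illustration. Activations, the KV cache, and any layers left unquantized are ignored.

```python
def bits_per_weight(w_bit, group_size, scale_bits=16, zero_bits=16):
    """Average storage per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size


def weight_gib(num_params, bits):
    """Size in GiB of num_params weights stored at the given bits per weight."""
    return num_params * bits / 8 / 2 ** 30


if __name__ == "__main__":
    num_params = 8e9  # hypothetical round number, not an exact model size
    fp16 = weight_gib(num_params, 16)
    w4 = weight_gib(num_params, bits_per_weight(4, 128))
    print(f"FP16 weights:    {fp16:.2f} GiB")
    print(f"W4-g128 weights: {w4:.2f} GiB ({fp16 / w4:.2f}x smaller)")
```

Under these assumptions a group size of 128 adds 0.25 bits of metadata per weight, so each 4-bit weight costs 4.25 bits, roughly 3.8x less than FP16.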

Results on Visual Language Models

AWQ also seamlessly supports large multi-modal models (LMMs). Please refer to [TinyChat](./tinychat/RE

Core symbols most depended-on inside this repo

from_pretrained · called by 30 · tinychat/models/nvila_qwen2.py
_auto_get_scale · called by 25 · awq/quantize/auto_scale.py
to_gradio_chatbot · called by 15 · tinychat/serve/llava_conv.py
update · called by 14 · tinychat/utils/conversation_utils.py
make_quant_attn · called by 13 · tinychat/modules/fused_attn.py
process_images · called by 12 · tinychat/utils/llava_image_processing.py
insert_prompt · called by 12 · tinychat/utils/prompt_templates.py
copy · called by 12 · tinychat/serve/llava_conv.py

Shape

Method 300 · Function 183 · Class 103 · Route 11

Languages

Python 100%

Modules by API surface

tinychat/utils/prompt_templates.py · 35 symbols
tinychat/serve/controller.py · 30 symbols
tinychat/models/nvila/llava_arch.py · 29 symbols
tinychat/models/internvl/internvit.py · 29 symbols
tinychat/models/qwen2.py · 27 symbols
tinychat/models/llama.py · 24 symbols
tinychat/models/mpt.py · 23 symbols
tinychat/serve/gradio_web_server.py · 20 symbols
tinychat/models/falcon.py · 20 symbols
tinychat/modules/fused_siglipdecoder.py · 18 symbols
awq/quantize/w8a8_linear.py · 17 symbols
tinychat/serve/model_worker_new.py · 15 symbols

Dependencies from manifests, versioned

accelerate 0.34.2 · 1×
attributedict · 1×
fastapi · 1×
gradio 3.35.2 · 1×
gradio_client 0.2.9 · 1×
lm_eval 0.3.0 · 1×
protobuf · 1×
pydantic 1.10.19 · 1×
sentencepiece · 1×
texttable · 1×
tokenizers 0.12.1 · 1×
toml · 1×

For agents

$ claude mcp add llm-awq \
  -- python -m otcore.mcp_server <graph>
