<img alt="KTransformers" src="https://github.com/user-attachments/assets/d5a2492f-a415-4456-af99-4ab102f13f8b" width=50%>
🎯 Overview | 🚀 Inference | 🎓 SFT | 🔥 Citation | 🚀 Roadmap(2026Q2)
KTransformers is a research project focused on efficient inference and fine-tuning of large language models through CPU-GPU heterogeneous computing. The project now exposes two user-facing capabilities from the kt-kernel source tree: Inference and SFT.
CPU-optimized kernel operations for heterogeneous LLM inference.
Key Features:

- AMX/AVX Acceleration: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
- MoE Optimization: Efficient Mixture-of-Experts inference with NUMA-aware memory management
- Quantization Support: CPU-side INT4/INT8 quantized weights, GPU-side GPTQ support
- Easy Integration: Clean Python API for SGLang and other frameworks
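To make the INT8 weight quantization named in the feature list concrete, here is a minimal sketch of symmetric per-group quantization in plain Python. It is an illustration only and not the kt-kernel implementation: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and real kernels operate on packed tensors rather than Python lists.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT8 quantization with a single scale for the whole group."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # An all-zero (or empty) group gets scale 1.0 to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range [-127, 127].
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Map INT8 values back to approximate floating-point weights."""
    return [v * scale for v in quantized]


group = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(group)
restored = dequantize_int8(q, scale)
print(q, scale, restored)
```

The reconstruction error of each weight is at most half the scale, which is why quantizing in smaller groups (each with its own scale) generally preserves more accuracy.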
Quick Start:
cd kt-kernel
pip install .
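Before installing, it can help to check which of the instruction sets listed above (AMX, AVX512, AVX2) the host CPU reports. The sketch below is not part of kt-kernel; it assumes an x86 Linux host where `/proc/cpuinfo` exposes a `flags` line, and the helper names `parse_cpu_flags` and `detect_isa` are hypothetical.

```python
from pathlib import Path
from typing import Dict, Set

# Linux /proc/cpuinfo flag names for the ISA extensions mentioned above.
ISA_FLAGS: Dict[str, Set[str]] = {
    "AMX": {"amx_tile", "amx_int8"},
    "AVX512": {"avx512f"},
    "AVX2": {"avx2"},
}


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def detect_isa(cpuinfo_text: str) -> Dict[str, bool]:
    """Report, per instruction set, whether all of its required flags are present."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: required <= flags for name, required in ISA_FLAGS.items()}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    # On non-Linux hosts the file is absent, so every entry reports False.
    text = path.read_text() if path.exists() else ""
    print(detect_isa(text))
```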
Use Cases:
Performance Examples:

| Model | Hardware Configuration | Total Throughput | Output Throughput |
|-------|------------------------|------------------|-------------------|
| DeepSeek-R1-0528 (FP8) | 8×L20 GPU + Xeon Gold 6454S | 227.85 tokens/s | 87.58 tokens/s (8-way concurrency) |
KTransformers × LLaMA-Factory integration for ultra-large MoE model fine-tuning.

Key Features:

- Multi-Backend Support: CPU/GPU hybrid fine-tuning with INT8/INT4 quantization
- Ultra-Large MoE Support: Fine-tune models like DeepSeek-V3/R1 on limited GPU memory
- Faster than ZeRO-Offload: 6-12x training speedup in benchmarked MoE SFT workloads
- Lower CPU Memory: About half the CPU memory of the previous KT SFT path in the benchmarked setup
- LLaMA-Factory Integration: Seamless integration with the popular LLaMA-Factory fine-tuning framework
| Model | GPU Memory | Training Speed | Hardware |
|---|---|---|---|
| DeepSeek-V3 | ~80GB total | 3.7 it/s | 4x RTX 4090 |
| DeepSeek-R1 | ~80GB total | 3.7 it/s | 4x RTX 4090 |
| Qwen3-30B-A3B | ~24GB total | 8+ it/s | 1x RTX 4090 |
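As a rough reading of the table, the arithmetic below estimates per-GPU memory and the remaining headroom. It assumes the "total" figure is split evenly across GPUs and uses the 24 GB capacity of a single RTX 4090; the table itself only gives approximate totals, so treat the results as ballpark numbers. The helper name `per_gpu_memory_gb` is hypothetical.

```python
RTX_4090_MEMORY_GB = 24  # VRAM capacity of one RTX 4090


def per_gpu_memory_gb(total_gb: float, num_gpus: int) -> float:
    """Per-GPU memory, assuming the total is split evenly (an assumption)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return total_gb / num_gpus


# (approximate total GPU memory in GB, number of RTX 4090s), taken from the table.
rows = {
    "DeepSeek-V3": (80, 4),
    "DeepSeek-R1": (80, 4),
    "Qwen3-30B-A3B": (24, 1),
}

for model, (total_gb, num_gpus) in rows.items():
    used = per_gpu_memory_gb(total_gb, num_gpus)
    headroom = RTX_4090_MEMORY_GB - used
    print(f"{model}: ~{used:.0f} GB per GPU, ~{headroom:.0f} GB headroom")
```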
Quick Start:
cd /path/to/LLaMA-Factory
pip install -e .
pip install -r requirements/ktransformers.txt
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--config_file examples/ktransformers/accelerate/fsdp2_kt_int8.yaml \
src/train.py \
examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml
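If you prefer to drive the same launch from a script (for example, to vary the GPU set), the command above can be assembled in Python. This is only a sketch: `build_launch` is a hypothetical helper, the config paths are the ones from the command above, and it assumes `accelerate` is on `PATH` and that the process runs from the LLaMA-Factory checkout.

```python
import os
import shlex
from typing import Dict, List, Sequence, Tuple


def build_launch(
    gpu_ids: Sequence[int],
    accelerate_config: str,
    train_config: str,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv list and environment for the accelerate launch command."""
    cmd = [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        "src/train.py",
        train_config,
    ]
    # Copy the current environment and restrict the visible GPUs.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    return cmd, env


cmd, env = build_launch(
    gpu_ids=[0, 1, 2, 3],
    accelerate_config="examples/ktransformers/accelerate/fsdp2_kt_int8.yaml",
    train_config="examples/ktransformers/train_lora/qwen3_5moe_lora_sft_kt.yaml",
)
# Print the equivalent shell command for inspection.
print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(cmd)}")
# To actually start training: subprocess.run(cmd, env=env, check=True)
```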
👉 Quick Start → 👉 Full Documentation →
If you use KTransformers in your research, please cite our paper:
@inproceedings{10.1145/3731569.3764843,
title = {KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models},
author = {Chen, Hongtao and Xie, Weiyu and Zhang, Boxin and Tang, Jingqi and Wang, Jiahao and Dong, Jianwei and Chen, Shaoyuan and Yuan, Ziwei and Lin, Chen and Qiu, Chengyu and Zhu, Yuening and Ou, Qingliang and Liao, Jiaqi and Chen, Xianglin and Ai, Zhiyuan and Wu, Yongwei and Zhang, Mingxing},
booktitle = {Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
year = {2025}
}
Developed and maintained by:

- MADSys Lab @ Tsinghua University
- Approaching.AI
- 9#AISoft
- Community contributors
We welcome contributions! Please feel free to submit issues and pull requests.
The original integrated KTransformers framework has been archived to the archive/ directory for reference. The project now organizes the two capabilities above from the kt-kernel source tree for clearer documentation and maintenance.
For the original documentation with full quick-start guides and examples, see:

- archive/README.md (English)
- archive/README_ZH.md (Chinese)