Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome Diffusion Inference, please check 📖Awesome-DiT-Inference. For CUDA learning notes, please check 📖LeetCUDA.
@misc{Awesome-LLM-Inference@2024,
title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
url={https://github.com/xlite-dev/Awesome-LLM-Inference},
note={Open-source software available at https://github.com/xlite-dev/Awesome-LLM-Inference},
author={xlite-dev, liyucheng09 etc},
year={2024}
}
Awesome LLM Inference for Beginners.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.
python3 download_pdfs.py # The code is generated by Doubao AI
| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2024.04 | 🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech) | [docs] | [Open-Sora] | ⭐️⭐️ |
| 2024.04 | 🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aim to reproduce Sora (Open AI T2V model)(@PKU) | [report] | [Open-Sora-Plan] | ⭐️⭐️ |
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️ |
| 2024.05 | 🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️ |
| 2024.06 | 🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️ |
| 2024.07 | 🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
| 2024.07 | 🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️ |
| 2024.11 | 🔥🔥🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA) | [pdf] | [Star-Attention] | ⭐️⭐️ |
| 2024.12 | 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-V3] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[MiniMax-Text-01] MiniMax-01: Scaling Foundation Models with Lightning Attention | [report] | [MiniMax-01] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-R1] | ⭐️⭐️ |
| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️ |
| 2024.12 | 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-V3] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-R1] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(@deepseek-ai) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[FlashMLA] DeepSeek FlashMLA(@deepseek-ai) | ⚠️ | [FlashMLA] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DualPipe] DeepSeek DualPipe(@deepseek-ai) | ⚠️ | [DualPipe] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepEP] DeepSeek DeepEP(@deepseek-ai) | ⚠️ | [DeepEP] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepGEMM] DeepSeek DeepGEMM(@deepseek-ai) | ⚠️ | [DeepGEMM] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[EPLB] DeepSeek EPLB(@deepseek-ai) | ⚠️ | [EPLB] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[3FS] DeepSeek 3FS(@deepseek-ai) | ⚠️ | [3FS] | ⭐️⭐️ |
| 2025.03 | 🔥🔥🔥[Inference System] DeepSeek-V3 / R1 Inference System Overview (@deepseek-ai) | [blog] | ⚠️ | ⭐️⭐️ |
| 2025.02 | 🔥🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs(@fudan.edu.cn) | [pdf] | [MHA2MLA] | ⭐️⭐️ |
| 2025.02 | 🔥🔥[TransMLA] TransMLA: Multi-head Latent Attention Is All You Need(@PKU) | [pdf] | [TransMLA] | ⭐️⭐️ |
| 2025.03 | 🔥🔥[X-EcoMLA] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(@AMD) | [pdf] | ⚠️ | ⭐️⭐️ |
| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2019.10 | 🔥🔥[MP: ZeRO] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com) | [pdf] | [deepspeed] | ⭐️⭐️ |
| 2020.05 | 🔥🔥[TP: Megatron-LM] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
| 2022.05 | 🔥🔥[SP: Megatron-LM] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
| 2023.05 | 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley) | [pdf] | [RingAttention] | ⭐️⭐️ |
| 2023.10 | 🔥🔥[SP: Ring Attention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley) | [pdf] | [RingAttention] | ⭐️⭐️ |
| 2023.11 | 🔥🔥[SP: STRIPED ATTENTION] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc) | [pdf] | [striped_attention] | ⭐️⭐️ |
| 2023.10 | 🔥🔥[SP: DEEPSPEED ULYSSES] DEEPSPEED ULYSSES: SYSTEM OPTIMIZATIONS FOR ENABLING TRAINING OF EXTREME LONG SEQUENCE TRANSFORMER MODELS(@microsoft.com) | [pdf] | [deepspeed] | ⭐️⭐️ |
| 2024.03 | 🔥🔥[CP: Megatron-LM] Megatron-LM: Context parallelism overview(@NVIDIA) | [docs] | [Megatron-LM] | ⭐️⭐️ |
| 2024.05 | 🔥🔥[SP: Unified Sequence Parallel (USP)] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent) | [pdf] | [long-context-attention] | ⭐️⭐️ |
| 2024.11 | 🔥🔥[CP: Meta] Context Parallelis |