github.com/xlite-dev/Awesome-LLM-Inference @v2.6.20 sqlite

README

📒Introduction

Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome Diffusion Inference, please check 📖Awesome-DiT-Inference. For CUDA learning notes, please check 📖LeetCUDA.

📖 News 🔥🔥

[2025-06-16]: DBCache: Dual Block Caching is released! A training-free, UNet-style cache acceleration method for Diffusion Transformers. Feel free to give it a try!

©️Citations

@misc{Awesome-LLM-Inference@2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/xlite-dev/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/xlite-dev/Awesome-LLM-Inference},
  author={xlite-dev, liyucheng09 etc},
  year={2024}
}

🎉Awesome LLM Inference Papers with Codes

Awesome LLM Inference for Beginners.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.

🎉Download All PDFs

python3 download_pdfs.py # The code is generated by Doubao AI
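
The internals of download_pdfs.py are not shown here, so the following is only a minimal sketch of how a script like it could collect PDF links from the README's Markdown source. It assumes papers are linked as inline Markdown links and that a PDF URL either ends in `.pdf` or contains a `/pdf/` path segment. The names `extract_pdf_links` and `target_filename` are hypothetical and not taken from the repository.

```python
import re
from urllib.parse import urlparse

# Matches the "](url)" tail of an inline Markdown link, so it also works
# for labels that themselves contain brackets, e.g. [[pdf]](url).
LINK_TARGET = re.compile(r"\]\((https?://[^)\s]+)\)")


def extract_pdf_links(markdown_text):
    """Return unique URLs that look like PDFs, in first-seen order."""
    seen, urls = set(), []
    for url in LINK_TARGET.findall(markdown_text):
        path = urlparse(url).path.lower()
        # Assumption: a PDF link ends in ".pdf" or has a "/pdf/" segment.
        looks_like_pdf = path.endswith(".pdf") or "/pdf/" in path
        if looks_like_pdf and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def target_filename(url):
    """Derive a local file name from the last path segment of the URL."""
    name = urlparse(url).path.rstrip("/").split("/")[-1]
    return name if name.lower().endswith(".pdf") else name + ".pdf"
```

Each returned URL could then be fetched with a standard downloader such as `urllib.request.urlretrieve(url, target_filename(url))`. Keeping the link extraction separate from the network step makes the parsing easy to check offline.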

📖Contents

📖Trending LLM/VLM Topics🔥🔥🔥
📖DeepSeek/MLA Topics🔥🔥🔥
📖Multi-GPUs/Multi-Nodes Parallelism🔥🔥🔥
📖Disaggregating Prefill and Decoding🔥🔥🔥
📖LLM Algorithmic/Eval Survey
📖LLM Train/Inference Framework/Design
📖Weight/Activation Quantize/Compress🔥
📖Continuous/In-flight Batching
📖IO/FLOPs-Aware/Sparse Attention🔥
📖KV Cache Scheduling/Quantize/Dropping🔥
📖Prompt/Context Compression🔥
📖Long Context Attention/KV Cache Optimization🔥🔥
📖Early-Exit/Intermediate Layer Decoding
📖Parallel Decoding/Sampling🔥
📖Structured Prune/KD/Weight Sparse
📖Mixture-of-Experts(MoE) LLM Inference🔥
📖CPU/NPU/FPGA/Mobile Inference
📖Non Transformer Architecture🔥
📖GEMM/Tensor Cores/WMMA/Parallel
📖VLM/Position Embed/Others

📖Trending LLM/VLM Topics (©️back👆🏻)

Date	Title	Paper	Code	Recom
2024.04	🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech)	[docs]	[Open-Sora]	⭐️⭐️
2024.04	🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aim to reproduce Sora (Open AI T2V model)(@PKU)	[report]	[Open-Sora-Plan]	⭐️⭐️
2024.05	🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)	[pdf]	[DeepSeek-V2]	⭐️⭐️
2024.05	🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)	[pdf]	[unilm-YOCO]	⭐️⭐️
2024.06	🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI)	[pdf]	[Mooncake]	⭐️⭐️
2024.07	🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc)	[pdf]	[flash-attention]	⭐️⭐️
2024.07	🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft)	[pdf]	[MInference 1.0]	⭐️⭐️
2024.11	🔥🔥🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA)	[pdf]	[Star-Attention]	⭐️⭐️
2024.12	🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai)	[pdf]	[DeepSeek-V3]	⭐️⭐️
2025.01	🔥🔥🔥[MiniMax-Text-01] MiniMax-01: Scaling Foundation Models with Lightning Attention	[report]	[MiniMax-01]	⭐️⭐️
2025.01	🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai)	[pdf]	[DeepSeek-R1]	⭐️⭐️

📖DeepSeek/Multi-head Latent Attention(MLA) (©️back👆🏻)

Date	Title	Paper	Code	Recom
2024.05	🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI)	[pdf]	[DeepSeek-V2]	⭐️⭐️
2024.12	🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai)	[pdf]	[DeepSeek-V3]	⭐️⭐️
2025.01	🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai)	[pdf]	[DeepSeek-R1]	⭐️⭐️
2025.02	🔥🔥🔥[DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(@deepseek-ai)	[pdf]	⚠️	⭐️⭐️
2025.02	🔥🔥🔥[FlashMLA] DeepSeek FlashMLA(@deepseek-ai)	⚠️	[FlashMLA]	⭐️⭐️
2025.02	🔥🔥🔥[DualPipe] DeepSeek DualPipe(@deepseek-ai)	⚠️	[DualPipe]	⭐️⭐️
2025.02	🔥🔥🔥[DeepEP] DeepSeek DeepEP(@deepseek-ai)	⚠️	[DeepEP]	⭐️⭐️
2025.02	🔥🔥🔥[DeepGEMM] DeepSeek DeepGEMM(@deepseek-ai)	⚠️	[DeepGEMM]	⭐️⭐️
2025.02	🔥🔥🔥[EPLB] DeepSeek EPLB(@deepseek-ai)	⚠️	[EPLB]	⭐️⭐️
2025.02	🔥🔥🔥[3FS] DeepSeek 3FS(@deepseek-ai)	⚠️	[3FS]	⭐️⭐️
2025.03	🔥🔥🔥[Inference System] DeepSeek-V3 / R1 Inference System Overview (@deepseek-ai)	[blog]	⚠️	⭐️⭐️
2025.02	🔥🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs(@fudan.edu.cn)	[pdf]	[MHA2MLA]	⭐️⭐️
2025.02	🔥🔥[TransMLA] TransMLA: Multi-head Latent Attention Is All You Need(@PKU)	[pdf]	[TransMLA]	⭐️⭐️
2025.03	🔥🔥[X-EcoMLA] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(@AMD)	[pdf]	⚠️	⭐️⭐️

📖Multi-GPUs/Multi-Nodes Parallelism (©️back👆🏻)

Date	Title	Paper	Code	Recom
2019.10	🔥🔥[MP: ZeRO] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com)	[pdf]	[deepspeed]	⭐️⭐️
2020.05	🔥🔥[TP: Megatron-LM] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA)	[pdf]	[Megatron-LM]	⭐️⭐️
2022.05	🔥🔥[SP: Megatron-LM] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA)	[pdf]	[Megatron-LM]	⭐️⭐️
2023.05	🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley)	[pdf]	[RingAttention]	⭐️⭐️
2023.10	🔥🔥[SP: Ring Attention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley)	[pdf]	[RingAttention]	⭐️⭐️
2023.11	🔥🔥[SP: Striped Attention] Striped Attention: Faster Ring Attention for Causal Transformers(@MIT etc)	[pdf]	[striped_attention]	⭐️⭐️
2023.10	🔥🔥[SP: DeepSpeed Ulysses] DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models(@microsoft.com)	[pdf]	[deepspeed]	⭐️⭐️
2024.03	🔥🔥[CP: Megatron-LM] Megatron-LM: Context parallelism overview(@NVIDIA)	[docs]	[Megatron-LM]	⭐️⭐️
2024.05	🔥🔥[SP: Unified Sequence Parallel (USP)] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent)	[pdf]	[long-context-attention]	⭐️⭐️
2024.11	🔥🔥[CP: Meta] Context Parallelis