
github.com/xlite-dev/Awesome-LLM-Inference @v2.6.20 sqlite

3 symbols · 10 edges · 1 file · 2 documented (67%)
README

📒Introduction

Awesome-LLM-Inference: A curated list of 📙Awesome LLM Inference Papers with Codes. For Awesome Diffusion Inference, please check 📖Awesome-DiT-Inference. For CUDA learning notes, please check 📖LeetCUDA.

📖 News 🔥🔥

  • [2025-06-16]: DBCache: Dual Block Caching is released! A training-free UNet-style cache acceleration for Diffusion Transformers. Feel free to give it a try!

©️Citations

@misc{Awesome-LLM-Inference@2024,
  title={Awesome-LLM-Inference: A curated list of Awesome LLM Inference Papers with codes},
  url={https://github.com/xlite-dev/Awesome-LLM-Inference},
  note={Open-source software available at https://github.com/xlite-dev/Awesome-LLM-Inference},
  author={xlite-dev, liyucheng09 etc},
  year={2024}
}

🎉Awesome LLM Inference Papers with Codes

Awesome LLM Inference for Beginners.pdf: 500 pages, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), PagedAttention, RoPE, SmoothQuant, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ etc.

🎉Download All PDFs

python3 download_pdfs.py # The code is generated by Doubao AI
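
The page only names the three functions in download_pdfs.py (extract_pdf_info, download_file, main); their bodies are not shown. The following is a minimal sketch of how such a script could work, not the repository's actual code. The link pattern, the "looks like a PDF" rule, the README.md input path, and the pdfs output directory are all assumptions made for illustration.

```python
import re
import urllib.request
from pathlib import Path

# Assumed pattern: plain markdown links of the form [label](http...).
MD_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_pdf_info(markdown_text):
    """Return (label, url) pairs for links that look like PDFs (assumed rule)."""
    results = []
    for label, url in MD_LINK.findall(markdown_text):
        if url.lower().endswith(".pdf") or "arxiv.org/pdf/" in url:
            results.append((label, url))
    return results


def download_file(url, dest_dir, timeout=30.0):
    """Download url into dest_dir; skip the request if the file already exists."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    target = dest_dir / name
    if target.exists():
        return target
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target


def main(readme_path="README.md", dest_dir="pdfs"):
    """Scan the README for PDF links and fetch each one (hypothetical defaults)."""
    text = Path(readme_path).read_text(encoding="utf-8")
    for label, url in extract_pdf_info(text):
        try:
            print("saved", download_file(url, dest_dir))
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print("skipped", url, err)
```

Calling main() from the repository root would then walk the README and save each matching link. Skipping files that already exist keeps reruns cheap when a long download is interrupted.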


📖Contents

📖Trending LLM/VLM Topics

| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2024.04 | 🔥🔥🔥[Open-Sora] Open-Sora: Democratizing Efficient Video Production for All(@hpcaitech) | [docs] | [Open-Sora] | ⭐️⭐️ |
| 2024.04 | 🔥🔥🔥[Open-Sora Plan] Open-Sora Plan: This project aim to reproduce Sora (Open AI T2V model)(@PKU) | [report] | [Open-Sora-Plan] | ⭐️⭐️ |
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️ |
| 2024.05 | 🔥🔥[YOCO] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft) | [pdf] | [unilm-YOCO] | ⭐️⭐️ |
| 2024.06 | 🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving(@Moonshot AI) | [pdf] | [Mooncake] | ⭐️⭐️ |
| 2024.07 | 🔥🔥[FlashAttention-3] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision(@TriDao etc) | [pdf] | [flash-attention] | ⭐️⭐️ |
| 2024.07 | 🔥🔥[MInference 1.0] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention(@Microsoft) | [pdf] | [MInference 1.0] | ⭐️⭐️ |
| 2024.11 | 🔥🔥🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences(@NVIDIA) | [pdf] | [Star-Attention] | ⭐️⭐️ |
| 2024.12 | 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-V3] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[MiniMax-Text-01] MiniMax-01: Scaling Foundation Models with Lightning Attention | [report] | [MiniMax-01] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-R1] | ⭐️⭐️ |

📖DeepSeek/Multi-head Latent Attention(MLA)

| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2024.05 | 🔥🔥🔥[DeepSeek-V2] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model(@DeepSeek-AI) | [pdf] | [DeepSeek-V2] | ⭐️⭐️ |
| 2024.12 | 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-V3] | ⭐️⭐️ |
| 2025.01 | 🔥🔥🔥[DeepSeek-R1] DeepSeek-R1 Technical Report(@deepseek-ai) | [pdf] | [DeepSeek-R1] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepSeek-NSA] Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention(@deepseek-ai) | [pdf] | ⚠️ | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[FlashMLA] DeepSeek FlashMLA(@deepseek-ai) | ⚠️ | [FlashMLA] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DualPipe] DeepSeek DualPipe(@deepseek-ai) | ⚠️ | [DualPipe] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepEP] DeepSeek DeepEP(@deepseek-ai) | ⚠️ | [DeepEP] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[DeepGEMM] DeepSeek DeepGEMM(@deepseek-ai) | ⚠️ | [DeepGEMM] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[EPLB] DeepSeek EPLB(@deepseek-ai) | ⚠️ | [EPLB] | ⭐️⭐️ |
| 2025.02 | 🔥🔥🔥[3FS] DeepSeek 3FS(@deepseek-ai) | ⚠️ | [3FS] | ⭐️⭐️ |
| 2025.03 | 🔥🔥🔥[Inference System] DeepSeek-V3 / R1 Inference System Overview(@deepseek-ai) | [blog] | ⚠️ | ⭐️⭐️ |
| 2025.02 | 🔥🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs(@fudan.edu.cn) | [pdf] | [MHA2MLA] | ⭐️⭐️ |
| 2025.02 | 🔥🔥[TransMLA] TransMLA: Multi-head Latent Attention Is All You Need(@PKU) | [pdf] | [TransMLA] | ⭐️⭐️ |
| 2025.03 | 🔥🔥[X-EcoMLA] X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression(@AMD) | [pdf] | ⚠️ | ⭐️⭐️ |

📖Multi-GPUs/Multi-Nodes Parallelism

| Date | Title | Paper | Code | Recom |
|---|---|---|---|---|
| 2019.10 | 🔥🔥[MP: ZeRO] DeepSpeed-ZeRO: Memory Optimizations Toward Training Trillion Parameter Models(@microsoft.com) | [pdf] | [deepspeed] | ⭐️⭐️ |
| 2020.05 | 🔥🔥[TP: Megatron-LM] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
| 2022.05 | 🔥🔥[SP: Megatron-LM] Megatron-LM: Reducing Activation Recomputation in Large Transformer Models(@NVIDIA) | [pdf] | [Megatron-LM] | ⭐️⭐️ |
| 2023.05 | 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models(@UC Berkeley) | [pdf] | [RingAttention] | ⭐️⭐️ |
| 2023.10 | 🔥🔥[SP: Ring Attention] Ring Attention with Blockwise Transformers for Near-Infinite Context(@UC Berkeley) | [pdf] | [RingAttention] | ⭐️⭐️ |
| 2023.11 | 🔥🔥[SP: STRIPED ATTENTION] STRIPED ATTENTION: FASTER RING ATTENTION FOR CAUSAL TRANSFORMERS(@MIT etc) | [pdf] | [striped_attention] | ⭐️⭐️ |
| 2023.10 | 🔥🔥[SP: DEEPSPEED ULYSSES] DEEPSPEED ULYSSES: SYSTEM OPTIMIZATIONS FOR ENABLING TRAINING OF EXTREME LONG SEQUENCE TRANSFORMER MODELS(@microsoft.com) | [pdf] | [deepspeed] | ⭐️⭐️ |
| 2024.03 | 🔥🔥[CP: Megatron-LM] Megatron-LM: Context parallelism overview(@NVIDIA) | [docs] | [Megatron-LM] | ⭐️⭐️ |
| 2024.05 | 🔥🔥[SP: Unified Sequence Parallel (USP)] YunChang: A Unified Sequence Parallel (USP) Attention for Long Context LLM Model Training and Inference(@Tencent) | [pdf] | [long-context-attention] | ⭐️⭐️ |
| 2024.11 | 🔥🔥[CP: Meta] Context Parallelis | | | |

Core symbols most depended-on inside this repo

  • extract_pdf_info (download_pdfs.py): called by 1
  • download_file (download_pdfs.py): called by 1
  • main (download_pdfs.py): called by 1

Shape

Functions: 3

Languages

Python 100%

Modules by API surface

download_pdfs.py: 3 symbols

For agents

$ claude mcp add Awesome-LLM-Inference \
  -- python -m otcore.mcp_server <graph>
