github.com/deepspeedai/DeepSpeed @v0.19.2 sqlite

10,482 symbols · 42,289 edges · 1,030 files · 1,920 documented (18%)

README

Office Hours

DeepSpeed hosts regular office hours on the last Tuesday of each month at 12:00 America/New_York to discuss development plans, features, etc. This meeting is public for anyone to join and ask questions. The meeting is hosted on Zoom and can be joined here.

Latest News

[2026/05] Using Muon Optimizer with DeepSpeed
[2026/05] System DMA (SDMA) for ZeRO-3: offload collectives off compute units on AMD GPUs for better overlap
[2026/03] DeepSpeed Team gave a tutorial at ASPLOS 2026 titled "Building Efficient Large-Scale Model Systems with DeepSpeed: From Open-Source Foundations to Emerging Research"
[2026/03] Our SuperOffload work received an Honorable Mention for the ASPLOS 2026 Best Paper Award
[2025/12] DeepSpeed Core API updates: PyTorch-style backward and low-precision master states
[2025/11] DeepSpeed ZeRO++ powers large-scale distillation training of LLMs for Recommendation Systems at LinkedIn
[2025/10] We hosted the Ray x DeepSpeed Meetup at Anyscale. We shared our most recent work on SuperOffload, ZenFlow, Muon Optimizer Support, Arctic Long Sequence Training and DeepCompile. Please find the meetup slides here.
[2025/10] SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
[2025/10] Study of ZenFlow and ZeRO offload performance with DeepSpeed CPU core binding
[2025/08] ZenFlow: Stall-Free Offloading Engine for LLM Training
[2025/06] Arctic Long Sequence Training (ALST) with DeepSpeed: Scalable And Efficient Training For Multi-Million Token Sequences
[2025/06] DeepNVMe: Affordable I/O scaling for Deep Learning Applications

Extreme Speed and Scale for DL Training

DeepSpeed enabled the world's most powerful language models (at the time of this writing), such as MT-530B and BLOOM. DeepSpeed offers a confluence of system innovations that has made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. These innovations include ZeRO, ZeRO-Infinity, 3D-Parallelism, Ulysses Sequence Parallelism, DeepSpeed-MoE, etc.

DeepSpeed Adoption

DeepSpeed was an important part of Microsoft’s AI at Scale initiative to enable next-generation AI capabilities at scale; you can find more information here.

DeepSpeed has been used to train many different large-scale models. Below is a list of several examples that we are aware of (if you'd like to include your model, please submit a PR):

DeepSpeed has been integrated with several popular open-source DL frameworks, such as:

	Documentation
	Transformers with DeepSpeed
	Accelerate with DeepSpeed
	Lightning with DeepSpeed
	MosaicML with DeepSpeed
	Determined with DeepSpeed
	MMEngine with DeepSpeed

Build Pipeline Status

Description	Status
NVIDIA
AMD
CPU
Intel Gaudi
Intel XPU
Integrations
Misc
Huawei Ascend NPU

Installation

The quickest way to get started with DeepSpeed is via pip, which installs the latest release of DeepSpeed and is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Requirements

PyTorch must be installed before installing DeepSpeed.
For full feature support we recommend a version of PyTorch that is >= 2.0 and ideally the latest PyTorch stable release.
A CUDA or ROCm compiler such as nvcc or hipcc is required to compile the C++/CUDA/HIP extensions.
The specific GPUs we develop and test against are listed below. Your GPU may still work if it is not on this list; these are simply the architectures DeepSpeed is most thoroughly tested on:
NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
AMD: MI100 and MI200
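
The requirements above can be checked before installing. The following is a minimal sketch, not part of DeepSpeed: the helper names `parse_major_minor` and `check_prerequisites` are hypothetical, and looking for `nvcc`, `hipcc`, and `ninja` on `PATH` is an assumption about how your toolchain is set up (a compiler installed elsewhere would be reported as missing).

```python
import importlib.metadata
import importlib.util
import re
import shutil


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '2.3.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_prerequisites(min_torch=(2, 0)):
    """Report which of the listed requirements appear to be satisfied."""
    report = {
        # PyTorch must be installed before DeepSpeed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_at_least_recommended": False,
        # Either compiler is enough: nvcc for CUDA, hipcc for ROCm.
        "gpu_compiler_on_path": any(
            shutil.which(name) is not None for name in ("nvcc", "hipcc")
        ),
        # The JIT extension loader relies on ninja.
        "ninja_on_path": shutil.which("ninja") is not None,
    }
    if report["torch_installed"]:
        try:
            installed = parse_major_minor(importlib.metadata.version("torch"))
            report["torch_at_least_recommended"] = installed >= tuple(min_torch)
        except (importlib.metadata.PackageNotFoundError, ValueError):
            # Version metadata is unavailable or unparsable; leave the flag False.
            pass
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A missing compiler or ninja does not necessarily block installation, since the ops are built at runtime, but it is a likely cause of failures the first time an op is loaded.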

Contributed HW support

DeepSpeed now supports various HW accelerators.

Contributor	Hardware	Accelerator Name	Contributor validated	Upstream validated
Huawei	Huawei Ascend NPU	npu	Yes	No
Intel	Intel(R) Gaudi(R) 2 AI accelerator	hpu	Yes	Yes
Intel	Intel(R) Xeon(R) Processors	cpu	Yes	Yes
Intel	Intel(R) Data Center GPU Max series	xpu	Yes	Yes
Tecorigin	Scalable Data Analytics Accelerator	sdaa	Yes	No
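
To relate the accelerator names in the table to a running environment, here is a small sketch. The dictionary simply restates the table; `describe_accelerator` and `active_accelerator_name` are hypothetical helpers. The sketch assumes that `get_accelerator` is importable from `deepspeed.accelerator` and that the returned object exposes `device_name()`, which matches the symbols listed for `accelerator/real_accelerator.py` and the accelerator modules, but treat the exact return value as something to verify on your platform.

```python
from typing import Optional

# Accelerator names and hardware copied from the table above.
CONTRIBUTED_ACCELERATORS = {
    "npu": "Huawei Ascend NPU",
    "hpu": "Intel(R) Gaudi(R) 2 AI accelerator",
    "cpu": "Intel(R) Xeon(R) Processors",
    "xpu": "Intel(R) Data Center GPU Max series",
    "sdaa": "Scalable Data Analytics Accelerator",
}


def describe_accelerator(name: str) -> str:
    """Map an accelerator name to the hardware listed in the table."""
    return CONTRIBUTED_ACCELERATORS.get(name, f"not in the contributed table: {name}")


def active_accelerator_name() -> Optional[str]:
    """Return the device name DeepSpeed reports, or None if DeepSpeed is absent."""
    try:
        # Imported lazily so this file still runs without DeepSpeed installed.
        from deepspeed.accelerator import get_accelerator
    except ImportError:
        return None
    return get_accelerator().device_name()


if __name__ == "__main__":
    for accel in ("hpu", "xpu", "cuda"):
        print(f"{accel}: {describe_accelerator(accel)}")
```

Calling `active_accelerator_name()` on a machine with DeepSpeed installed and passing the result to `describe_accelerator` shows whether the detected backend is one of the contributed accelerators or a natively supported one.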

PyPI

We regularly push releases to PyPI and encourage users to install from there in most cases.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report
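
The same check can be scripted, for example in a CI job. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes `ds_report` is installed as an executable on `PATH`, as the command above implies. It only captures the report text and does not try to parse it.

```python
import importlib.util
import shutil
import subprocess
from typing import Optional


def deepspeed_install_status() -> dict:
    """Check whether the package and its report command are discoverable."""
    return {
        "package_importable": importlib.util.find_spec("deepspeed") is not None,
        "ds_report_on_path": shutil.which("ds_report") is not None,
    }


def run_ds_report(timeout_s: float = 300.0) -> Optional[str]:
    """Run ds_report and return its standard output, or None if it is not found."""
    exe = shutil.which("ds_report")
    if exe is None:
        return None
    result = subprocess.run(
        [exe], capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return result.stdout


if __name__ == "__main__":
    print(deepspeed_install_status())
    report = run_ds_report()
    print(report if report is not None else "ds_report not found on PATH")
```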

If you would like to

Core symbols (most depended-on inside this repo)

Symbol	Called by	File
get_accelerator	1649	accelerator/real_accelerator.py
append	1248	deepspeed/utils/comms_logging.py
	856	deepspeed/ops/fp_quantizer/quantize.py
numel	519	deepspeed/runtime/swap_tensor/optimizer_utils.py
parameters	423	deepspeed/inference/v2/checkpoint/base_engine.py
device_name	357	accelerator/hpu_accelerator.py
initialize	318	deepspeed/inference/v2/inference_parameter.py
reshape	311	deepspeed/checkpoint/reshape_3d_utils.py

Shape

Method 6,473

Function 2,504

Class 1,473

Route 32

Languages

Python 100%

Modules by API surface

deepspeed/runtime/engine.py · 311 symbols

deepspeed/runtime/zero/stage3.py · 169 symbols

deepspeed/runtime/zero/stage_1_and_2.py · 145 symbols

deepspeed/runtime/zero/partition_parameters.py · 119 symbols

deepspeed/module_inject/layers.py · 114 symbols

tests/unit/v1/zero/test_zero.py · 113 symbols

deepspeed/runtime/utils.py · 83 symbols

deepspeed/runtime/data_pipeline/data_sampling/indexed_dataset.py · 77 symbols

deepspeed/runtime/config.py · 76 symbols

deepspeed/profiling/flops_profiler/profiler.py · 75 symbols

accelerator/cuda_accelerator.py · 74 symbols

accelerator/hpu_accelerator.py · 72 symbols

Used by 15 indexed graphs (manifest dependencies, hub-wide)

github.com/2U1/Qwen-VL-Series-Finetune

github.com/FunAudioLLM/CosyVoice

github.com/LAION-AI/Open-Assistant

github.com/LlamaChinese/Llama-Chinese

github.com/Netflix/void-model

github.com/OpenBMB/ToolBench

github.com/PKU-YuanGroup/Helios

github.com/PrimeIntellect-ai/verifiers

github.com/Yuliang-Liu/Monkey

github.com/baichuan-inc/Baichuan-7B

… +5 more

Dependencies (from manifests, versioned)

autodoc_pydantic 2.0.0 · 1×

clang-format 18.1.3 · 1×

comet_ml 3.41.0 · 1×

diffusers 0.25.0 · 1×

importlib-metadata 4 · 1×

lm-eval 0.3.0 · 1×

neural-compressor 2.1.0 · 1×

packaging 20.0 · 1×

pre-commit 3.2.0 · 1×

pydantic 2.0.0 · 1×

pytest 7.2.0 · 1×

qtorch 0.3.0 · 1×

For agents

$ claude mcp add DeepSpeed \
  -- python -m otcore.mcp_server <graph>
