github.com/mlcommons/training @v4.0 sqlite

5,970 symbols · 20,155 edges · 691 files · 2,103 documented (35%)
README

MLPerf™ Training Reference Implementations

This is a repository of reference implementations for the MLPerf training benchmarks. These implementations are valid as starting points for benchmark implementations but are not fully optimized and are not intended to be used for "real" performance measurements of software frameworks or hardware.

Please see the MLPerf Training Benchmark paper for a detailed description of the motivation and guiding principles behind the benchmark suite. If you use any part of this benchmark (e.g., reference implementations, submissions, etc.) in academic work, please cite the following:

@misc{mattson2019mlperf,
    title={MLPerf Training Benchmark},
    author={Peter Mattson and Christine Cheng and Cody Coleman and Greg Diamos and Paulius Micikevicius and David Patterson and Hanlin Tang and Gu-Yeon Wei and Peter Bailis and Victor Bittorf and David Brooks and Dehao Chen and Debojyoti Dutta and Udit Gupta and Kim Hazelwood and Andrew Hock and Xinyuan Huang and Atsushi Ike and Bill Jia and Daniel Kang and David Kanter and Naveen Kumar and Jeffery Liao and Guokai Ma and Deepak Narayanan and Tayo Oguntebi and Gennady Pekhimenko and Lillian Pentecost and Vijay Janapa Reddi and Taylor Robie and Tom St. John and Tsuguchika Tabaru and Carole-Jean Wu and Lingjie Xu and Masafumi Yamazaki and Cliff Young and Matei Zaharia},
    year={2019},
    eprint={1910.01500},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

These reference implementations are still very much "alpha" or "beta" quality. They could be improved in many ways. Please file issues or pull requests to help us improve quality.

Contents

We provide reference implementations for benchmarks in the MLPerf suite, as well as several benchmarks under development.

Each reference implementation provides the following:

  • Code that implements the model in at least one framework.
  • A Dockerfile which can be used to run the benchmark in a container.
  • A script which downloads the appropriate dataset.
  • A script which runs and times training the model.
  • Documentation on the dataset, model, and machine setup.

Running Benchmarks

Follow the instructions in the README of each benchmark. Generally, a benchmark can be run with the following steps:

  1. Set up Docker and dependencies. A shared script (install_cuda_docker.sh) does this. Some benchmarks need additional setup, described in their READMEs.
  2. Download the dataset using ./download_dataset.sh. Run it outside of Docker, on your host machine, and from the directory that contains it (it may make assumptions about the current working directory).
  3. Optionally, run verify_dataset.sh to ensure the dataset was successfully downloaded.
  4. Build and run the Docker image. The command to do this is included with each benchmark.
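
The host-side steps above can be sketched as a small dry-run planner. This is only an illustration under stated assumptions: the function name plan_benchmark_steps is hypothetical, the location of install_cuda_docker.sh at the repository root is assumed, and not every benchmark directory necessarily ships both dataset scripts. Step 4 is left out on purpose because the Docker command differs per benchmark.

```python
from pathlib import Path


def plan_benchmark_steps(repo_root, benchmark, verify=True):
    """Return (working_directory, command) pairs for steps 1-3.

    Hypothetical helper for illustration; it only plans commands and
    never executes them.
    """
    repo_root = Path(repo_root)
    bench_dir = repo_root / benchmark
    steps = [
        # Step 1: shared setup script named in the README.
        # Assumption: it lives at the repository root.
        (repo_root, ["bash", "install_cuda_docker.sh"]),
        # Step 2: run on the host, from the script's own directory,
        # because the script may depend on the current working directory.
        (bench_dir, ["bash", "download_dataset.sh"]),
    ]
    if verify:
        # Step 3 (optional): check that the download succeeded.
        steps.append((bench_dir, ["bash", "verify_dataset.sh"]))
    # Step 4 (build and run the Docker image) is benchmark-specific,
    # so it is intentionally not modeled here.
    return steps


# Illustrative directory name; check the benchmark's own README first.
for cwd, cmd in plan_benchmark_steps("training", "image_segmentation/pytorch"):
    print(f"(cd {cwd} && {' '.join(cmd)})")
```

To actually execute a planned step, each pair could be passed to subprocess.run(cmd, cwd=cwd, check=True), which keeps the working directory explicit for every script.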

Each benchmark will run until the target quality is reached and then stop, printing timing results.
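
The "run until the target quality is reached, then print timing" behavior can be sketched as a simple loop. This is a minimal sketch, not the code used by any reference implementation: train_until_target, train_one_epoch, and evaluate are hypothetical names, the quality metric is assumed to be higher-is-better, and evaluation is assumed to happen once per epoch. Real benchmarks define their own metric, evaluation cadence, and logging.

```python
import time


def train_until_target(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Train until evaluate() reaches target_quality or max_epochs is hit.

    Returns (epochs_run, elapsed_seconds, last_quality, reached_target).
    """
    start = time.perf_counter()
    epochs_run = 0
    quality = float("-inf")
    reached = False
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        epochs_run = epoch
        quality = evaluate()
        if quality >= target_quality:
            # Stop as soon as the quality target is met.
            reached = True
            break
    elapsed = time.perf_counter() - start
    return epochs_run, elapsed, quality, reached


# Toy stand-in for a model: quality rises by 0.25 per epoch.
state = {"quality": 0.0}


def fake_train(epoch):
    state["quality"] += 0.25


epochs, seconds, quality, reached = train_until_target(
    fake_train, lambda: state["quality"], target_quality=0.75
)
print(f"epochs={epochs} quality={quality} reached={reached} time={seconds:.6f}s")
```

The max_epochs cap is a design choice of this sketch so that a run that never converges still terminates and reports that the target was not reached.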

Some of these benchmarks take a long time to run on the reference hardware. We expect to see significant performance improvements with more hardware and optimized implementations.

MLPerf Training v4.0 (Submission Deadline May 10, 2024)

*Framework here is given for the reference implementation. Submitters are free to use their own frameworks to run the benchmark.

model              | reference implementation            | framework         | dataset
resnet50v1.5       | vision/classification_and_detection | tensorflow2       | Imagenet
RetinaNet          | vision/object detection             | pytorch           | OpenImages
3DUnet             | vision/image segmentation           | pytorch           | KiTS19
Stable Diffusionv2 | image generation                    | pytorch           | LAION-400M-filtered
BERT-large         | language/nlp                        | tensorflow        | Wikipedia 2020/01/01
GPT3               | language/llm                        | paxml,megatron-lm | C4
LLama2 70B-LoRA    | language/LLM fine-tuning            | pytorch           | SCROLLS govtReport
DLRMv2             | recommendation                      | torchrec          | Criteo 4TB multi-hot
RGAT               | GNN                                 | pytorch           | IGBFull

Extension points exported contracts — how you extend this code

Annotation (Interface)
(no doc)
retired_benchmarks/minigo/tensorflow/minigo/minigui/position.ts
Definition (Interface)
(no doc)
retired_benchmarks/minigo/tensorflow/minigo/minigui/position.ts
Variation (Interface)
(no doc)
retired_benchmarks/minigo/tensorflow/minigo/minigui/position.ts
TreeStats (Interface)
(no doc)
retired_benchmarks/minigo/tensorflow/minigo/minigui/position.ts
Update (Interface)
(no doc)
retired_benchmarks/minigo/tensorflow/minigo/minigui/position.ts

Core symbols most depended-on inside this repo

append
called by 950
image_segmentation/pytorch/preprocess_dataset.py
print
called by 881
single_stage_detector/ssd/utils.py
info
called by 366
retired_benchmarks/ssd-v1/ssd/coco.py
print_rank_0
called by 242
large_language_model/megatron-lm/megatron/utils.py
size
called by 192
large_language_model/megatron-lm/megatron/data/indexed_dataset.py
to
called by 140
single_stage_detector/ssd/model/image_list.py
max
called by 135
single_stage_detector/ssd/utils.py
get
called by 134
large_language_model/megatron-lm/megatron/model/distributed.py

Shape

Method 2,969
Function 2,152
Class 831
Interface 12
Enum 3
Route 3

Languages

Python 90%
TypeScript 10%

Modules by API surface

large_language_model/megatron-lm/megatron/data/indexed_dataset.py · 71 symbols
stable_diffusion/ldm/models/diffusion/ddpm.py · 68 symbols
retired_benchmarks/minigo/tensorflow/minigo/minigui/static/layer.js · 65 symbols
retired_benchmarks/minigo/tensorflow/minigo/minigui/layer.ts · 53 symbols
large_language_model/megatron-lm/megatron/optimizer/optimizer.py · 53 symbols
stable_diffusion/ldm/modules/diffusionmodules/model.py · 52 symbols
retired_benchmarks/transformer/tensorflow/bert/run_classifier.py · 49 symbols
retired_benchmarks/resnet-tf1/official/resnet/imagenet_test.py · 45 symbols
large_language_model/paxml/c4.py · 44 symbols
large_language_model/megatron-lm/megatron/tokenizer/tokenizer.py · 42 symbols
retired_benchmarks/ssd-v1/ssd/utils.py · 41 symbols
large_language_model/megatron-lm/megatron/mpu/mappings.py · 41 symbols

Dependencies from manifests, versioned

Cython 0.28.4 · 1×
Markdown 2.6.11 · 1×
Pillow 5.2.0 · 1×
Werkzeug 0.14.1 · 1×
absl-py 0.2.0 · 1×
accelerate 0.27.2 · 1×
albumentations 1.3.0 · 1×
astor 0.6.2 · 1×
autopep8 1.3 · 1×
bitsandbytes 0.37.2 · 1×
bleach 1.5.0 · 1×
cachetools 2.0.1 · 1×

For agents

$ claude mcp add training \
  -- python -m otcore.mcp_server <graph>
