github.com/ashleve/lightning-hydra-template @v2.0.3 sqlite

64 symbols 335 edges 29 files 63 documented · 98%

README

Lightning-Hydra-Template

A clean template to kickstart your deep learning project 🚀⚡🔥

Click on "Use this template" to initialize a new repository.

Suggestions are always welcome!

📌 Introduction

Why you might want to use it:

✅ Save on boilerplate

Easily add new models, datasets, tasks, experiments, and train on different accelerators, like multi-GPU, TPU or SLURM clusters.

✅ Education

Thoroughly commented. You can use this repo as a learning resource.

✅ Reusability

Collection of useful MLOps tools, configs, and code snippets. You can use this repo as a reference for various utilities.

Why you might not want to use it:

❌ Things break from time to time

Lightning and Hydra are still evolving and integrate many libraries, which means sometimes things break. For the list of currently known problems visit this page.

❌ Not adjusted for data engineering

The template is not well suited for building data pipelines that depend on each other. It works best for model prototyping on ready-to-use data.

❌ Overfitted to simple use case

The configuration setup is built with simple Lightning training in mind. You might need to put in some effort to adapt it to other use cases, e.g. Lightning Fabric.

❌ Might not support your workflow

For example, you can't resume hydra-based multirun or hyperparameter search.

Note: Keep in mind this is an unofficial community project.

Main Technologies

PyTorch Lightning - a lightweight PyTorch wrapper for high-performance AI research. Think of it as a framework for organizing your PyTorch code.

Hydra - a framework for elegantly configuring complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.
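To make "hierarchical configuration by composition" concrete, here is a minimal sketch in plain Python. It is not Hydra's implementation: the `merge` helper and the three config dictionaries are hypothetical stand-ins for config groups such as `configs/trainer` and `configs/model`, and the only idea shown is that later groups win over earlier ones key by key.

```python
from copy import deepcopy


def merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical config groups, loosely modelled on the template's layout.
trainer_default = {"trainer": {"max_epochs": 10, "accelerator": "cpu", "devices": 1}}
trainer_gpu = {"trainer": {"accelerator": "gpu"}}
model_mnist = {"model": {"optimizer": {"lr": 0.001}}}

# Compose the final config by merging the groups in order.
cfg = {}
for group in (trainer_default, trainer_gpu, model_mnist):
    cfg = merge(cfg, group)

print(cfg)
```

Only `trainer.accelerator` changes when the GPU group is merged in; the other trainer keys keep their defaults, which is what makes small, composable config files practical.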

Main Ideas

Rapid Experimentation: thanks to hydra command line superpowers
Minimal Boilerplate: thanks to automating pipelines with config instantiation
Main Configs: allow you to specify default training configuration
Experiment Configs: allow you to override chosen hyperparameters and version control experiments
Workflow: comes down to 4 simple steps
Experiment Tracking: Tensorboard, W&B, Neptune, Comet, MLFlow and CSVLogger
Logs: all logs (checkpoints, configs, etc.) are stored in a dynamically generated folder structure
Hyperparameter Search: simple search is effortless with Hydra plugins like Optuna Sweeper
Tests: generic, easy-to-adapt smoke tests for speeding up the development
Continuous Integration: automatically test and lint your repo with Github Actions
Best Practices: a couple of recommended tools, practices and standards
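The "config instantiation" idea from the list above can be sketched with the standard library alone. This is an illustration of the concept, not Hydra's `instantiate` function: the `instantiate` helper below is hypothetical and handles only a `_target_` key plus keyword arguments, and the targets are standard-library classes chosen so the example runs anywhere.

```python
import importlib


def instantiate(cfg: dict):
    """Build the object named by cfg['_target_'], passing the other keys as kwargs."""
    module_path, _, attr_name = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    kwargs = {
        # Nested configs that carry their own _target_ are built first.
        key: instantiate(value) if isinstance(value, dict) and "_target_" in value else value
        for key, value in cfg.items()
        if key != "_target_"
    }
    return target(**kwargs)


# A flat config: the class to build and its constructor arguments.
timeout = instantiate({"_target_": "datetime.timedelta", "hours": 12})

# A nested config: the inner object is created before the outer one.
registry = instantiate(
    {
        "_target_": "collections.OrderedDict",
        "ratio": {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2},
    }
)

print(timeout, registry)
```

Because the class path lives in the config, swapping one model or datamodule for another becomes a config change rather than a code change.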

Project Structure

The directory structure of a new project looks like this:

├── .github                   <- Github Actions workflows
│
├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── data                     <- Data configs
│   ├── debug                    <- Debugging configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hparams_search           <- Hyperparameter search configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── model                    <- Model configs
│   ├── paths                    <- Project paths configs
│   ├── trainer                  <- Trainer configs
│   │
│   ├── eval.yaml             <- Main config for evaluation
│   └── train.yaml            <- Main config for training
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra and lightning loggers
│
├── notebooks              <- Jupyter notebooks. Naming convention is a number (for ordering),
│                             the creator's initials, and a short `-` delimited description,
│                             e.g. `1.0-jqp-initial-data-exploration.ipynb`.
│
├── scripts                <- Shell scripts
│
├── src                    <- Source code
│   ├── data                     <- Data scripts
│   ├── models                   <- Model scripts
│   ├── utils                    <- Utility scripts
│   │
│   ├── eval.py                  <- Run evaluation
│   └── train.py                 <- Run training
│
├── tests                  <- Tests of any kind
│
├── .env.example              <- Example of file for storing private environment variables
├── .gitignore                <- List of files ignored by git
├── .pre-commit-config.yaml   <- Configuration of pre-commit hooks for code formatting
├── .project-root             <- File for inferring the position of project root directory
├── environment.yaml          <- File for installing conda environment
├── Makefile                  <- Makefile with commands like `make train` or `make test`
├── pyproject.toml            <- Configuration options for testing and linting
├── requirements.txt          <- File for installing python dependencies
├── setup.py                  <- File for installing project as a package
└── README.md
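The `.project-root` marker in the tree above exists so that code can locate the project root regardless of the current working directory. The sketch below shows one way such a lookup could work; `find_project_root` is a hypothetical helper written for illustration, not necessarily the mechanism the template itself uses.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = ".project-root") -> Optional[Path]:
    """Walk upwards from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None


# Demonstrate on a throwaway directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "my_project"
    nested = root / "src" / "models"
    nested.mkdir(parents=True)
    (root / ".project-root").touch()

    found = find_project_root(nested)
    print(found == root)
```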

🚀 Quickstart

# clone project
git clone https://github.com/ashleve/lightning-hydra-template
cd lightning-hydra-template

# [OPTIONAL] create conda environment
conda create -n myenv python=3.9
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

The template contains an example with MNIST classification.

When running python src/train.py you should see something like this:

[terminal screenshot]

⚡ Your Superpowers

Override any config parameter from command line

python train.py trainer.max_epochs=20 model.optimizer.lr=1e-4

Note: You can also add new parameters with the + sign.

python train.py +model.new_param="owo"
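The difference between a plain override and a `+` override can be sketched in a few lines of Python. This is a simplified model of the behaviour, not Hydra's parser: `apply_override` and `parse_value` are hypothetical helpers, and the starting `cfg` is a made-up fragment that mirrors the keys used in the commands above.

```python
import ast


def parse_value(raw: str):
    """Turn '20' into 20 and '1e-4' into 0.0001; leave anything else as a string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_override(cfg: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override in place; a leading '+' adds a new key."""
    key_path, _, raw = override.partition("=")
    adding = key_path.startswith("+")
    *parents, leaf = key_path.lstrip("+").split(".")
    node = cfg
    for key in parents:
        # "+" may create missing parents; a plain override requires them to exist.
        node = node.setdefault(key, {}) if adding else node[key]
    if adding == (leaf in node):
        problem = "already exists" if adding else "does not exist; prefix it with +"
        raise KeyError(f"{key_path.lstrip('+')} {problem}")
    node[leaf] = parse_value(raw)


# Hypothetical starting config with the keys used in the commands above.
cfg = {"trainer": {"max_epochs": 10}, "model": {"optimizer": {"lr": 0.001}}}
for item in ("trainer.max_epochs=20", "model.optimizer.lr=1e-4", "+model.new_param=owo"):
    apply_override(cfg, item)

print(cfg)
```

Rejecting a plain override of an unknown key is deliberate in this sketch: it catches typos in parameter names instead of silently adding an unused value.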

Train on CPU, GPU, multi-GPU and TPU

# train on CPU
python train.py trainer=cpu

# train on 1 GPU
python train.py trainer=gpu

# train on TPU
python train.py trainer.accelerator=tpu trainer.devices=8

# train with DDP (Distributed Data Parallel) (4 GPUs)
python train.py trainer=ddp trainer.devices=4

# train with DDP (Distributed Data Parallel) (8 GPUs, 2 nodes)
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=2

# simulate DDP on CPU processes
python train.py trainer=ddp_sim trainer.devices=2

# accelerate training on mac
python train.py trainer=mps

Warning: Currently there are problems with DDP mode, read this issue to learn more.
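The GPU counts in the DDP comments above follow from simple arithmetic: `trainer.devices` is counted per node, so the total number of processes is the product of devices and nodes. A minimal sketch, with `world_size` as a hypothetical helper name:

```python
def world_size(devices_per_node: int, num_nodes: int = 1) -> int:
    """Total number of DDP processes: one per device on every node."""
    return devices_per_node * num_nodes


print(world_size(4))     # trainer.devices=4 on a single node -> 4
print(world_size(4, 2))  # trainer.devices=4 trainer.num_nodes=2 -> 8
```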

Train with mixed precision

# train with pytorch native automatic mixed precision (AMP)
python train.py trainer=gpu +trainer.precision=16

Train model with any logger available in PyTorch Lightning, like W&B or Tensorboard

# set project and entity names in `configs/logger/wandb`
wandb:
  project: "your_project_name"
  entity: "your_wandb_team_name"

# train model with Weights&Biases (link to wandb dashboard should appear in the terminal)
python train.py logger=wandb

Note: Lightning provides convenient integrations with most popular logging frameworks. Learn more here.

Note: Using wandb requires you to set up an account first. After that, just complete the config as shown above.

Note: Click here to see example wandb dashboard generated with this template.

Train model with chosen experiment config

python train.py experiment=example

Note: Experiment configs are placed in configs/experiment/.

Attach some callbacks to run

python train.py callbacks=default

Note: Callbacks can be used for things such as model checkpointing, early stopping and many more.

Note: Callbacks configs are placed in configs/callbacks/.

Use different tricks available in PyTorch Lightning

# gradient clipping may be enabled to avoid exploding gradients
python train.py +trainer.gradient_clip_val=0.5

# run validation loop 4 times during a training epoch
python train.py +trainer.val_check_interval=0.25

# accumulate gradients
python train.py +trainer.accumulate_grad_batches=10

# terminate training after 12 hours
python train.py +trainer.max_time="00:12:00:00"

Note: PyTorch Lightning provides 40+ useful trainer flags.
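The `max_time` string above uses the `DD:HH:MM:SS` layout, which is easy to misread as hours first. The sketch below decodes it with the standard library; `parse_max_time` is a hypothetical helper for illustration and is not part of Lightning.

```python
from datetime import timedelta


def parse_max_time(text: str) -> timedelta:
    """Convert a 'DD:HH:MM:SS' string into a timedelta."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


limit = parse_max_time("00:12:00:00")
print(limit.total_seconds() / 3600)  # duration in hours
```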

Easily debug

# runs 1 epoch in default debugging mode
# changes logging directory to `logs/debugs/...`
# sets level of all command line loggers to 'DEBUG'
# enforces debug-friendly configuration
python train.py debug=default

# run 1 train, val and test loop, using only 1 batch
python train.py debug=fdr

# print execution time profiling
python train.py debug=profiler

# try overfitting to 1 batch
python train.py debug=overfit

# raise exception if there are any numerical anomalies in tensors, like NaN or +/-inf
python train.py +trainer.detect_anomaly=true

# use only 20% of the data
python train.py +trainer.limit_train_batches=0.2 \
+trainer.limit_val_batches=0.2 +trainer.limit_test_batches=0.2

Note: Visit configs/debug/ for different debugging configs.
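The `limit_*_batches` flags above accept a fraction of the available batches. As a rough sketch of the arithmetic, assuming a float limit is applied by truncating the product and an integer limit is treated as a batch count, the number of batches actually used can be estimated like this (`limited_batches` is a hypothetical helper, not a Lightning function):

```python
def limited_batches(total_batches: int, limit) -> int:
    """Estimate how many batches run for a given limit (fraction or count)."""
    if isinstance(limit, float):
        # Fraction of the dataloader, truncated towards zero.
        return int(total_batches * limit)
    # Integer limits are read as an absolute number of batches.
    return min(total_batches, limit)


print(limited_batches(500, 0.2))  # 20% of 500 batches
print(limited_batches(500, 10))   # at most 10 batches
```

With very small dataloaders a fractional limit can truncate to very few batches, so it is worth checking the resulting count before relying on it for debugging.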

Resume training from checkpoint

python train.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: The checkpoint can be either a path or a URL.

Note: Currently, loading a checkpoint doesn't resume the logger experiment, but this will be supported in a future Lightning release.

Evaluate checkpoint on test dataset

python eval.py ckpt_path="/path/to/ckpt/name.ckpt"

Note: The checkpoint can be either a path or a URL.

Create a sweep over hyperparameters

# this will run 6 experiments one after the other,
# each with different combination of batch_size and learning rate
python train.py -m data.batch_size=32,64,128 model.optimizer.lr=0.001,0.0005

Note: Hydra composes configs lazily at job launch time. If you change code or configs after launching a job/sweep, the final composed configs might be impacted.
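The count of 6 runs in the sweep above is the Cartesian product of the two value lists (3 batch sizes times 2 learning rates). The sketch below enumerates the combinations with `itertools.product`; it only illustrates how a grid expands and does not launch anything. The learning-rate key is written as `model.optimizer.lr`, matching the override example earlier in this README.

```python
from itertools import product

# One entry per swept parameter, with its comma-separated values as a list.
sweep = {
    "data.batch_size": [32, 64, 128],
    "model.optimizer.lr": [0.001, 0.0005],
}

keys = list(sweep)
jobs = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

for index, job in enumerate(jobs):
    overrides = " ".join(f"{key}={value}" for key, value in job.items())
    print(f"job {index}: {overrides}")
```

Each added parameter multiplies the number of jobs, which is why a grid sweep stops being practical once more than a few parameters are involved.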

Create a sweep over hyperparameters with Optuna

# this will run hyperparameter search defined in `configs/hparams_search/mnist_optuna.yaml`
# over chosen experiment config
python train.py -m hparams_search=mnist_optuna experiment=example

Note: Using Optuna Sweeper doesn't require you to add any boilerplate to your code; everything is defined in a single config file.
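Conceptually, a hyperparameter sweeper repeats three steps: sample parameters, run a trial, keep the best result. The sketch below shows that loop with plain random search from the standard library. It is a stand-in for illustration only: Optuna's default sampler is smarter than random search, and the `objective` function and search ranges here are hypothetical, not taken from `mnist_optuna.yaml`.

```python
import random


def objective(params: dict) -> float:
    """Hypothetical stand-in for a training run that returns a score to maximize."""
    return 1.0 - abs(params["lr"] - 0.01) * 10 - abs(params["batch_size"] - 64) / 1000


def random_search(objective, n_trials: int, seed: int = 0):
    """Sample parameters at random and return the best (params, score) pair."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
            "batch_size": rng.choice([32, 64, 128, 256]),
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(objective, n_trials=20)
print(best_params, best_score)
```

In the template the training script plays the role of `objective`, and the metric it returns is what the sweeper optimizes.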

Core symbols most depended-on inside this repo

tests/helpers/package_available.py

run_sh_command · called by 5 · tests/helpers/run_sh_command.py

model_step · called by 3 · src/models/mnist_module.py

src/data/mnist_datamodule.py

Shape

Function 35

Method 25

Class 4

Languages

Python 100%

Modules by API surface

src/models/mnist_module.py: 13 symbols
src/data/mnist_datamodule.py: 11 symbols
tests/test_train.py: 6 symbols
tests/test_sweeps.py: 5 symbols
tests/conftest.py: 4 symbols
src/utils/utils.py: 4 symbols
src/models/components/simple_dense_net.py: 3 symbols
tests/test_configs.py: 2 symbols
tests/helpers/run_if.py: 2 symbols
src/utils/rich_utils.py: 2 symbols
src/utils/instantiators.py: 2 symbols
src/train.py: 2 symbols

Dependencies from manifests, versioned

hydra-colorlog 1.2.0 · 1×
hydra-core 1.3.2 · 1×
hydra-optuna-sweeper 1.2.0 · 1×
lightning 2.0.0 · 1×
torch 2.0.0 · 1×
torchmetrics 0.11.4 · 1×
torchvision 0.15.0 · 1×

For agents

$ claude mcp add lightning-hydra-template \
  -- python -m otcore.mcp_server <graph>
