
github.com/espnet/espnet @v.202604-patch1


11,578 symbols · 50,404 edges · 2,073 files · 4,144 documented (36%) · updated 4d ago · v.202604-patch1 · 2026-04-22 · ★ 9,877 · 16 open issues

README

ESPnet: end-to-end speech processing toolkit

system/pytorch ver.	2.9.1	2.10.0	2.11.0
ubuntu/python3.10/pip
ubuntu/python3.12/pip
ubuntu/python3.10/conda
debian12/python3.10/conda
windows/python3.10/pip
macos/python3.10/pip
macos/python3.10/conda

Docs | Example (ESPnet2) | Docker | Notebook

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
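
Kaldi-style data directories typically describe a dataset with plain-text tables such as `wav.scp` and `text`, where each line holds an utterance ID followed by a value. The following is a minimal sketch of reading such tables, assuming whitespace-separated lines. The helper name `read_kaldi_table` and the sample IDs and paths are hypothetical and are not part of ESPnet.

```python
from typing import Dict, Iterable


def read_kaldi_table(lines: Iterable[str]) -> Dict[str, str]:
    """Parse Kaldi-style "<utt_id> <value>" lines into a dict (hypothetical helper)."""
    table: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so the value may contain spaces.
        parts = line.split(maxsplit=1)
        utt_id = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if utt_id in table:
            raise ValueError(f"duplicate utterance ID: {utt_id}")
        table[utt_id] = value
    return table


# Hypothetical sample lines standing in for the contents of wav.scp and text.
wav_scp = read_kaldi_table(["utt1 /data/utt1.wav", "utt2 /data/utt2.wav"])
text = read_kaldi_table(["utt1 HELLO WORLD", "utt2 GOOD MORNING"])

# Pair audio paths with transcripts for the IDs present in both tables.
pairs = {utt: (wav_scp[utt], text[utt]) for utt in wav_scp if utt in text}
```

Keying everything by utterance ID is what lets separate files (audio paths, transcripts, speaker maps) be joined without sharing a single schema.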

Tutorial Series

2019 Tutorial at Interspeech
- Material
2021 Tutorial at CMU
- Online video
- Material
2022 Tutorial at CMU
- Usage of ESPnet (ASR as an example)
  - Online video
  - Material
- Add new models/tasks to ESPnet
  - Online video
  - Material

Key Features

Kaldi-style complete recipe

Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
Supports a number of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
Supports a voice conversion recipe (VCC2020 baseline)
Supports speaker diarization recipes (mini_librispeech, librimix)
Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
Hybrid CTC/attention based end-to-end ASR
Fast/accurate training with CTC/attention multitask training
CTC/attention joint decoding to boost monotonic alignment decoding
Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
Decoder: RNN (LSTM/GRU), Transformer, or S4
Attention: Flash Attention, Dot product, location-aware attention, variants of multi-head
Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
Batch GPU decoding
Data augmentation
Transducer based end-to-end ASR
Architecture:
- Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
- Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
- Pre-encoder: VGG2L or Conv2D available.
Search algorithms:
- Greedy search constrained to one emission per timestep.
- Default beam search algorithm [Graves, 2012] without prefix search.
- Alignment-Length Synchronous decoding [Saon et al., 2020].
- Time Synchronous Decoding [Saon et al., 2020].
- N-step Constrained beam search modified from [Kim et al., 2020].
- modified Adaptive Expansion Search based on [Kim et al., 2021] and NSC.
Features:
- Unified interface for offline and streaming speech recognition.
- Multi-task learning with various auxiliary losses:
  - Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
  - Decoder: cross-entropy w/ label smoothing.
- Transfer learning with an acoustic model and/or language model.
- Training with FastEmit regularization method [Yu et al., 2021].
  
  Please refer to the tutorial page for complete documentation.
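
The first search algorithm in the Transducer list, greedy search limited to one emission per timestep, can be illustrated with a toy loop. This is a sketch under stated assumptions, not ESPnet's implementation: the blank index `BLANK = 0`, the function `greedy_one_emission`, and the callbacks `joint_argmax` and `advance` are hypothetical stand-ins for the joint network and the decoder state update.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
State = TypeVar("State")

BLANK = 0  # assumed index of the blank symbol


def greedy_one_emission(
    enc_frames: Sequence[Frame],
    init_state: State,
    joint_argmax: Callable[[Frame, State], int],
    advance: Callable[[State, int], State],
) -> List[int]:
    """Emit at most one non-blank label per encoder frame."""
    state = init_state
    output: List[int] = []
    for frame in enc_frames:
        label = joint_argmax(frame, state)
        if label != BLANK:
            output.append(label)
            state = advance(state, label)  # the decoder only moves on a real label
        # Always step to the next frame, even after an emission.
    return output


# Toy stand-ins: each frame is a list of label scores, the state is the last label.
def toy_joint_argmax(frame: List[float], state: int) -> int:
    return max(range(len(frame)), key=lambda i: frame[i])


def toy_advance(state: int, label: int) -> int:
    return label


frames = [[0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]]
labels = greedy_one_emission(frames, BLANK, toy_joint_argmax, toy_advance)
```

The constraint bounds the number of joint-network calls by the number of encoder frames, at the cost of not being able to emit several labels from a single frame.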
CTC segmentation
Non-autoregressive model based on Mask-CTC
ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
Set frontend to s3prl
Select any upstream model by setting the frontend_conf to the corresponding name.
Transfer learning:
Easy to use, with transfer from models previously trained by your group or from models in the ESPnet Hugging Face repository.
Documentation and toy example runnable on colab.
Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
Restricted Self-Attention based on Longformer as an encoder for long sequences
OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning
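
The hybrid CTC/attention items in the list above rest on one idea: a weighted interpolation between the CTC branch and the attention decoder, applied to the losses during multitask training and to log-probabilities during joint decoding. The sketch below illustrates that interpolation on toy numbers. It is a simplification under stated assumptions: the names `Hypothesis`, `interpolate`, `best_hypothesis`, and `ctc_weight` are illustrative here, and it rescores complete hypotheses, whereas real joint decoding applies CTC prefix scores inside the beam search.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    att_logp: float  # log-probability from the attention decoder
    ctc_logp: float  # log-probability from the CTC branch


def interpolate(ctc_value: float, att_value: float, ctc_weight: float) -> float:
    """Return ctc_weight * ctc_value + (1 - ctc_weight) * att_value."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_value + (1.0 - ctc_weight) * att_value


def best_hypothesis(hyps: List[Hypothesis], ctc_weight: float) -> Hypothesis:
    """Pick the hypothesis with the highest interpolated log-probability."""
    return max(hyps, key=lambda h: interpolate(h.ctc_logp, h.att_logp, ctc_weight))


# Multitask training: combine the two branch losses (toy values).
loss = interpolate(ctc_value=2.0, att_value=4.0, ctc_weight=0.3)

# Joint decoding: the CTC score can overturn the attention decoder's preference.
hyps = [
    Hypothesis("a b", att_logp=-1.0, ctc_logp=-5.0),
    Hypothesis("a c", att_logp=-1.5, ctc_logp=-2.0),
]
attention_only = best_hypothesis(hyps, ctc_weight=0.0)
joint = best_hypothesis(hyps, ctc_weight=0.5)
```

In this toy case the attention decoder alone prefers the first hypothesis, while the joint score prefers the second because CTC assigns the first a much lower probability.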

Demonstration
- Real-time ASR demo with ESPnet2
- Gradio Web Demo on Hugging Face Spaces. Check out the Web Demo
- Streaming Transformer ASR Local Demo with ESPnet2

TTS: Text-to-speech

Architecture
- Tacotron2
- Transformer-TTS
- FastSpeech
- FastSpeech2
- Conformer FastSpeech & FastSpeech2
- VITS
- JETS
Multi-speaker & multi-language extension
- Pre-trained speaker embedding (e.g., X-vector)
- Speaker ID embedding
- Language ID embedding
- Global style token (GST) embedding
- Mix of the above embeddings
End-to-end training
- End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
- Joint training of text2mel and vocoder
Various language support
- En / Jp / Zh / De / Ru / and more
Integration with neural vocoders
- Parallel WaveGAN
- MelGAN
- Multi-band MelGAN
- HiFiGAN
- StyleMelGAN
- Mix of the above models

Demonstration
- Real-time TTS demo with ESPnet2
- Integrated into Hugging Face Spaces with Gradio

To train the neural vocoder, please check the following repositories:
- kan-bayashi/ParallelWaveGAN
- r9y9/wavenet_vocoder

SE: Speech enhancement (and separation)

Single-speaker speech enhancement
Multi-speaker speech separation
Unified encoder-separator-decoder structure for time-domain and frequency-domain models
Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
Separators: BLSTM, Transformer, Conformer, TasNet, DPRNN, SkiM, SVoice, DC-CRN, DCCRN, Deep Clustering, Deep Attractor Network, FaSNet, iFaSNet, Neural Beamformers, etc.
Flexible ASR integration: working as an individual task or as the ASR frontend
Easy to import pre-trained models from Asteroid
Both the pre-trained models from Asteroid and the specific configuration are supported.
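
The unified encoder-separator-decoder structure named in the list above can be sketched as three pluggable stages. This is a toy illustration, not ESPnet's API: the function `separate`, the identity encoder/decoder, and the sign-based `toy_separator` are hypothetical, and mask-based separation is only one of the approaches the listed separators may take.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def separate(
    mixture: Sequence[float],
    encoder: Callable[[Sequence[float]], Signal],
    separator: Callable[[Signal], List[Signal]],
    decoder: Callable[[Signal], Signal],
) -> List[Signal]:
    """Encode the mixture, estimate one mask per source, decode each masked feature."""
    features = encoder(mixture)  # e.g. STFT or a learned convolution
    masks = separator(features)  # one mask per estimated source
    return [decoder([m * f for m, f in zip(mask, features)]) for mask in masks]


def identity(signal: Sequence[float]) -> Signal:
    """Stand-in for a real encoder or decoder."""
    return list(signal)


def toy_separator(features: Signal) -> List[Signal]:
    """Two complementary masks: positive samples versus everything else."""
    first = [1.0 if f > 0 else 0.0 for f in features]
    second = [1.0 - m for m in first]
    return [first, second]


mixture = [0.5, -0.25, 1.0, -1.0]
sources = separate(mixture, identity, toy_separator, identity)
```

Because the stages only communicate through features and masks, swapping STFT/iSTFT for convolution/transposed-convolution, or one separator for another, leaves the surrounding code unchanged.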

Demonstration
- Interactive SE demo with ESPnet2
- Streaming SE demo with ESPnet2

ST: Speech Translation & MT: Machine Translation

State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
Transformer-based end-to-end ST (new!)
Transformer

Core symbols most depended-on inside this repo

append

called by 2923

espnet2/sds/utils/chat.py

format

called by 2487

egs2/cmu_kids/asr1/local/sph2wav.py

split

called by 1523

egs2/pjs/svs1/local/prep_segments.py

size

called by 1463

egs2/ml_superb/asr1/local/linguistic_tree.py

update

called by 443

espnet2/asr/encoder/beats_encoder.py

items

called by 412

espnet2/legacy/utils/io_utils.py

open

called by 376

egs2/cmu_kids/asr1/local/sph2wav.py

called by 345

espnet2/fileio/score_scp.py

Shape

Method 5,539

Function 4,341

Class 1,556

Route 142

Languages

Python 100%

Modules by API surface

egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/gru.py · 98 symbols

egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/basic_layers.py · 91 symbols

test/espnet2/speechlm/model/test_speechlm_job.py · 82 symbols

espnet2/train/preprocessor.py · 82 symbols

egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/trivial_layers.py · 80 symbols

espnet2/enh/layers/ncsnpp_utils/layers.py · 67 symbols

test/espnet2/speechlm/model/test_parallel.py · 65 symbols

espnet2/legacy/nets/pytorch_backend/rnn/attentions.py · 65 symbols

espnet2/asr/state_spaces/s4.py · 58 symbols

test/espnet3/components/modeling/test_model_with_optim_scheduler.py · 55 symbols

test/espnet2/speechlm/model/speechlm/multimodal_io/test_audio.py · 55 symbols

espnet2/train/dataset.py · 53 symbols

Dependencies from manifests, versioned

@vuepress/bundler-vite 2.0.0-rc.14 · 1×

js-yaml 4.1.1 · 1×

vue 3.4.31 · 1×

vuepress 2.0.0-rc.14 · 1×

vuepress-plugin-search-pro 2.0.0-rc.51 · 1×

vuepress-theme-hope 2.0.0-rc.51 · 1×

PyYAML 5.1.2 · 1×

configargparse 1.2.1 · 1×

datasets · 1×

einops · 1×

espnet_model_zoo · 1×

humanfriendly · 1×

For agents

$ claude mcp add espnet \
  -- python -m otcore.mcp_server <graph>
