github.com/espnet/espnet @v.202604-patch1

11,578 symbols · 50,404 edges · 2,073 files · 4,144 documented (36%) · updated 4d ago · v.202604-patch1 · 2026-04-22 · ★ 9,877 · 16 open issues
README

ESPnet: end-to-end speech processing toolkit

system/pytorch ver. 2.9.1 2.10.0 2.11.0
ubuntu/python3.10/pip ci on ubuntu ci on ubuntu ci on ubuntu
ubuntu/python3.12/pip ci on ubuntu ci on ubuntu ci on ubuntu
ubuntu/python3.10/conda ci on debian12
debian12/python3.10/conda ci on debian12
windows/python3.10/pip ci on windows
macos/python3.10/pip ci on macos
macos/python3.10/conda ci on macos


Docs | Example (ESPnet2) | Docker | Notebook


ESPnet is an end-to-end speech processing toolkit covering speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for a variety of speech processing experiments.

Tutorial Series

Key Features

Kaldi-style complete recipe

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
  • Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
  • Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
  • Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
  • Supports a voice conversion recipe (VCC2020 baseline)
  • Supports speaker diarization recipes (mini_librispeech, librimix)
  • Supports singing voice synthesis recipes (ofuton_p_utagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
  • Fast/accurate training with CTC/attention multitask training
  • CTC/attention joint decoding to boost monotonic alignment decoding
  • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
  • Decoder: RNN (LSTM/GRU), Transformer, or S4
  • Attention: Flash Attention, Dot product, location-aware attention, variants of multi-head
  • Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
  • Batch GPU decoding
  • Data augmentation
  • Transducer based end-to-end ASR
    • Architecture:
      • Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
      • Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
      • Pre-encoder: VGG2L or Conv2D available.
    • Search algorithms:
    • Features:
      • Unified interface for offline and streaming speech recognition.
      • Multi-task learning with various auxiliary losses:
        • Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
        • Decoder: cross-entropy w/ label smoothing.
      • Transfer learning with an acoustic model and/or language model.
      • Training with FastEmit regularization method [Yu et al., 2021].

      Please refer to the tutorial page for complete documentation.

  • CTC segmentation
  • Non-autoregressive model based on Mask-CTC
  • ASR examples for supporting endangered language documentation (Please refer to egs/puebla_nahuatl and egs/yoloxochitl_mixtec for details)
  • Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
  • Self-supervised learning representations as features, using upstream models from S3PRL in the frontend.
    • Set frontend to s3prl.
    • Select any upstream model by setting the frontend_conf to the corresponding name.
  • Transfer learning:
    • Easy to use, with transfer from models previously trained by your group or from models in the ESPnet Hugging Face repository.
    • Documentation and a toy example runnable on Colab.
  • Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
  • Restricted Self-Attention based on Longformer as an encoder for long sequences
  • OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning
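
The hybrid CTC/attention items in the list above combine two objectives during training and two scores during decoding. The following is a minimal sketch of that interpolation, not ESPnet's implementation: the function names, the dictionary keys, and the default weight of 0.3 are illustrative assumptions, and whole hypotheses are rescored here, whereas a real beam search would combine the scores step by step.

```python
import math


def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Multitask objective: weighted sum of the CTC and attention losses."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(ctc_logp, att_logp, ctc_weight=0.3, lm_logp=0.0, lm_weight=0.0):
    """Joint decoding score: interpolated log-probabilities plus an optional LM term."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp + lm_weight * lm_logp


def rerank(hyps, ctc_weight=0.3, lm_weight=0.0):
    """Sort hypotheses (hypothetical dicts) by joint score, best first."""
    def score(h):
        return joint_score(
            h["ctc_logp"], h["att_logp"], ctc_weight, h.get("lm_logp", 0.0), lm_weight
        )
    return sorted(hyps, key=score, reverse=True)


# Made-up scores: the attention decoder prefers A, CTC prefers B.
hyps = [
    {"text": "A", "ctc_logp": -2.0, "att_logp": -1.0},
    {"text": "B", "ctc_logp": -0.5, "att_logp": -1.5},
]
attention_only = rerank(hyps, ctc_weight=0.0)
with_ctc = rerank(hyps, ctc_weight=0.3)
```

With these made-up numbers the attention-only ranking puts hypothesis A first, while adding the CTC term moves B to the top, which is the kind of correction joint decoding is meant to provide.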

Demonstration

  • Real-time ASR demo with ESPnet2 (Open in Colab)
  • Gradio Web Demo on Hugging Face Spaces. Check out the Web Demo
  • Streaming Transformer ASR Local Demo with ESPnet2
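
For the S3PRL frontend mentioned in the ASR feature list above, the two settings (frontend set to s3prl, and an upstream name inside frontend_conf) might be expressed as the following configuration, written here as a Python dictionary. This is a sketch under assumptions: the nested `frontend_conf` layout, the helper name `s3prl_frontend_config`, and the upstream name `hubert_base` are not stated in this README, so check an actual recipe configuration before relying on them.

```python
import json


def s3prl_frontend_config(upstream: str) -> dict:
    """Build an assumed training-config fragment selecting an S3PRL upstream."""
    return {
        "frontend": "s3prl",
        "frontend_conf": {
            # Assumed nesting: options passed through to the S3PRL frontend.
            "frontend_conf": {"upstream": upstream},
        },
    }


# "hubert_base" is only an example upstream name (an assumption).
config = s3prl_frontend_config("hubert_base")
print(json.dumps(config, indent=2))
```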

TTS: Text-to-speech

  • Architecture
    • Tacotron2
    • Transformer-TTS
    • FastSpeech
    • FastSpeech2
    • Conformer FastSpeech & FastSpeech2
    • VITS
    • JETS
  • Multi-speaker & multi-language extension
    • Pre-trained speaker embedding (e.g., X-vector)
    • Speaker ID embedding
    • Language ID embedding
    • Global style token (GST) embedding
    • Mix of the above embeddings
  • End-to-end training
    • End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
    • Joint training of text2mel and vocoder
  • Various language support
    • En / Jp / Zh / De / Ru / and more...
  • Integration with neural vocoders
    • Parallel WaveGAN
    • MelGAN
    • Multi-band MelGAN
    • HiFiGAN
    • StyleMelGAN
    • Mix of the above models

Demonstration

  • Real-time TTS demo with ESPnet2 (Open in Colab)
  • Integrated to Hugging Face Spaces with Gradio. See demo: Hugging Face Spaces

To train the neural vocoder, please check the following repositories:

  • kan-bayashi/ParallelWaveGAN
  • r9y9/wavenet_vocoder

SE: Speech enhancement (and separation)

  • Single-speaker speech enhancement
  • Multi-speaker speech separation
  • Unified encoder-separator-decoder structure for time-domain and frequency-domain models
    • Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
    • Separators: BLSTM, Transformer, Conformer, TasNet, DPRNN, SkiM, SVoice, DC-CRN, DCCRN, Deep Clustering, Deep Attractor Network, FaSNet, iFaSNet, Neural Beamformers, etc.
  • Flexible ASR integration: working as an individual task or as the ASR frontend
  • Easy to import pre-trained models from Asteroid
    • Both the pre-trained models from Asteroid and the specific configuration are supported.
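
The encoder-separator-decoder structure named in the list above can be pictured as three interchangeable stages: the encoder maps a waveform to features, the separator estimates one mask per source, and the decoder maps each masked feature sequence back to a waveform. The toy below only illustrates that data flow. Every function in it is a hypothetical stand-in (identity encoder and decoder, a sign-based mask), not an ESPnet class.

```python
from typing import Callable, List

Vector = List[float]


def identity_encoder(wave: Vector) -> Vector:
    """Stand-in for STFT or a learned convolution: features are the samples."""
    return list(wave)


def identity_decoder(features: Vector) -> Vector:
    """Stand-in for iSTFT or a transposed convolution."""
    return list(features)


def sign_separator(features: Vector) -> List[Vector]:
    """Toy separator: two complementary binary masks, split by sign."""
    positive = [1.0 if f >= 0.0 else 0.0 for f in features]
    negative = [1.0 - m for m in positive]
    return [positive, negative]


def separate(
    mixture: Vector,
    encoder: Callable[[Vector], Vector],
    separator: Callable[[Vector], List[Vector]],
    decoder: Callable[[Vector], Vector],
) -> List[Vector]:
    """Encode, estimate one mask per source, apply each mask, then decode."""
    features = encoder(mixture)
    masks = separator(features)
    return [decoder([m * f for m, f in zip(mask, features)]) for mask in masks]


mixture = [0.5, -0.25, 0.75, -1.0]
sources = separate(mixture, identity_encoder, sign_separator, identity_decoder)
```

Because the stages only meet through `separate`, swapping the identity pair for an STFT/iSTFT pair, or the toy mask for a learned separator, would leave the surrounding code unchanged.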

Demonstration

  • Interactive SE demo with ESPnet2 (Open in Colab)
  • Streaming SE demo with ESPnet2 (Open in Colab)

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer-based end-to-end ST (new!)
  • Transformer

Core symbols most depended-on inside this repo

  • append · called by 2923 · espnet2/sds/utils/chat.py
  • format · called by 2487 · egs2/cmu_kids/asr1/local/sph2wav.py
  • split · called by 1523 · egs2/pjs/svs1/local/prep_segments.py
  • size · called by 1463 · egs2/ml_superb/asr1/local/linguistic_tree.py
  • update · called by 443 · espnet2/asr/encoder/beats_encoder.py
  • items · called by 412 · espnet2/legacy/utils/io_utils.py
  • open · called by 376 · egs2/cmu_kids/asr1/local/sph2wav.py
  • close · called by 345 · espnet2/fileio/score_scp.py
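
A "called by" count like those above can be read as the in-degree of a symbol in a call graph. The sketch below shows one plausible way to derive such a ranking from a list of (caller, callee) edges. How this index actually resolves and counts calls is not stated on the page, so the edge format and the function name `most_depended_on` are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def most_depended_on(edges: Iterable[Tuple[str, str]], top: int = 3) -> List[Tuple[str, int]]:
    """Rank callees by how many call edges point at them (in-degree)."""
    counts = Counter(callee for _caller, callee in edges)
    return counts.most_common(top)


# Hypothetical call edges: (caller, callee).
edges = [
    ("load_scp", "append"),
    ("prep_segments", "append"),
    ("sph2wav", "append"),
    ("sph2wav", "format"),
    ("prep_segments", "format"),
    ("prep_segments", "split"),
]
ranking = most_depended_on(edges, top=2)
```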

Shape

Method 5,539
Function 4,341
Class 1,556
Route 142

Languages

Python 100%

Modules by API surface

egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/gru.py · 98 symbols
egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/basic_layers.py · 91 symbols
test/espnet2/speechlm/model/test_speechlm_job.py · 82 symbols
espnet2/train/preprocessor.py · 82 symbols
egs2/TEMPLATE/asr1/steps/libs/nnet3/xconfig/trivial_layers.py · 80 symbols
espnet2/enh/layers/ncsnpp_utils/layers.py · 67 symbols
test/espnet2/speechlm/model/test_parallel.py · 65 symbols
espnet2/legacy/nets/pytorch_backend/rnn/attentions.py · 65 symbols
espnet2/asr/state_spaces/s4.py · 58 symbols
test/espnet3/components/modeling/test_model_with_optim_scheduler.py · 55 symbols
test/espnet2/speechlm/model/speechlm/multimodal_io/test_audio.py · 55 symbols
espnet2/train/dataset.py · 53 symbols

Dependencies from manifests, versioned

@vuepress/bundler-vite 2.0.0-rc.14 · 1×
js-yaml 4.1.1 · 1×
vue 3.4.31 · 1×
vuepress 2.0.0-rc.14 · 1×
vuepress-plugin-search-pro 2.0.0-rc.51 · 1×
vuepress-theme-hope 2.0.0-rc.51 · 1×
PyYAML 5.1.2 · 1×
configargparse 1.2.1 · 1×
datasets
espnet_model_zoo
humanfriendly

For agents

$ claude mcp add espnet \
  -- python -m otcore.mcp_server <graph>
