github.com/NVIDIA-NeMo/Speech @v2.7.3 sqlite

23,326 symbols · 99,699 edges · 2,307 files · 10,585 documented (45%)

README

NVIDIA NeMo Speech Collection

Latest News

NVIDIA-Nemotron-3-Nano-30B-A3B is out with fully reproducible scripts and recipes! Check out NeMo Megatron-Bridge, NeMo AutoModel, NeMo-RL, and the NGC container to try it! (2025-12-15)

⚠️ Pivot notice: This repo will pivot to focus on speech model collections only. Please refer to the NeMo Framework GitHub org for the complete list of repos under NeMo Framework.

NeMo 2.0, with its support for LLMs and VLMs, will be deprecated by 25.11 and replaced by NeMo Megatron-Bridge and NeMo AutoModel. More details can be found in the NeMo Framework GitHub org README. (2025-10-10)

Deprecated collections (will be removed in a later release): avlm · diffusion · llm · multimodal · multimodal-autoregressive · nlp · speechlm · vision · vlm

Pretrain and fine-tune 🤗 Hugging Face models via AutoModel

  NeMo Framework's latest feature, AutoModel, enables broad support for 🤗 Hugging Face models, with 25.04 focusing on:

AutoModelForCausalLM in the Text Generation category
AutoModelForImageTextToText in the Image-Text-to-Text category

More details in the blog: Run Hugging Face Models Instantly with Day-0 Support from NVIDIA NeMo Framework. Future releases will enable support for more model families, such as video generation models. (2025-05-19)

Training on Blackwell using NeMo

  NeMo Framework has added Blackwell support, with <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html>performance benchmarks on GB200 & B200</a>. More optimizations will come in upcoming releases. (2025-05-19)

Training Performance on GPU Tuning Guide

  NeMo Framework has published <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html>a comprehensive guide for performance tuning to achieve optimal throughput</a>! (2025-05-19)

New Models Support

  NeMo Framework has added support for the latest community models: <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/llama4.html>Llama 4</a>, <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/vision/diffusionmodels/flux.html>Flux</a>, <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/llama_nemotron.html>Llama Nemotron</a>, <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/hyena.html#>Hyena & Evo2</a>, <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/vlms/qwen2vl.html>Qwen2-VL</a>, <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/qwen2.html>Qwen2.5</a>, Gemma3, and Qwen3-30B & 32B. (2025-05-19)

NeMo Framework 2.0

  We've released NeMo 2.0, an update to the NeMo Framework that prioritizes modularity and ease of use. Please refer to the <a href=https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/index.html>NeMo Framework User Guide</a> to get started.

New Cosmos World Foundation Models Support

Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform (2025-01-09)

    The end-to-end NVIDIA Cosmos platform accelerates world model development for physical AI systems. Built on CUDA, Cosmos combines state-of-the-art world foundation models, video tokenizers, and AI-accelerated data processing pipelines. Developers can accelerate world model development by fine-tuning Cosmos world foundation models or building new ones from the ground up. These models create realistic synthetic videos of environments and interactions, providing a scalable foundation for training complex systems, from simulating humanoid robots performing advanced actions to developing end-to-end autonomous driving models.

    <a href="https://developer.nvidia.com/blog/accelerate-custom-video-foundation-model-pipelines-with-new-nvidia-nemo-framework-capabilities/">
      Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities
    </a> (2025-01-07)

    The NeMo Framework now supports training and customizing the <a href="https://github.com/NVIDIA/Cosmos">NVIDIA Cosmos</a> collection of world foundation models. Cosmos leverages advanced text-to-world generation techniques to create fluid, coherent video content from natural language prompts.

    You can also now accelerate your video processing step using the <a href="https://developer.nvidia.com/nemo-curator-video-processing-early-access">NeMo Curator</a> library, which provides optimized video processing and captioning features that can deliver up to 89x faster video processing when compared to an unoptimized CPU pipeline.

Large Language Models and Multimodal Models

    <a href="https://developer.nvidia.com/blog/state-of-the-art-multimodal-generative-ai-model-development-with-nvidia-nemo/">
      State-of-the-Art Multimodal Generative AI Model Development with NVIDIA NeMo
    </a> (2024-11-06)

    NVIDIA recently announced significant enhancements to the NeMo platform, focusing on multimodal generative AI models. The update includes NeMo Curator and the Cosmos tokenizer, which streamline the data curation process and enhance the quality of visual data. These tools are designed to handle large-scale data efficiently, making it easier to develop high-quality AI models for various applications, including robotics and autonomous driving. The Cosmos tokenizers, in particular, efficiently map visual data into compact, semantic tokens, which is crucial for training large-scale generative models. The tokenizer is available now on the <a href=https://github.com/NVIDIA/cosmos-tokenizer>NVIDIA/cosmos-tokenizer</a> GitHub repo and on <a href=https://huggingface.co/nvidia/Cosmos-Tokenizer-CV8x8x8>Hugging Face</a>.
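
    To make "compact" concrete, the sketch below computes the latent grid an 8x8x8 video tokenizer such as Cosmos-Tokenizer-CV8x8x8 would produce. It assumes 8x temporal and 8x8 spatial compression and a causal layout in which the first frame is encoded on its own, so inputs have 1 + 8k frames. The helper `cosmos_latent_shape` is hypothetical and not part of the cosmos-tokenizer package.

```python
def cosmos_latent_shape(num_frames: int, height: int, width: int,
                        temporal: int = 8, spatial: int = 8) -> tuple:
    """Return the (frames, height, width) latent grid for a causal video tokenizer.

    Assumption (hypothetical helper): the first frame is encoded alone and each
    following group of `temporal` frames collapses into one latent frame, while
    each spatial axis is downsampled by `spatial`.
    """
    if num_frames < 1 or (num_frames - 1) % temporal != 0:
        raise ValueError(f"num_frames must be 1 + {temporal}*k, got {num_frames}")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be divisible by the spatial factor")
    return (1 + (num_frames - 1) // temporal, height // spatial, width // spatial)


# A 33-frame 512x512 clip under the stated assumptions.
frames, h, w = cosmos_latent_shape(33, 512, 512)
print(frames, h, w, frames * h * w)  # 5 64 64 20480
```

    Under these assumptions, roughly 8.65 million pixel positions (33 × 512 × 512) map to 20,480 latent positions.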

    <a href="https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/llama/index.html#new-llama-3-1-support">
    New Llama 3.1 Support
    </a> (2024-07-23)

    The NeMo Framework now supports training and customizing the Llama 3.1 collection of LLMs from Meta.

    <a href="https://aws.amazon.com/blogs/machine-learning/accelerate-your-generative-ai-distributed-training-workloads-with-the-nvidia-nemo-framework-on-amazon-eks/">
      Accelerate your Generative AI Distributed Training Workloads with the NVIDIA NeMo Framework on Amazon EKS
    </a> (2024-07-16)

    NVIDIA NeMo Framework can now run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. For step-by-step instructions on creating an EKS cluster and running distributed training workloads with NeMo, see the <a href="https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher/EKS/">awsome-distributed-training</a> GitHub repository.

    <a href="https://developer.nvidia.com/blog/nvidia-nemo-accelerates-llm-innovation-with-hybrid-state-space-model-support/">
      NVIDIA NeMo Accelerates LLM Innovation with Hybrid State Space Model Support
    </a> (2024-06-17)

 NVIDIA NeMo and Megatron Core now support pre-training and fine-tuning of state space models (SSMs). NeMo also supports training models based on the Griffin architecture as described by Google DeepMind.

    <a href="https://huggingface.co/models?sort=trending&search=nvidia%2Fnemotron-4-340B">
      NVIDIA releases 340B base, instruct, and reward models pretrained on a total of 9T tokens.
    </a> (2024-06-18)

  See documentation and tutorials for SFT, PEFT, and PTQ with
  <a href="https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/nemotron/index.html">
    Nemotron 340B
  </a>
  in the NeMo Framework User Guide.

    <a href="https://developer.nvidia.com/blog/nvidia-sets-new-generative-ai-performance-and-scale-records-in-mlperf-training-v4-0/">
      NVIDIA sets new generative AI performance and scale records in MLPerf Training v4.0
    </a> (2024-06-12)

  Using the NVIDIA NeMo Framework and NVIDIA Hopper GPUs, NVIDIA was able to scale to 11,616 H100 GPUs and achieve near-linear performance scaling on LLM pretraining.
  NVIDIA also achieved the highest LLM fine-tuning performance and raised the bar for text-to-image training.

      <a href="https://cloud.google.com/blog/products/compute/gke-and-nvidia-nemo-framework-to-train-generative-ai-models">
        Accelerate your generative AI journey with NVIDIA NeMo Framework on GKE
      </a> (2024-03-16)

    An end-to-end walkthrough to train generative AI models on the Google Kubernetes Engine (GKE) using the NVIDIA NeMo Framework is available at https://github.com/GoogleCloudPlatform/nvidia-nemo-on-gke.
    The walkthrough includes detailed instructions on how to set up a Google Cloud project and pre-train a GPT model using the NeMo Framework.

Speech Recognition

    <a href="https://developer.nvidia.com/blog/accelerating-leaderboard-topping-asr-models-10x-with-nvidia-nemo/">
      Accelerating Leaderboard-Topping ASR Models 10x with NVIDIA NeMo
    </a> (2024-09-24)

  The NVIDIA NeMo team released a number of inference optimizations for CTC, RNN-T, and TDT models that resulted in up to a 10x inference speed-up.
  These models now exceed an inverse real-time factor (RTFx) of 2,000, with some reaching an RTFx of 6,000.
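
  RTFx is the ratio of audio duration to wall-clock processing time, so an RTFx of 2,000 means one hour of audio is transcribed in about 1.8 seconds. A minimal sketch of that arithmetic follows; the helper name `rtfx` is hypothetical and not part of the NeMo API.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute.

    Hypothetical helper for illustration; higher is faster (RTFx = 1 is real time).
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Hypothetical example: 10 hours of audio transcribed in 18 seconds.
print(rtfx(10 * 3600, 18))  # 2000.0
```

  The ordinary real-time factor is the reciprocal (processing time divided by audio time), which is why "inverse" appears in the name.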

    <a href="https://developer.nvidia.com/blog/new-standard-for-speech-recognition-and-translation-from-the-nvidia-nemo-canary-model/">
      New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model
    </a> (2024-04-18)

  The NeMo team just released Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization.
  Canary also provides bidirectional translation between English and the three other supported languages.
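
  The language coverage above implies a small task matrix: transcription within each of the four languages, plus translation in both directions between English and each of the other three. The sketch below enumerates that matrix. `canary_tasks` and the ISO-style language codes are hypothetical illustrations of the announced capabilities, not a NeMo function or its configuration values.

```python
from itertools import product

# Languages named in the announcement; English is the translation pivot.
LANGUAGES = ("en", "es", "de", "fr")


def canary_tasks(languages=LANGUAGES, pivot="en"):
    """Enumerate (task, source, target) triples (hypothetical helper).

    Same-language pairs are transcription ("asr"); cross-language pairs are
    translation ("ast") and are allowed only to or from the pivot language.
    """
    tasks = []
    for src, tgt in product(languages, repeat=2):
        if src == tgt:
            tasks.append(("asr", src, tgt))
        elif pivot in (src, tgt):
            tasks.append(("ast", src, tgt))
    return tasks


tasks = canary_tasks()
print(sum(t[0] == "asr" for t in tasks), sum(t[0] == "ast" for t in tasks))  # 4 6
```

  Pairs such as Spanish to German are excluded because neither side is English, which matches the "English and the three other supported languages" wording.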

    <a href="https://developer.nvidia.com/blog/pushing-

Extension points: exported contracts (how you extend this code)

Window (Interface)

(no doc)

examples/voice_agent/client/src/app.ts

Core symbols: most depended-on inside this repo

get

called by 2632

nemo/collections/asr/models/configs/diarizer_config.py

info

called by 2149

nemo/utils/nemo_logging.py

join

called by 1539

nemo/collections/common/callbacks/ema.py

size

called by 956

nemo/collections/asr/inference/streaming/framing/request.py

warning

called by 860

nemo/utils/nemo_logging.py

cat

called by 662

nemo/collections/diffusion/data/diffusion_taskencoder.py

exists

called by 609

nemo/export/tarutils.py

called by 592

nemo/collections/asr/parts/k2/graph_decoders.py

Shape

Method 13,938

Function 5,442

Class 3,389

Route 556

Interface 1

Languages

Python 100%

TypeScript 1%

Modules by API surface

nemo/collections/tts/modules/audio_codec_modules.py · 238 symbols

tests/core/test_typecheck.py · 138 symbols

nemo/collections/asr/parts/utils/streaming_utils.py · 128 symbols

nemo/lightning/megatron_parallel.py · 105 symbols

nemo/collections/tts/models/magpietts.py · 96 symbols

nemo/core/neural_types/elements.py · 93 symbols

nemo/collections/asr/modules/rnnt.py · 87 symbols

tests/collections/common/test_lhotse_dataloading.py · 84 symbols

nemo/collections/tts/losses/audio_codec_loss.py · 83 symbols

tests/collections/asr/test_asr_metrics.py · 82 symbols

nemo/collections/asr/models/classification_models.py · 79 symbols

nemo/collections/tts/modules/vits_modules.py · 78 symbols

Used by: 4 indexed graphs (manifest dependencies, hub-wide)

github.com/botpress/botpress

github.com/fonoster/fonoster

github.com/leon-ai/leon

github.com/mastra-ai/mastra

Dependencies: from manifests, versioned

@pipecat-ai/client-js 0.4.0 · 1×

@pipecat-ai/websocket-transport 0.4.1 · 1×

@types/node 22.15.30 · 1×

@types/protobufjs 6.0.0 · 1×

@types/react 19.2.2 · 1×

@types/react-dom 19.2.2 · 1×

@vitejs/plugin-react-swc 3.10.1 · 1×

protobufjs 7.4.0 · 1×

react 19.2.0 · 1×

react-dom 19.2.0 · 1×

typescript 5.8.3 · 1×

vite 6.3.5 · 1×

For agents

$ claude mcp add Speech \
  -- python -m otcore.mcp_server <graph>