github.com/Blaizzy/mlx-audio @v0.4.4 sqlite

9,463 symbols 28,993 edges 693 files 2,589 documented · 27%

README

MLX-Audio

The best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

Features
Installation
Quick Start
Supported Models
Model Examples
Web Interface & API Server
Quantization
Swift
Requirements
License
Citation
Acknowledgements

Features

Fast inference optimized for Apple Silicon (M series chips)
Multiple model architectures for TTS, STT, and STS
Multilingual support across models
Voice customization and cloning capabilities
Adjustable speech speed control
Interactive web interface with 3D audio visualization
OpenAI-compatible REST API
Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
Swift package for iOS/macOS integration
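
Because the REST API is described as OpenAI-compatible, a client can presumably send the same JSON body that OpenAI's speech endpoint accepts. The sketch below only builds such a request with the standard library and does not send it. The host, port, and the `/v1/audio/speech` path are assumptions taken from the shape of OpenAI's API rather than from this README, and `build_speech_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: a server running locally; host and port are illustrative.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, model, voice, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",  # path follows OpenAI's API; assumed here
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request(
    "Hello, world!",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
    voice="Chelsie",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; check the server documentation for the actual routes and response format before relying on this.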

Installation

Using pip

pip install mlx-audio

Using uv to install only the command line tools

Latest release from PyPI:

uv tool install --force mlx-audio --prerelease=allow

Latest code from GitHub:

uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow

For development or web interface:

git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev, server]"

Quick Start

Command Line

# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello, world!' --voice Chelsie

# With a different voice and language hint
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Welcome to MLX-Audio!' --voice Ethan --lang_code English

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --play

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --output_path ./my_audio

# Stream audio during generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream

# Stream audio during generation and save it to disk
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream --save

# Join multiple generated segments into one file
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text $'Hello!\nHow are you?' --voice Chelsie --join_audio

By default, when generation yields multiple segments, mlx-audio saves numbered files such as audio_000.wav and audio_001.wav. Use --join_audio to save one combined file instead. When using --stream, add --save to write the streamed audio to disk.
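
The effect of --join_audio on the numbered files can be pictured with a standard-library sketch. This is not mlx-audio's implementation; `join_wav_files` is a hypothetical helper that assumes every segment shares the same channel count, sample width, and sample rate.

```python
import wave
from pathlib import Path


def join_wav_files(segment_paths, output_path):
    """Concatenate WAV segments with identical formats into one file."""
    frames = []
    params = None
    for path in segment_paths:
        with wave.open(str(path), "rb") as segment:
            fmt = (
                segment.getnchannels(),
                segment.getsampwidth(),
                segment.getframerate(),
            )
            if params is None:
                params = fmt  # the first segment defines the expected format
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(segment.readframes(segment.getnframes()))
    if params is None:
        raise ValueError("no segments given")
    with wave.open(str(output_path), "wb") as joined:
        joined.setnchannels(params[0])
        joined.setsampwidth(params[1])
        joined.setframerate(params[2])
        joined.writeframes(b"".join(frames))
    return Path(output_path)
```

For example, `join_wav_files(sorted(Path(".").glob("audio_*.wav")), "joined.wav")` would combine audio_000.wav, audio_001.wav, and so on in order.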

Python API

from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")

# Generate speech
for result in model.generate(
    "Hello from MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
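
The loop above only prints the sample count. As a sketch of one possible next step, the helper below writes float samples to a 16-bit mono WAV file using only the standard library. `write_wav` is hypothetical, the 24000 Hz default is an assumption (check the model's actual sample rate), and it expects a list of plain Python floats in [-1, 1], so an `mx.array` would first need converting to a list.

```python
import struct
import wave


def write_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return len(samples) / sample_rate  # duration in seconds
```

The return value is the clip duration, which is simply the sample count divided by the sample rate.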

Supported Models

Text-to-Speech (TTS)

Model	Description	Languages	Repo
Kokoro	Fast, high-quality multilingual TTS	EN, JA, ZH, FR, ES, IT, PT, HI	bf16, 8bit, 6bit, 4bit
KittenTTS	Compact KittenTTS 0.8 models for edge-friendly TTS	EN	nano, micro, mini, collection
Qwen3-TTS	Alibaba's multilingual TTS with voice design	ZH, EN, JA, KO, + more	mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16
Higgs Audio v3	4B conversational TTS with voice cloning and inline control tokens	100 languages	bosonai/higgs-audio-v3-tts-4b
OmniVoice	Zero-shot multilingual TTS with voice cloning, batch generation, and nonverbal tags	646+ languages	mlx-community/OmniVoice-bf16
CSM / MisoTTS	Sesame-style conversational speech models with voice cloning	EN	mlx-community/csm-1b, MisoTTS bf16, MisoTTS 8bit
Dia	Dialogue-focused TTS	EN	mlx-community/Dia-1.6B-fp16
OuteTTS	Efficient TTS model	EN	mlx-community/OuteTTS-1.0-0.6B-fp16
Spark	SparkTTS model	EN, ZH	mlx-community/Spark-TTS-0.5B-bf16
Chatterbox	Expressive multilingual TTS	EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO	mlx-community/chatterbox-fp16
Soprano	High-quality TTS	EN	mlx-community/Soprano-1.1-80M-bf16
Ming Omni TTS (BailingMM)	Multimodal generation with voice cloning, style control, and speech/music/event generation	EN, ZH	mlx-community/Ming-omni-tts-16.8B-A3B-bf16
Ming Omni TTS (Dense)	Lightweight dense Ming Omni variant for voice cloning and style control	EN, ZH	mlx-community/Ming-omni-tts-0.5B-bf16
KugelAudio	SOTA 7B AR+Diffusion TTS for European languages	EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, + 14 more	kugelaudio/kugelaudio-0-open
Voxtral TTS	Mistral's 4B multilingual TTS (20 voices, 9 languages)	EN, FR, ES, DE, IT, PT, NL, AR, HI	mlx-community/Voxtral-4B-TTS-2603-mlx-bf16
LongCat-AudioDiT	SOTA diffusion TTS in waveform latent space with voice cloning	ZH, EN	mlx-community/LongCat-AudioDiT-1B-bf16
MeloTTS	Lightweight VITS2-based TTS with streaming	EN (more coming)	mlx-community/MeloTTS-English-MLX
MOSS-TTS	8B delay-pattern and 1.7B local-transformer multilingual TTS with voice cloning	31 languages	OpenMOSS-Team/MOSS-TTS-v1.5, OpenMOSS-Team/MOSS-TTS, OpenMOSS-Team/MOSS-TTS-Local-Transformer
MOSS-TTS-Nano	Tiny multilingual voice-cloning TTS	20 languages	mlx-community/MOSS-TTS-Nano-100M
Higgs Audio v2	3B Llama-backed TTS with real-time voice cloning	EN, ZH, KO, DE, ES	bf16 (upstream), q8, q6
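
Several rows in the Repo column offer the same model at different precisions (bf16, 8bit, 6bit, 4bit). The sketch below is a back-of-the-envelope estimate of how bit width affects weight storage. It counts only the weights themselves and ignores quantization scales, layers kept at higher precision, and other overhead, so real checkpoints will be somewhat larger. `weight_size_gb` is a hypothetical helper, and the 1.7e9 parameter count is an assumption read off a model name such as Qwen3-TTS-12Hz-1.7B.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare the precisions that appear in the table for a 1.7B-parameter model.
for label, bits in [("bf16", 16), ("8bit", 8), ("6bit", 6), ("4bit", 4)]:
    print(f"{label}: ~{weight_size_gb(1.7e9, bits):.2f} GB")
```

Halving the bit width halves the estimate, which is the main reason the lower-bit variants fit on machines with less unified memory.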

Speech-to-Text (STT)

Model	Description	Languages	Repo
Whisper	OpenAI's robust STT model	99+ languages	mlx-community/whisper-large-v3-turbo-asr-fp16
Distil-Whisper	Distilled fast Whisper variants	EN	distil-whisper/distil-large-v3
Qwen3-ASR	Alibaba's multilingual ASR	ZH, EN, JA, KO, + more	mlx-community/Qwen3-ASR-1.7B-8bit
Mega-ASR	Routed Qwen3-ASR with automatic clean/base vs degraded/LoRA switching	EN (fixtures), multilingual Qwen3-ASR backbone	README
Qwen3-ForcedAligner	Word-level audio alignment	ZH, EN, JA, KO, + more	mlx-community/Qwen3-ForcedAligner-0.6B-8bit
Parakeet	NVIDIA's accurate STT	EN (v2), 25 EU languages (v3)	mlx-community/parakeet-tdt-0.6b-v3
Nemotron 3.5 ASR (streaming)	NVIDIA's cache-aware streaming FastConformer-RNNT with language-ID prompting	40 language-locales	mlx-community/nemotron-3.5-asr-streaming-0.6b · README
Voxtral	Mistral's speech model	Multiple	mlx-community/Voxtral-Mini-3B-2507-bf16
Voxtral Realtime	Mistral's 4B streaming STT	Multiple	4bit, fp16
VibeVoice-ASR	Microsoft's 9B ASR with diarization & timestamps	Multiple	mlx-community/VibeVoice-ASR-bf16
Canary	NVIDIA's multilingual ASR with translation	25 EU + RU, UK	README
Moonshine	Useful Sensors' lightweight ASR	EN	README
MMS	Meta's massively multilingual ASR with adapters	1000+	README
Granite Speech	IBM's ASR + speech translation	EN, FR, DE, ES, PT, JA	README
Qwen2-Audio	Alibaba's multimodal audio understanding (ASR, captioning, emotion, translation)	Multiple	mlx-community/Qwen2-Audio-7B-Instruct-4bit

Voice Activity Detection / Speaker Diarization (VAD)

Model	Description	Languages	Repo
Silero VAD	Lightweight speech/non-speech detection with streaming state	Language-agnostic	mlx-community/silero-vad
Sortformer v1	NVIDIA's end-to-end speaker diarization (up to 4 speakers)	Language-agnostic	mlx-community/diar_sortformer_4spk-v1-fp32
Sortformer v2.1	NVIDIA's streaming speaker diarization with AOSC compression	Language-agnostic	mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32

See the model READMEs for API details, streaming examples, and conversion steps.
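
For orientation, a voice activity detector typically emits one speech probability per audio frame, which is then turned into time segments. The sketch below shows that post-processing step in plain Python. It is a generic illustration, not the Silero VAD or Sortformer API from this repository; `frames_to_segments`, the 0.5 threshold, and the 32 ms frame duration are all assumptions.

```python
def frames_to_segments(probabilities, threshold=0.5, frame_seconds=0.032):
    """Group consecutive frames at or above `threshold` into (start, end) times."""
    segments = []
    start = None
    for index, probability in enumerate(probabilities):
        is_speech = probability >= threshold
        if is_speech and start is None:
            start = index  # a speech run begins
        elif not is_speech and start is not None:
            segments.append((start * frame_seconds, index * frame_seconds))
            start = None
    if start is not None:  # speech ran to the end of the input
        segments.append((start * frame_seconds, len(probabilities) * frame_seconds))
    return segments
```

Real pipelines usually add smoothing, such as minimum speech and silence durations, on top of this thresholding.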

Speech-to-Speech (STS)

Model	Description	Use Case	Repo
SAM-Audio	Text-guided source separation	Extract specific sounds	mlx-community/sam-audio-large
Liquid2.5-Audio*	Speech-to-Speech, Text-to-Speech and Speech-to-Text	Speech interactions	mlx-community/LFM2.5-Audio-1.5B-8bit
MossFormer2 SE	Speech enhancement	Noise removal	starkdmi/MossFormer2_SE_48K_MLX
DeepFilterNet (1/2/3)	Speech enhancement	Noise suppression	mlx-community/DeepFilterNet-mlx

Model Examples

Qwen3-TTS

Alibaba's state-of-the-art multilingual TTS with voice cloning, emotion control, and voice design capabilities.

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

audio = results[0].audio  # mx.array

See the Qwen3-TTS README for voice cloning, CustomVoice, VoiceDesign, and all available models.

OmniVoice

OmniVoice is a zero-shot multilingual TTS model for 646+ languages with voice cloning, batch generation, pronunciation controls, and nonverbal tags such as [laughter] and [sigh]. It uses a bidirectional Qwen3 backbone with iterative masked generation and a Hig

Extension points: exported contracts (how you extend this code)

LayoutWrapperProps (Interface)

(no doc)

mlx_audio/ui/components/layout-wrapper.tsx

VoiceLibraryProps (Interface)

(no doc)

mlx_audio/ui/components/voice-library.tsx

AudioOrbProps (Interface)

(no doc)

mlx_audio/ui/components/audio-orb.tsx

VoiceSelectionProps (Interface)

(no doc)

mlx_audio/ui/components/voice-selection.tsx

NavbarProps (Interface)

(no doc)

mlx_audio/ui/components/navbar.tsx

Core symbols: most depended-on inside this repo

get

called by 906

mlx_audio/sts/voice_pipeline.py

append

called by 643

mlx_audio/tts/models/fish_qwen3_omni/prompt.py

items

called by 406

mlx_audio/sts/models/lfm_audio/processor.py

eval

called by 221

mlx_audio/tts/models/dia/dia.py

split

called by 214

mlx_audio/sts/models/mel_roformer/model.py

append

called by 201

mlx_audio/stt/models/voxtral_realtime/streaming.py

pad

called by 189

mlx_audio/stt/models/wav2vec/feature_extractor.py

append

called by 147

mlx_audio/sts/voice_pipeline.py

Shape

Method 5,950

Class 1,983

Function 1,477

Route 43

Interface 10

Languages

Python 99%

TypeScript 1%

Modules by API surface

mlx_audio/tts/tests/test_models.py · 659 symbols

mlx_audio/stt/tests/test_models.py · 128 symbols

mlx_audio/tts/models/bailingmm/bailingmm.py · 123 symbols

mlx_audio/codec/models/fish_s1_dac/fish_s1_dac.py · 112 symbols

mlx_audio/tts/models/qwen3_tts/speech_tokenizer.py · 97 symbols

mlx_audio/sts/voice_pipeline.py · 93 symbols

mlx_audio/server.py · 89 symbols

mlx_audio/sts/tests/test_voice_pipeline.py · 81 symbols

mlx_audio/codec/models/dacvae/codec.py · 78 symbols

mlx_audio/stt/models/cohere_asr/cohere_asr.py · 74 symbols

mlx_audio/vad/models/sortformer/sortformer.py · 73 symbols

mlx_audio/stt/models/vibevoice_asr/tests/test_vibevoice_asr.py · 72 symbols

Dependencies: from manifests, versioned

@react-three/drei 9.105.6 · 1×

@react-three/fiber 8.16.8 · 1×

@types/bun latest · 1×

@types/fluent-ffmpeg 2.1.27 · 1×

@types/node 20 · 1×

@types/react 18 · 1×

@types/react-dom 18 · 1×

D 1.0.0 · 1×

class-variance-authority 0.7.1 · 1×

clsx 2.1.1 · 1×

eslint 8 · 1×

eslint-config-next 14.2.16 · 1×

For agents

$ claude mcp add mlx-audio \
  -- python -m otcore.mcp_server <graph>
