An audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
pip install mlx-audio
Latest release from PyPI:
uv tool install --force mlx-audio --prerelease=allow
Latest code from GitHub:
uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev, server]"
# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello, world!' --voice Chelsie
# With a different voice and language hint
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Welcome to MLX-Audio!' --voice Ethan --lang_code English
# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --play
# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --output_path ./my_audio
# Stream audio during generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream
# Stream audio during generation and save it to disk
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream --save
# Join multiple generated segments into one file
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text $'Hello!\nHow are you?' --voice Chelsie --join_audio
By default, when generation yields multiple segments, mlx-audio saves numbered files such as audio_000.wav and audio_001.wav. Use --join_audio to save one combined file instead. When using --stream, add --save to write the streamed audio to disk.
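When calling the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown in the commands above (`--model`, `--text`, `--voice`, `--join_audio`, `--stream`, `--save`). `build_tts_command` is a hypothetical helper written for this illustration, not part of mlx-audio, and it does not check which flag combinations the CLI accepts.

```python
import shlex

def build_tts_command(model, text, voice, *, join_audio=False, stream=False, save=False):
    """Return the argv list for one mlx_audio.tts.generate invocation.

    Hypothetical helper: it only arranges flags that appear in the
    commands above and makes no further assumptions about the CLI.
    """
    cmd = ["mlx_audio.tts.generate", "--model", model, "--text", text, "--voice", voice]
    if join_audio:
        cmd.append("--join_audio")  # one combined file instead of numbered segments
    if stream:
        cmd.append("--stream")      # play audio while it is being generated
    if save:
        cmd.append("--save")        # with --stream: also write the audio to disk
    return cmd

cmd = build_tts_command(
    "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
    "Hello!\nHow are you?",
    "Chelsie",
    join_audio=True,
)

# Print a shell-quoted version for inspection.
print(shlex.join(cmd))

# On a machine with mlx-audio installed, the command could be run with:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with multi-line text such as the two-sentence input used here.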
from mlx_audio.tts.utils import load_model
# Load model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")
# Generate speech
for result in model.generate(
    "Hello from MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
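Since each result reports its length through `result.audio.shape[0]`, the sample counts can be summed to estimate the total duration of a multi-segment generation. This is a minimal sketch: `total_samples` and `duration_seconds` are hypothetical helpers, the 24000 Hz rate is an assumed value for illustration (check the rate of the model you load), and the stand-in objects exist only so the snippet runs without MLX.

```python
from types import SimpleNamespace

def total_samples(results):
    """Sum the sample counts of all generated segments."""
    return sum(int(r.audio.shape[0]) for r in results)

def duration_seconds(results, sample_rate):
    """Convert the combined sample count to seconds at the given sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return total_samples(results) / sample_rate

# Stand-in results so the sketch runs anywhere; real ones come from
# list(model.generate(...)) and expose the same .audio.shape attribute.
fake_results = [
    SimpleNamespace(audio=SimpleNamespace(shape=(24000,))),
    SimpleNamespace(audio=SimpleNamespace(shape=(12000,))),
]

# 24000 Hz is an assumption for this example, not a documented value.
print(duration_seconds(fake_results, sample_rate=24000))  # 1.5
```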
| Model | Description | Languages | Repo |
|---|---|---|---|
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | bf16, 8bit, 6bit, 4bit |
| KittenTTS | Compact KittenTTS 0.8 models for edge-friendly TTS | EN | nano, micro, mini, collection |
| Qwen3-TTS | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 |
| Higgs Audio v3 | 4B conversational TTS with voice cloning and inline control tokens | 100 languages | bosonai/higgs-audio-v3-tts-4b |
| OmniVoice | Zero-shot multilingual TTS with voice cloning, batch generation, and nonverbal tags | 646+ languages | mlx-community/OmniVoice-bf16 |
| CSM / MisoTTS | Sesame-style conversational speech models with voice cloning | EN | mlx-community/csm-1b, MisoTTS bf16, MisoTTS 8bit |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-fp16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-1.0-0.6B-fp16 |
| Spark | SparkTTS model | EN, ZH | mlx-community/Spark-TTS-0.5B-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/chatterbox-fp16 |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-1.1-80M-bf16 |
| Ming Omni TTS (BailingMM) | Multimodal generation with voice cloning, style control, and speech/music/event generation | EN, ZH | mlx-community/Ming-omni-tts-16.8B-A3B-bf16 |
| Ming Omni TTS (Dense) | Lightweight dense Ming Omni variant for voice cloning and style control | EN, ZH | mlx-community/Ming-omni-tts-0.5B-bf16 |
| KugelAudio | SOTA 7B AR+Diffusion TTS for European languages | EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, + 14 more | kugelaudio/kugelaudio-0-open |
| Voxtral TTS | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | mlx-community/Voxtral-4B-TTS-2603-mlx-bf16 |
| LongCat-AudioDiT | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | mlx-community/LongCat-AudioDiT-1B-bf16 |
| MeloTTS | Lightweight VITS2-based TTS with streaming | EN (more coming) | mlx-community/MeloTTS-English-MLX |
| MOSS-TTS | 8B delay-pattern and 1.7B local-transformer multilingual TTS with voice cloning | 31 languages | OpenMOSS-Team/MOSS-TTS-v1.5, OpenMOSS-Team/MOSS-TTS, OpenMOSS-Team/MOSS-TTS-Local-Transformer |
| MOSS-TTS-Nano | Tiny multilingual voice-cloning TTS | 20 languages | mlx-community/MOSS-TTS-Nano-100M |
| Higgs Audio v2 | 3B Llama-backed TTS with real-time voice cloning | EN, ZH, KO, DE, ES | bf16 (upstream), q8, q6 |
| Model | Description | Languages | Repo |
|---|---|---|---|
| Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-turbo-asr-fp16 |
| Distil-Whisper | Distilled fast Whisper variants | EN | distil-whisper/distil-large-v3 |
| Qwen3-ASR | Alibaba's multilingual ASR | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ASR-1.7B-8bit |
| Mega-ASR | Routed Qwen3-ASR with automatic clean/base vs degraded/LoRA switching | EN (fixtures), multilingual Qwen3-ASR backbone | README |
| Qwen3-ForcedAligner | Word-level audio alignment | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ForcedAligner-0.6B-8bit |
| Parakeet | NVIDIA's accurate STT | EN (v2), 25 EU languages (v3) | mlx-community/parakeet-tdt-0.6b-v3 |
| Nemotron 3.5 ASR (streaming) | NVIDIA's cache-aware streaming FastConformer-RNNT with language-ID prompting | 40 language-locales | mlx-community/nemotron-3.5-asr-streaming-0.6b · README |
| Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| Voxtral Realtime | Mistral's 4B streaming STT | Multiple | 4bit, fp16 |
| VibeVoice-ASR | Microsoft's 9B ASR with diarization & timestamps | Multiple | mlx-community/VibeVoice-ASR-bf16 |
| Canary | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | README |
| Moonshine | Useful Sensors' lightweight ASR | EN | README |
| MMS | Meta's massively multilingual ASR with adapters | 1000+ | README |
| Granite Speech | IBM's ASR + speech translation | EN, FR, DE, ES, PT, JA | README |
| Qwen2-Audio | Alibaba's multimodal audio understanding (ASR, captioning, emotion, translation) | Multiple | mlx-community/Qwen2-Audio-7B-Instruct-4bit |
| Model | Description | Languages | Repo |
|---|---|---|---|
| Silero VAD | Lightweight speech/non-speech detection with streaming state | Language-agnostic | mlx-community/silero-vad |
| Sortformer v1 | NVIDIA's end-to-end speaker diarization (up to 4 speakers) | Language-agnostic | mlx-community/diar_sortformer_4spk-v1-fp32 |
| Sortformer v2.1 | NVIDIA's streaming speaker diarization with AOSC compression | Language-agnostic | mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32 |
See the model READMEs for API details, streaming examples, and conversion steps.
| Model | Description | Use Case | Repo |
|---|---|---|---|
| SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large |
| Liquid2.5-Audio* | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | mlx-community/LFM2.5-Audio-1.5B-8bit |
| MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX |
| DeepFilterNet (1/2/3) | Speech enhancement | Noise suppression | mlx-community/DeepFilterNet-mlx |
Alibaba's state-of-the-art multilingual TTS with voice cloning, emotion control, and voice design capabilities.
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))
audio = results[0].audio # mx.array
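To keep a generated waveform outside the CLI workflow, one option is to write it to a WAV file yourself. The sketch below uses only the Python standard library and is not mlx-audio's own saving code. `write_wav_pcm16` is a hypothetical helper, and it assumes a mono waveform of floats in the range [-1.0, 1.0] and a caller-supplied sample rate (24000 Hz here is an illustrative value).

```python
import math
import os
import struct
import tempfile
import wave

def write_wav_pcm16(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Returns the number of frames written.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16 bits
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return len(frames) // 2

# Stand-in waveform: 0.1 s of a 440 Hz tone at an assumed 24 kHz rate.
rate = 24000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 10)]

out_path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")
n_written = write_wav_pcm16(out_path, tone, rate)
print(n_written)  # 2400
```

With a real result, the samples would first need converting to plain Python floats, for example via `audio.tolist()` if the waveform is a one-dimensional `mx.array`. Treat that conversion, the value range, and the sample rate as assumptions to verify against the model you use.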
See the Qwen3-TTS README for voice cloning, CustomVoice, VoiceDesign, and all available models.
OmniVoice is a zero-shot multilingual TTS model for 646+ languages with voice cloning, batch generation, pronunciation controls, and nonverbal tags such as [laughter] and [sigh]. It uses a bidirectional Qwen3 backbone with iterative masked generation and a Hig