
github.com/Blaizzy/mlx-audio @v0.4.4 sqlite

9,463 symbols · 28,993 edges · 693 files · 2,589 documented (27%)
README

MLX-Audio

PyPI version Python License: MIT GitHub stars

MLX-Audio is an audio processing library built on Apple's MLX framework. It provides fast, efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.

Features

  • Fast inference optimized for Apple Silicon (M series chips)
  • Multiple model architectures for TTS, STT, and STS
  • Multilingual support across models
  • Voice customization and cloning capabilities
  • Adjustable speech speed control
  • Interactive web interface with 3D audio visualization
  • OpenAI-compatible REST API
  • Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
  • Swift package for iOS/macOS integration
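
As a rough illustration of why the quantization levels listed above matter, the sketch below estimates the storage needed for model weights alone at several bit widths. It is back-of-the-envelope arithmetic, not a measurement of any mlx-audio model: the function name `estimate_weight_bytes` and the 1.7B parameter count are illustrative assumptions, and real quantized checkpoints also store per-group scales and other metadata that this ignores.

```python
def estimate_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate storage for the weights alone, rounded up to whole bytes.

    Ignores quantization scales/biases and any layers kept at higher precision.
    """
    total_bits = num_params * bits_per_weight
    return (total_bits + 7) // 8


if __name__ == "__main__":
    num_params = 1_700_000_000  # illustrative size, not a measured model
    for bits in (16, 8, 6, 4, 3):
        gigabytes = estimate_weight_bytes(num_params, bits) / 1e9
        print(f"{bits:>2}-bit weights: ~{gigabytes:.2f} GB")
```

Halving the bit width roughly halves the weight footprint, which is the main reason lower-bit variants fit on machines with less unified memory.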

Installation

Using pip

pip install mlx-audio

Using uv to install only the command line tools

Latest release from PyPI:

uv tool install --force mlx-audio --prerelease=allow

Latest code from GitHub:

uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow

For development or web interface:

git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev, server]"

Quick Start

Command Line

# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello, world!' --voice Chelsie

# With a different voice and language hint
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Welcome to MLX-Audio!' --voice Ethan --lang_code English

# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --play

# Save to a specific directory
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --output_path ./my_audio

# Stream audio during generation
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream

# Stream audio during generation and save it to disk
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text 'Hello!' --voice Chelsie --stream --save

# Join multiple generated segments into one file
mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit --text $'Hello!\nHow are you?' --voice Chelsie --join_audio

By default, when generation yields multiple segments, mlx-audio saves numbered files such as audio_000.wav and audio_001.wav. Use --join_audio to save one combined file instead. When using --stream, add --save to write the streamed audio to disk.
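
The difference between numbered output files and a single joined file can be sketched in plain Python. This is a minimal illustration of the naming pattern and of concatenating segments, not mlx-audio's implementation: `segment_filename` and `join_segments` are hypothetical helpers, and only the `audio_000.wav` naming pattern comes from the text above.

```python
from array import array


def segment_filename(index: int, prefix: str = "audio", ext: str = "wav") -> str:
    """Build a zero-padded name such as audio_000.wav for segment `index`."""
    return f"{prefix}_{index:03d}.{ext}"


def join_segments(segments):
    """Concatenate per-segment sample sequences into one float32 buffer."""
    joined = array("f")
    for segment in segments:
        joined.extend(segment)
    return joined


if __name__ == "__main__":
    segments = [[0.0, 0.5], [0.25]]  # stand-ins for two generated segments
    print([segment_filename(i) for i in range(len(segments))])
    print(len(join_segments(segments)), "samples after joining")
```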

Python API

from mlx_audio.tts.utils import load_model

# Load model
model = load_model("mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit")

# Generate speech
for result in model.generate(
    "Hello from MLX-Audio!",
    voice="Chelsie",
    lang_code="English",
):
    print(f"Generated {result.audio.shape[0]} samples")
    # result.audio contains the waveform as mx.array
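
If you want to save a generated waveform without extra dependencies, one option is the standard-library `wave` module. The sketch below is an assumption-laden illustration, not part of mlx-audio: `write_wav` is a hypothetical helper that expects a plain sequence of mono float samples in the range [-1.0, 1.0] and a sample rate you supply yourself, since the model's output rate is not stated here.

```python
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM.

    Returns the clip duration in seconds.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range before scaling to the int16 range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(samples) / sample_rate
```

With a model result, a call along the lines of `write_wav("hello.wav", result.audio.tolist(), sample_rate)` may work, assuming the array is one-dimensional and you know the model's output rate. Neither assumption is confirmed by this README.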

Supported Models

Text-to-Speech (TTS)

Model | Description | Languages | Repo
Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | bf16, 8bit, 6bit, 4bit
KittenTTS | Compact KittenTTS 0.8 models for edge-friendly TTS | EN | nano, micro, mini, collection
Qwen3-TTS | Alibaba's multilingual TTS with voice design | ZH, EN, JA, KO, + more | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16
Higgs Audio v3 | 4B conversational TTS with voice cloning and inline control tokens | 100 languages | bosonai/higgs-audio-v3-tts-4b
OmniVoice | Zero-shot multilingual TTS with voice cloning, batch generation, and nonverbal tags | 646+ languages | mlx-community/OmniVoice-bf16
CSM / MisoTTS | Sesame-style conversational speech models with voice cloning | EN | mlx-community/csm-1b, MisoTTS bf16, MisoTTS 8bit
Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-fp16
OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-1.0-0.6B-fp16
Spark | SparkTTS model | EN, ZH | mlx-community/Spark-TTS-0.5B-bf16
Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/chatterbox-fp16
Soprano | High-quality TTS | EN | mlx-community/Soprano-1.1-80M-bf16
Ming Omni TTS (BailingMM) | Multimodal generation with voice cloning, style control, and speech/music/event generation | EN, ZH | mlx-community/Ming-omni-tts-16.8B-A3B-bf16
Ming Omni TTS (Dense) | Lightweight dense Ming Omni variant for voice cloning and style control | EN, ZH | mlx-community/Ming-omni-tts-0.5B-bf16
KugelAudio | SOTA 7B AR+Diffusion TTS for European languages | EN, DE, FR, ES, IT, PT, NL, PL, RU, UK, + 14 more | kugelaudio/kugelaudio-0-open
Voxtral TTS | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | mlx-community/Voxtral-4B-TTS-2603-mlx-bf16
LongCat-AudioDiT | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | mlx-community/LongCat-AudioDiT-1B-bf16
MeloTTS | Lightweight VITS2-based TTS with streaming | EN (more coming) | mlx-community/MeloTTS-English-MLX
MOSS-TTS | 8B delay-pattern and 1.7B local-transformer multilingual TTS with voice cloning | 31 languages | OpenMOSS-Team/MOSS-TTS-v1.5, OpenMOSS-Team/MOSS-TTS, OpenMOSS-Team/MOSS-TTS-Local-Transformer
MOSS-TTS-Nano | Tiny multilingual voice-cloning TTS | 20 languages | mlx-community/MOSS-TTS-Nano-100M
Higgs Audio v2 | 3B Llama-backed TTS with real-time voice cloning | EN, ZH, KO, DE, ES | bf16 (upstream), q8, q6

Speech-to-Text (STT)

Model | Description | Languages | Repo
Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-turbo-asr-fp16
Distil-Whisper | Distilled fast Whisper variants | EN | distil-whisper/distil-large-v3
Qwen3-ASR | Alibaba's multilingual ASR | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ASR-1.7B-8bit
Mega-ASR | Routed Qwen3-ASR with automatic clean/base vs degraded/LoRA switching | EN (fixtures), multilingual Qwen3-ASR backbone | README
Qwen3-ForcedAligner | Word-level audio alignment | ZH, EN, JA, KO, + more | mlx-community/Qwen3-ForcedAligner-0.6B-8bit
Parakeet | NVIDIA's accurate STT | EN (v2), 25 EU languages (v3) | mlx-community/parakeet-tdt-0.6b-v3
Nemotron 3.5 ASR (streaming) | NVIDIA's cache-aware streaming FastConformer-RNNT with language-ID prompting | 40 language-locales | mlx-community/nemotron-3.5-asr-streaming-0.6b · README
Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16
Voxtral Realtime | Mistral's 4B streaming STT | Multiple | 4bit, fp16
VibeVoice-ASR | Microsoft's 9B ASR with diarization & timestamps | Multiple | mlx-community/VibeVoice-ASR-bf16
Canary | NVIDIA's multilingual ASR with translation | 25 EU + RU, UK | README
Moonshine | Useful Sensors' lightweight ASR | EN | README
MMS | Meta's massively multilingual ASR with adapters | 1000+ | README
Granite Speech | IBM's ASR + speech translation | EN, FR, DE, ES, PT, JA | README
Qwen2-Audio | Alibaba's multimodal audio understanding (ASR, captioning, emotion, translation) | Multiple | mlx-community/Qwen2-Audio-7B-Instruct-4bit

Voice Activity Detection / Speaker Diarization (VAD)

Model | Description | Languages | Repo
Silero VAD | Lightweight speech/non-speech detection with streaming state | Language-agnostic | mlx-community/silero-vad
Sortformer v1 | NVIDIA's end-to-end speaker diarization (up to 4 speakers) | Language-agnostic | mlx-community/diar_sortformer_4spk-v1-fp32
Sortformer v2.1 | NVIDIA's streaming speaker diarization with AOSC compression | Language-agnostic | mlx-community/diar_streaming_sortformer_4spk-v2.1-fp32

See the model READMEs for API details, streaming examples, and conversion steps.
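
To give a feel for what speech/non-speech detection produces downstream, the sketch below turns a list of per-frame speech probabilities into time segments with a simple threshold. It is a generic post-processing illustration, not the Silero VAD algorithm or the mlx-audio API: `frames_to_segments`, the frame duration, and the 0.5 threshold are all assumptions made for the example.

```python
def frames_to_segments(probs, frame_seconds, threshold=0.5):
    """Group consecutive frames with probability >= threshold into segments.

    Returns a list of (start_seconds, end_seconds) tuples.
    """
    segments = []
    start = None  # index of the first frame in the current speech run
    for i, prob in enumerate(probs):
        if prob >= threshold and start is None:
            start = i
        elif prob < threshold and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # Speech continued through the final frame.
        segments.append((start * frame_seconds, len(probs) * frame_seconds))
    return segments


if __name__ == "__main__":
    probabilities = [0.1, 0.9, 0.8, 0.2, 0.7]  # made-up frame scores
    print(frames_to_segments(probabilities, frame_seconds=0.5))
```

Production pipelines usually add smoothing, such as minimum segment lengths or separate on/off thresholds, which this sketch leaves out.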

Speech-to-Speech (STS)

Model | Description | Use Case | Repo
SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large
Liquid2.5-Audio* | Speech-to-Speech, Text-to-Speech and Speech-to-Text | Speech interactions | mlx-community/LFM2.5-Audio-1.5B-8bit
MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX
DeepFilterNet (1/2/3) | Speech enhancement | Noise suppression | mlx-community/DeepFilterNet-mlx

Model Examples

Qwen3-TTS

Alibaba's state-of-the-art multilingual TTS with voice cloning, emotion control, and voice design capabilities.

from mlx_audio.tts.utils import load_model

model = load_model("mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16")
results = list(model.generate(
    text="Hello, welcome to MLX-Audio!",
    voice="Chelsie",
    language="English",
))

audio = results[0].audio  # mx.array

See the Qwen3-TTS README for voice cloning, CustomVoice, VoiceDesign, and all available models.

OmniVoice

OmniVoice is a zero-shot multilingual TTS model for 646+ languages with voice cloning, batch generation, pronunciation controls, and nonverbal tags such as [laughter] and [sigh]. It uses a bidirectional Qwen3 backbone with iterative masked generation and a Hig

Extension points: exported contracts (how you extend this code)

LayoutWrapperProps (Interface) · no doc · mlx_audio/ui/components/layout-wrapper.tsx
VoiceLibraryProps (Interface) · no doc · mlx_audio/ui/components/voice-library.tsx
AudioOrbProps (Interface) · no doc · mlx_audio/ui/components/audio-orb.tsx
VoiceSelectionProps (Interface) · no doc · mlx_audio/ui/components/voice-selection.tsx
NavbarProps (Interface) · no doc · mlx_audio/ui/components/navbar.tsx

Core symbols: most depended-on inside this repo

get · called by 906 · mlx_audio/sts/voice_pipeline.py
append · called by 643 · mlx_audio/tts/models/fish_qwen3_omni/prompt.py
items · called by 406 · mlx_audio/sts/models/lfm_audio/processor.py
eval · called by 221 · mlx_audio/tts/models/dia/dia.py
split · called by 214 · mlx_audio/sts/models/mel_roformer/model.py
append · called by 201 · mlx_audio/stt/models/voxtral_realtime/streaming.py
pad · called by 189 · mlx_audio/stt/models/wav2vec/feature_extractor.py
append · called by 147 · mlx_audio/sts/voice_pipeline.py

Shape

Method 5,950
Class 1,983
Function 1,477
Route 43
Interface 10

Languages

Python 99%
TypeScript 1%

Modules by API surface

mlx_audio/tts/tests/test_models.py · 659 symbols
mlx_audio/stt/tests/test_models.py · 128 symbols
mlx_audio/tts/models/bailingmm/bailingmm.py · 123 symbols
mlx_audio/codec/models/fish_s1_dac/fish_s1_dac.py · 112 symbols
mlx_audio/tts/models/qwen3_tts/speech_tokenizer.py · 97 symbols
mlx_audio/sts/voice_pipeline.py · 93 symbols
mlx_audio/server.py · 89 symbols
mlx_audio/sts/tests/test_voice_pipeline.py · 81 symbols
mlx_audio/codec/models/dacvae/codec.py · 78 symbols
mlx_audio/stt/models/cohere_asr/cohere_asr.py · 74 symbols
mlx_audio/vad/models/sortformer/sortformer.py · 73 symbols
mlx_audio/stt/models/vibevoice_asr/tests/test_vibevoice_asr.py · 72 symbols

Dependencies from manifests, versioned

@react-three/drei 9.105.6 · 1×
@react-three/fiber 8.16.8 · 1×
@types/bun latest · 1×
@types/fluent-ffmpeg 2.1.27 · 1×
@types/node 20 · 1×
@types/react 18 · 1×
@types/react-dom 18 · 1×
D1.0.0 · 1×
class-variance-authority 0.7.1 · 1×
clsx 2.1.1 · 1×
eslint 8 · 1×
eslint-config-next 14.2.16 · 1×

For agents

$ claude mcp add mlx-audio \
  -- python -m otcore.mcp_server <graph>
