github.com/jamiepine/voicebox @v0.5.0 sqlite

1,997 symbols · 5,671 edges · 360 files · 657 documented (33%)
README

Voicebox

The open-source AI voice studio.

Clone any voice. Generate speech. Dictate into any app. Talk to agents in voices you own.

The full voice I/O stack, running locally on your machine.

voicebox.sh · Docs · Download · Features · API · Troubleshooting

What is Voicebox?

Voicebox is a local-first AI voice studio — a free and open-source alternative to ElevenLabs and WisprFlow in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing.

The two cloud incumbents sit on opposite halves of the voice I/O loop — ElevenLabs on output, WisprFlow on input. Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine.

  • Complete privacy — models, voice data, and captures never leave your machine
  • 7 TTS engines — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
  • Voice cloning and preset voices — zero-shot cloning from a reference sample, or 50+ curated preset voices via Kokoro and Qwen CustomVoice
  • 23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more
  • Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters
  • Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
  • Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters
  • Stories editor — multi-track timeline for conversations, podcasts, and narratives
  • Voice input — global dictation hotkey with push-to-talk and toggle modes, accessibility-verified auto-paste on macOS, in-app mic on every text field, Whisper-based STT
  • Agent voice output — one tool call (voicebox.speak) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you've cloned
  • Voice personalities — attach a free-form persona to any voice profile, then Compose, Rewrite, or Respond via a bundled local LLM — agents can invoke the same modes over MCP
  • API-first — REST API plus a built-in MCP server for integrating voice I/O into your own apps and agents
  • Native performance — built with Tauri (Rust), not Electron
  • Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker

Download

| Platform | Download |
| --- | --- |
| macOS (Apple Silicon) | Download DMG |
| macOS (Intel) | Download DMG |
| Windows | Download MSI |
| Docker | docker compose up |

View all binaries →

Linux — Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.

Having trouble? See the Troubleshooting Guide for common install, generation, model-download, and GPU issues.


Features

Multi-Engine Voice Cloning

Seven TTS engines with different strengths, switchable per-generation:

| Engine | Languages | Strengths |
| --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| Qwen CustomVoice | 10 | 9 curated preset voices with natural-language delivery control — no reference audio required |
| LuxTTS | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | 10 | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment |
| Kokoro | 8 | 50 curated preset voices, tiny 82M model, fast CPU inference |

Emotions & Paralinguistic Tags

Only Chatterbox Turbo interprets paralinguistic tags like [laugh] and [sigh]. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them literally as text.

With Chatterbox Turbo selected, type / in the text input to open the tag inserter and add expressive tags inline with speech:

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

Post-Processing Effects

8 audio effects powered by Spotify's pedalboard library. Apply after generation, preview in real time, build reusable presets.

| Effect | Description |
| --- | --- |
| Pitch Shift | Up or down by up to 12 semitones |
| Reverb | Configurable room size, damping, wet/dry mix |
| Delay | Echo with adjustable time, feedback, and mix |
| Chorus / Flanger | Modulated delay for metallic or lush textures |
| Compressor | Dynamic range compression |
| Gain | Volume adjustment (-40 to +40 dB) |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |

Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
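
To make the ranges in the table concrete, here is a small sketch of the standard conversions behind them: a pitch shift of n semitones scales frequency by 2^(n/12), and a gain of g dB scales amplitude by 10^(g/20). The function names and the clamping behavior are illustrative assumptions, not Voicebox's or pedalboard's API.

```python
# Limits taken from the effects table; everything else here is hypothetical.
MAX_SEMITONES = 12.0
MAX_GAIN_DB = 40.0


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift, clamped to +/-12 semitones."""
    clamped = max(-MAX_SEMITONES, min(MAX_SEMITONES, semitones))
    return 2.0 ** (clamped / 12.0)


def db_to_amplitude(gain_db: float) -> float:
    """Linear amplitude multiplier for a gain in dB, clamped to +/-40 dB."""
    clamped = max(-MAX_GAIN_DB, min(MAX_GAIN_DB, gain_db))
    return 10.0 ** (clamped / 20.0)
```

So the full +12 semitone shift doubles every frequency (one octave up), and the full +40 dB gain multiplies amplitude by 100.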

Unlimited Generation Length

Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

  • Configurable auto-chunking limit (100–5,000 chars)
  • Crossfade slider (0–200ms) for smooth transitions
  • Max text length: 50,000 characters
  • Smart splitting respects abbreviations, CJK punctuation, and [tags]
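
The chunk-then-crossfade idea can be sketched as follows. This is a simplified illustration, not Voicebox's implementation: the function names, the default limit, and the linear fade are assumptions, and the naive regex ignores the abbreviation, CJK, and [tag] handling described above.

```python
import re


def split_into_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole in this sketch.
    """
    # Naive boundary: sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample lists, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises across the overlap
        blended.append(a[len(a) - overlap + i] * (1.0 - w) + b[i] * w)
    return a[:-overlap] + blended + b[overlap:]
```

The overlap is given in samples, so a crossfade setting in milliseconds would be converted with `int(sample_rate * ms / 1000)`. Each blend shortens the joined audio by `overlap` samples, because the two chunks share that region.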

Generation Versions

Every generation supports multiple versions with provenance tracking:

  • Original — clean TTS output, always preserved
  • Effects versions — apply different effects chains from any source version
  • Takes — regenerate with a new seed for variation
  • Source tracking — each version records its lineage
  • Favorites — star generations for quick access
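
One way to picture source tracking is a parent pointer on each version. The sketch below is a hypothetical data model, not the schema Voicebox uses: the `Version` fields and the `lineage` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    id: str
    kind: str                        # e.g. "original", "effects", or "take"
    source_id: Optional[str] = None  # version this one was derived from


def lineage(versions: dict[str, Version], version_id: str) -> list[str]:
    """Walk source_id links from a version back to the one it started from."""
    chain: list[str] = []
    current: Optional[str] = version_id
    while current is not None:
        chain.append(current)
        current = versions[current].source_id
    return chain
```

With an original `v1`, an effects version `v2` built from it, and a second effects version `v3` built from `v2`, `lineage(versions, "v3")` yields `["v3", "v2", "v1"]`.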

Async Generation Queue

Generation is non-blocking. Submit and immediately start typing the next one.

  • Serial execution queue prevents GPU contention
  • Real-time SSE status streaming
  • Failed generations can be retried
  • Stale generations from crashes auto-recover on startup
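
The pattern described here, non-blocking submission with one job running at a time, can be sketched with asyncio. The class and status names below are hypothetical; the `events` list stands in for the status updates that a real server would stream to clients over SSE.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable

Job = Callable[[], Awaitable[str]]


class SerialQueue:
    """Runs submitted jobs one at a time, in submission order."""

    def __init__(self) -> None:
        self._pending: deque[tuple[str, Job]] = deque()
        self.events: list[tuple[str, str]] = []  # (job name, status) log

    def submit(self, name: str, job: Job) -> None:
        # Returns immediately; the caller never waits for generation.
        self.events.append((name, "queued"))
        self._pending.append((name, job))

    async def run_until_empty(self) -> None:
        while self._pending:
            name, job = self._pending.popleft()
            self.events.append((name, "running"))
            try:
                await job()  # only one job is ever awaited at a time
                self.events.append((name, "done"))
            except Exception:
                # Record the failure so the job can be retried later.
                self.events.append((name, "failed"))


async def demo() -> list[tuple[str, str]]:
    queue = SerialQueue()

    async def ok() -> str:
        await asyncio.sleep(0)
        return "audio"

    async def boom() -> str:
        raise RuntimeError("simulated failure")

    queue.submit("a", ok)
    queue.submit("b", boom)
    await queue.run_until_empty()
    return queue.events
```

Because the worker awaits each job before taking the next, two generations never compete for the GPU, and a failure in one job does not stop the jobs queued behind it.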

Voice Profile Management

  • Create profiles from audio files or record directly in-app
  • Import/export profiles to share or back up
  • Multi-sample support for higher quality cloning
  • Per-profile default effects chains
  • Organize with descriptions and language tags

Stories Editor

Multi-voice timeline editor for conversations, podcasts, and narratives.

  • Multi-track composition with drag-and-drop
  • Inline audio trimming and splitting
  • Auto-playback with synchronized playhead
  • Version pinning per track clip

Global Dictation & Voice Input

The other half of the voice I/O loop. Hold a hotkey anywhere on your system, speak, release — on macOS the transcript pastes straight into the focused text field. Or hit the mic on any Voicebox text input and dictate directly into the app.

  • Configurable chord bindings — hold-to-speak and tap-to-toggle chords, each rebindable in the in-app chord picker. Holding push-to-talk and tapping Space mid-hold upgrades into a toggle session without a gap in audio
  • Target-aware paste (macOS) — accessibility-verified injection into the focused text field, with atomic clipboard save/restore so your clipboard isn't clobbered
  • First-run permissions UX — in-app gates walk you through the macOS Accessibility and Input Monitoring grants with deep-links to System Settings
  • In-app mic button on every Voicebox text field — generation form, profile descriptions, story titles, anywhere you'd type
  • LLM refinement — optional cleanup of ums, stutters, and false starts before paste
  • On-screen pill — floating overlay surfacing recording, transcribing, refining, and speaking states. Same pill agents use when they speak to you, so there's one mental model for both directions of the loop
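
The push-to-talk to toggle upgrade described in the first bullet is easiest to see as a small state machine. This is a sketch under assumptions: the state and event names are invented for illustration and do not come from the Voicebox codebase.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    PUSH_TO_TALK = "push_to_talk"  # recording while the chord is held
    TOGGLE = "toggle"              # recording until the toggle chord is tapped


def next_state(state: State, event: str) -> State:
    """Hypothetical events: hold_down, hold_up, space_tap, toggle_tap."""
    if state is State.IDLE:
        if event == "hold_down":
            return State.PUSH_TO_TALK
        if event == "toggle_tap":
            return State.TOGGLE
    elif state is State.PUSH_TO_TALK:
        if event == "hold_up":
            return State.IDLE    # releasing the chord ends the recording
        if event == "space_tap":
            return State.TOGGLE  # upgrade; recording continues uninterrupted
    elif state is State.TOGGLE:
        if event == "toggle_tap":
            return State.IDLE    # a second tap ends the session
    return state  # anything else is ignored, e.g. hold_up while in TOGGLE
```

The key property is that the `space_tap` transition goes from one recording state to another without passing through `IDLE`, which is what allows the upgrade to happen without a gap in audio.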

Speech-to-Text

Voicebox runs OpenAI Whisper for transcription — the same model that backs dictation, the Captures tab, and the /transcribe API. It runs on MLX (Apple Silicon) or PyTorch (CUDA / ROCm / DirectML / CPU), depending on your platform.

| Size | Notes |
| --- | --- |
| Base / Small / Medium / Large | Standard Whisper quality ladder |
| Turbo | ~8x faster than Whisper Large, minimal quality loss |

More engines (Parakeet v3, Qwen3-ASR) are planned — see Roadmap.

Captures

Every dictation, in-app recording, and uploaded audio file lands in the Captures tab — original audio paired with transcript, always preserved.

  • Replay, re-transcribe, refine — rerun STT with any Whisper size, or re-run the raw transcript through the local LLM with different flags (filler cleanup, self-correction removal, technical-term preservation)
  • Edit inline — tweak the transcript and save on blur
  • Play as voice profile — turn any capture into speech with a cloned voice, one click
  • Promote to voice sample — use a capture's audio + transcript as a reference sample on any voice profile
  • Local capture storage — original audio and transcript stay in your Voicebox data directory, with a folder shortcut in Settings

Agent Voice Output

Every agent gets a voice. One tool call and any MCP-aware agent can speak to you in a voice you've cloned — task completions, questions, notifications. The same pill that surfaces during dictation surfaces during agent speech, so you always see what's coming out of your machine.

// In any MCP-aware agent:
await voicebox.speak({
  text: "Deploy complete.",
  profile: "Morgan",
});

Also exposed as POST /speak for anything that doesn't speak MCP — ACP, A2A, shell scripts, custom harnesses.
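
For a non-MCP caller, a request to POST /speak might be built like this. It is a sketch under assumptions: the base URL is a placeholder, and the JSON body is assumed to mirror the `text` and `profile` arguments of the MCP example above. Check the API docs for the actual schema.

```python
import json
import urllib.request

# Assumption: placeholder address; use whatever your Voicebox server listens on.
BASE_URL = "http://localhost:8000"


def build_speak_request(
    text: str, profile: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the /speak endpoint."""
    # Assumption: the body fields match the MCP tool's arguments.
    body = json.dumps({"text": text, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(...)` would send it; building the request separately keeps the example runnable without a server.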

  • Bidirectional pill — recording, transcribing, refining, and speaking are all states of the same OS-level overlay, so dictation and agent speech share one surface
  • Per-agent voice binding — in Settings → MCP, pin Claude Code to Morgan and Cursor to Scarlett so you can tell which agent is talking without looking. Each client's last_seen_at timestamp confirms the install actually took
  • Always visible — no silent background TTS; every agent-initiated speak surfaces the pill with the voice profile name for the full duration
  • HTTP + stdio transports — install as a URL in Claude Code / Cu

Extension points exported contracts — how you extend this code

PlatformUpdater (Interface)
(no doc) [4 implementers]
app/src/platform/types.ts
DemoStep (Interface)
Each entry is one cycle of the demo animation: select a profile → type text → generate → play audio.
landing/src/components/ControlUI.tsx
PlatformLifecycle (Interface)
(no doc) [4 implementers]
app/src/platform/types.ts
Generation (Interface)
History rows pre-populated on first load. Oldest first visually (array index 0 = top row).
landing/src/components/ControlUI.tsx
Register (Interface)
(no doc)
app/src/router.tsx
VoiceProfile (Interface)
(no doc)
landing/src/components/ControlUI.tsx
Window (Interface)
(no doc)
app/src/global.d.ts
LandingAudioPlayerProps (Interface)
(no doc)
landing/src/components/LandingAudioPlayer.tsx

Core symbols most depended-on inside this repo

cn
called by 145
app/src/lib/utils/cn.ts
toast
called by 134
app/src/components/ui/use-toast.ts
request
called by 69
app/src/lib/api/client.ts
catch
called by 67
app/src/lib/api/core/CancelablePromise.ts
then
called by 38
app/src/lib/api/core/CancelablePromise.ts
usePlatform
called by 29
app/src/platform/PlatformContext.tsx
_add_column
called by 27
backend/database/migrations.py
useToast
called by 23
app/src/components/ui/use-toast.ts

Shape

Function 1,134
Method 421
Class 169
Interface 157
Route 116

Languages

Python 53%
TypeScript 47%

Modules by API surface

app/src/lib/api/client.ts · 102 symbols
backend/models.py · 84 symbols
app/src/lib/api/types.ts · 60 symbols
backend/backends/__init__.py · 42 symbols
backend/routes/profiles.py · 38 symbols
backend/pyi_rth_torch_compiler_disable.py · 38 symbols
backend/routes/stories.py · 32 symbols
app/src/platform/types.ts · 31 symbols
backend/routes/models.py · 27 symbols
backend/backends/pytorch_backend.py · 24 symbols
app/src/lib/api/services/DefaultService.ts · 24 symbols
backend/tests/test_all_models_e2e.py · 23 symbols

Dependencies from manifests, versioned

@biomejs/biome 2.3.12 · 1×
@dnd-kit/core 6.3.1 · 1×
@dnd-kit/sortable 10.0.0 · 1×
@dnd-kit/utilities 3.2.2 · 1×
@fontsource/space-grotesk 5.2.10 · 1×
@hookform/resolvers 3.9.0 · 1×
@icons-pack/react-simple-icons 13.13.0 · 1×
@radix-ui/react-alert-dialog 1.1.1 · 1×
@radix-ui/react-avatar 1.1.0 · 1×
@radix-ui/react-dialog 1.1.1 · 1×
@radix-ui/react-dropdown-menu 2.1.1 · 1×
@radix-ui/react-label 2.1.0 · 1×

For agents

$ claude mcp add voicebox \
  -- python -m otcore.mcp_server <graph>
