github.com/ysharma3501/LuxTTS @main sqlite

508 symbols · 2,060 edges · 36 files · 211 documented (42%)

README

LuxTTS

LuxTTS is a lightweight, ZipVoice-based text-to-speech model designed for high-quality voice cloning and realistic generation at speeds exceeding 150x realtime.

https://github.com/user-attachments/assets/a3b57152-8d97-43ce-bd99-26dc9a145c29

The main features are:

Voice cloning: State-of-the-art voice cloning, on par with models 10x larger.
Clarity: Clear 48 kHz speech generation, unlike most TTS models, which are limited to 24 kHz.
Speed: Reaches 150x realtime on a single GPU and runs faster than realtime on CPUs as well.
Efficiency: Fits within 1 GB of VRAM, so it fits on nearly any local GPU.
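
To make the speed and sample-rate figures concrete, here is a small sketch of the arithmetic behind a realtime factor. The helper names `audio_seconds` and `realtime_factor` are illustrative and not part of LuxTTS; the 150x and 48 kHz values are simply the figures quoted above, not new measurements.

```python
def audio_seconds(num_samples: int, sample_rate: int = 48000) -> float:
    """Duration of a mono clip, given its sample count and sample rate."""
    return num_samples / sample_rate


def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return clip_seconds / wall_seconds


# A 48 kHz clip with 480,000 samples lasts 10 seconds.
clip = audio_seconds(480_000)

# At the quoted 150x realtime, generating that clip would take about 1/15 s.
wall = clip / 150
print(f"{clip:.1f} s of audio in about {wall:.3f} s")
```

To measure your own factor, time a generation call with `time.perf_counter()` and pass the elapsed time as `wall_seconds`.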

Usage

You can try it locally, on Colab, or on Hugging Face Spaces.

Simple installation:

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt

Load model:

from zipvoice.luxvoice import LuxTTS

# load model on GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')

# load model on CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)

# load model on MPS for macs
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')
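
If you would rather not hard-code the device, a minimal sketch like the one below picks among the three device strings shown above. It assumes PyTorch is the backend; `pick_device` and `detect_device` are hypothetical helpers, not part of LuxTTS, and the preference order (CUDA, then MPS, then CPU) is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; use CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```

The returned string can then be passed as the `device` argument when constructing `LuxTTS`.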

Simple inference:

import soundfile as sf
from IPython.display import Audio, display  # import display explicitly so the check below also works outside notebooks

text = "Hey, what's up? I'm feeling really great if you ask me honestly!"

## change this to your reference file path, can be wav/mp3
prompt_audio = 'audio_file.wav'

## encode audio (the first call takes about 10 s while librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)

## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)

## display speech
if display is not None:
  display(Audio(final_wav, rate=48000))

Inference with sampling params:

import soundfile as sf
from IPython.display import Audio, display  # import display explicitly so the check below also works outside notebooks

text = "Hey, what's up? I'm feeling really great if you ask me honestly!"

## change this to your reference file path, can be wav/mp3
prompt_audio = 'audio_file.wav'

rms = 0.01 ## higher values sound louder (around 0.01 is recommended)
t_shift = 0.9 ## sampling param; higher can sound better but worsens WER
num_steps = 4 ## sampling param; higher sounds better but takes longer (3-4 is best for efficiency)
speed = 1.0 ## sampling param; controls the speed of the audio (lower = slower)
return_smooth = False ## sampling param; may sound smoother but less clean
ref_duration = 5 ## lower values can speed up inference; set to 1000 if you hear artifacts

## encode audio (the first call takes about 10 s while librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, duration=ref_duration, rms=rms)

## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=num_steps, t_shift=t_shift, speed=speed, return_smooth=return_smooth)

## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)

## display speech
if display is not None:
  display(Audio(final_wav, rate=48000))
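
When experimenting with several settings, it can help to keep the parameters above in one place. The sketch below is one possible way to do that; `SamplingConfig` is a hypothetical container, not part of LuxTTS, and its field names and defaults simply mirror the variables and keyword arguments used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container; fields mirror the variables in the example above."""
    rms: float = 0.01
    t_shift: float = 0.9
    num_steps: int = 4
    speed: float = 1.0
    return_smooth: bool = False
    ref_duration: float = 5

    def encode_kwargs(self) -> dict:
        """Keyword arguments for lux_tts.encode_prompt."""
        return {"duration": self.ref_duration, "rms": self.rms}

    def generate_kwargs(self) -> dict:
        """Keyword arguments for lux_tts.generate_speech."""
        return {
            "num_steps": self.num_steps,
            "t_shift": self.t_shift,
            "speed": self.speed,
            "return_smooth": self.return_smooth,
        }


# Example preset: defaults plus return_smooth turned on.
smooth = SamplingConfig(return_smooth=True)
print(smooth.encode_kwargs())
print(smooth.generate_kwargs())
```

The dictionaries can be unpacked into the calls shown above, for example `lux_tts.encode_prompt(prompt_audio, **smooth.encode_kwargs())` and `lux_tts.generate_speech(text, encoded_prompt, **smooth.generate_kwargs())`.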

Tips

Use a reference audio file of at least 3 seconds for voice cloning.
Set return_smooth = True if you hear metallic sounds.
Lower t_shift to reduce pronunciation errors at the cost of quality; raise it for better quality at the cost of more errors.
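
The 3-second minimum can be checked before encoding. The sketch below uses only Python's standard `wave` module, so it assumes the reference clip is an uncompressed PCM WAV file (it will not read mp3). The helper names are hypothetical, and the silent test clip is generated only so the example runs on its own.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Length of a PCM WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def long_enough(path: str, minimum_seconds: float = 3.0) -> bool:
    """True if the clip meets the 3-second minimum from the tips above."""
    return wav_duration_seconds(path) >= minimum_seconds


def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono, 16-bit silent clip (used here only as demo input)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "reference.wav")
    write_silence(demo_path, 4)
    print(long_enough(demo_path))  # True: the clip is 4 seconds long
```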

Community

Lux-TTS-Gradio: A Gradio app for using LuxTTS.
OptiSpeech: A clean UI app for using LuxTTS.
LuxTTS-Comfyui: Nodes for using LuxTTS in ComfyUI.
FalAI hosting: A LuxTTS demo hosted by FalAI.
Lux-TTS ONNX: ONNX code for running LuxTTS.

Thanks to all community contributions!

Info

Q: How is this different from ZipVoice?

A: LuxTTS uses the same architecture, but distilled to 4 steps with an improved sampling technique. It also uses a custom 48 kHz vocoder instead of the default 24 kHz version.

Q: Can it be even faster?

A: Yes. It currently runs in float32; float16 should be significantly faster (almost 2x).
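
One contributing factor is that a float16 value occupies half the bytes of a float32 value, so the same weights take half the memory and bandwidth; the actual speedup depends on the hardware. The sketch below illustrates only the storage arithmetic. The parameter count is a made-up round number for illustration, not the real size of LuxTTS, and `weights_mib` is a hypothetical helper.

```python
import struct

# Size in bytes of one value, taken from the standard struct format codes.
BYTES_PER_VALUE = {
    "float32": struct.calcsize("f"),  # single precision
    "float16": struct.calcsize("e"),  # half precision
}


def weights_mib(num_params: int, dtype: str) -> float:
    """Storage needed for num_params values of the given dtype, in MiB."""
    return num_params * BYTES_PER_VALUE[dtype] / 2**20


# Illustrative only: NOT the real LuxTTS parameter count.
HYPOTHETICAL_PARAMS = 100_000_000

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weights_mib(HYPOTHETICAL_PARAMS, dtype):.1f} MiB")
```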

Roadmap

[x] Release model and code
[x] Hugging Face Spaces demo
[x] Release MPS support (thanks to @builtbybasit)
[ ] Release LuxTTS v1.5
[ ] Release code for float16 inference

Acknowledgments

ZipVoice for their excellent code and model.
Vocos for their great vocoder.

Final Notes

The model and code are licensed under the Apache-2.0 license. See LICENSE for details.

Stars/Likes would be appreciated, thank you.

Email: yatharthsharma350@gmail.com

Core symbols (most depended-on inside this repo)

mean · called by 28 · zipvoice/utils/optim.py
load_state_dict · called by 24 · zipvoice/utils/lr_scheduler.py
backward · called by 24 · zipvoice/models/modules/scaling.py
device · called by 23 · zipvoice/utils/feature.py
state_dict · called by 17 · zipvoice/utils/lr_scheduler.py
write_summary · called by 12 · zipvoice/utils/common.py
prepare_input · called by 12 · zipvoice/utils/common.py
texts_to_token_ids · called by 12 · zipvoice/tokenizer/tokenizer.py

Shape

Method 247 · Function 185 · Class 76

Languages

Python 100%

Modules by API surface

zipvoice/models/modules/scaling.py · 106 symbols
zipvoice/models/modules/zipformer.py · 48 symbols
zipvoice/tokenizer/tokenizer.py · 42 symbols
zipvoice/utils/common.py · 34 symbols
zipvoice/utils/diagnostics.py · 29 symbols
zipvoice/utils/optim.py · 17 symbols
zipvoice/utils/lr_scheduler.py · 17 symbols
zipvoice/tokenizer/normalizer.py · 17 symbols
zipvoice/models/modules/solver.py · 12 symbols
zipvoice/bin/train_zipvoice.py · 12 symbols
zipvoice/bin/infer_zipvoice_onnx.py · 12 symbols
zipvoice/utils/infer.py · 11 symbols

Dependencies (from manifests, versioned)

cn2an 1×
huggingface-hub 1×
inflect 1×
jieba 1×
lhotse 1×
librosa 1×
linacodec 1×
numpy 1×
onnxruntime 1×
piper-phonemize 1×
pydub 1×
pypinyin 1×

For agents

$ claude mcp add LuxTTS \
  -- python -m otcore.mcp_server <graph>
