Authors: Dabi Ahn (andabi412@gmail.com), Kyubyong Park (kbpark.linguist@gmail.com)
https://soundcloud.com/andabi/sets/voice-style-transfer-to-kate-winslet-with-deep-neural-networks
What if you could imitate a famous celebrity's voice or sing like a famous singer? This project started with the goal of converting someone's voice to a specific target voice, a task known as voice style transfer. Specifically, we aimed to convert an arbitrary speaker's voice to that of the English actress Kate Winslet. We implemented deep neural networks to achieve this, using more than 2 hours of audiobook sentences read by Kate Winslet as the dataset.

This is a many-to-one voice conversion system. The main significance of this work is that we can generate a target speaker's utterances without parallel data, using only waveforms of the target speaker. (Building such parallel datasets takes a lot of effort.) All this project needs is a number of waveforms of the target speaker's utterances and a small set of paired data from a number of anonymous speakers.

The model architecture consists of two modules:
1. Net1 (phoneme classification) classifies someone's utterances into one of the phoneme classes at every timestep.
   * Phonemes are speaker-independent, while waveforms are speaker-dependent.
2. Net2 (speech synthesis) synthesizes speech of the target speaker from the phonemes.
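The per-timestep classification that Net1 performs can be illustrated with a minimal sketch. It is not the project's code: the phoneme inventory `PHONEMES`, the helper `classify_frames`, and the logit values are hypothetical, and a real network would produce the logits from MFCC frames.

```python
import math

# Hypothetical toy phoneme inventory (an assumption for illustration only).
PHONEMES = ["sil", "aa", "k", "t"]

def softmax(logits):
    """Turn one frame's raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_frames(frame_logits):
    """Return the phoneme distribution and the most likely phoneme per timestep."""
    dists = [softmax(row) for row in frame_logits]
    labels = [PHONEMES[max(range(len(d)), key=d.__getitem__)] for d in dists]
    return dists, labels

# Three timesteps of made-up logits, one score per phoneme class.
frame_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 3.0, 0.5, 0.2],
    [0.1, 0.2, 0.3, 2.5],
]
dists, labels = classify_frames(frame_logits)
print(labels)
```

The distributions in `dists`, rather than the hard labels, correspond to the "phoneme dist." that is passed on to the synthesis stage.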
We applied CBHG (1-D convolution bank + highway network + bidirectional GRU) modules, as introduced in Tacotron. CBHG is known to be good at capturing features from sequential data.
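To make one CBHG component concrete, here is a minimal sketch of a single highway layer in plain Python. It assumes the standard highway formulation (a sigmoid gate that mixes a transformed input with the untouched input). The function names and weights are hypothetical, and the convolution bank and bidirectional GRU are not shown.

```python
import math

def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def highway(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with H = ReLU(dense) and T = sigmoid(dense)."""
    h = [max(0.0, v) for v in dense(x, w_h, b_h)]
    t = [1.0 / (1.0 + math.exp(-v)) for v in dense(x, w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# A strongly negative gate bias closes the gate, so the input is carried through.
carried = highway(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
# A strongly positive gate bias opens the gate, so the ReLU transform dominates.
transformed = highway(x, identity, [0.0, 0.0], zeros, [50.0, 50.0])
print(carried, transformed)
```

The gate lets each layer choose between passing features through unchanged and transforming them, which is why highway layers stack well.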
Net2 contains Net1 as a sub-network.
* Process: net1(wav -> spectrogram -> mfccs -> phoneme dist.) -> spectrogram -> wav
* Net2 synthesizes the target speaker's speech.
* The input/target is a set of the target speaker's utterances.
* Since Net1 is already trained in the previous step, only the remaining part needs to be trained in this step.
* The loss is the reconstruction error between input and target (L2 distance).
* Datasets
  * Target1 (anonymous female): Arctic dataset (public)
  * Target2 (Kate Winslet): over 2 hours of audiobook sentences read by her (private)
* Griffin-Lim reconstruction is used when converting the spectrogram back to a wav.
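The last step above, recovering a waveform from a magnitude spectrogram, can be sketched with a small pure-Python Griffin-Lim loop. This is an illustration under stated assumptions, not the project's implementation: the helper names, the naive DFT, the Hann window, and the toy parameters (`n_fft=16`, `hop=4`) are all chosen here for readability. The `l2_distance` helper is the same kind of reconstruction error the text names as the Net2 loss, used here only to measure how well the estimate matches the target magnitudes.

```python
import cmath
import math
import random

def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

def stft(signal, n_fft, hop):
    """Windowed naive DFT of each frame; returns frames of complex bins."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft)
        ])
    return frames

def istft(frames, n_fft, hop):
    """Least-squares inverse: windowed overlap-add divided by the summed squared window."""
    window = hann(n_fft)
    length = hop * (len(frames) - 1) + n_fft
    out = [0.0] * length
    norm = [0.0] * length
    for i, frame in enumerate(frames):
        for n in range(n_fft):
            sample = sum(
                frame[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft)
            ).real / n_fft
            out[i * hop + n] += sample * window[n]
            norm[i * hop + n] += window[n] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]

def magnitudes(frames):
    """Drop the phase, keeping only the magnitude of every bin."""
    return [[abs(c) for c in frame] for frame in frames]

def l2_distance(mag_a, mag_b):
    """L2 distance between two magnitude spectrograms of equal shape."""
    return math.sqrt(sum(
        (a - b) ** 2 for fa, fb in zip(mag_a, mag_b) for a, b in zip(fa, fb)
    ))

def griffin_lim(target_mag, n_fft, hop, n_iter=30, seed=0):
    """Start from random phase, then alternate: resynthesize, re-analyze, keep target magnitudes."""
    rng = random.Random(seed)
    spec = [[m * cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for m in frame]
            for frame in target_mag]
    for _ in range(n_iter):
        estimate = stft(istft(spec, n_fft, hop), n_fft, hop)
        spec = [
            [m * c / abs(c) if abs(c) > 1e-12 else complex(m) for m, c in zip(mf, cf)]
            for mf, cf in zip(target_mag, estimate)
        ]
    return istft(spec, n_fft, hop)

# Toy check: discard the phase of a short tone, then rebuild a waveform from magnitudes only.
n_fft, hop = 16, 4
tone = [math.sin(2 * math.pi * 3 * n / 16) for n in range(64)]
target_mag = magnitudes(stft(tone, n_fft, hop))
rough = griffin_lim(target_mag, n_fft, hop, n_iter=0)
refined = griffin_lim(target_mag, n_fft, hop, n_iter=30)
err_rough = l2_distance(magnitudes(stft(rough, n_fft, hop)), target_mag)
err_refined = l2_distance(magnitudes(stft(refined, n_fft, hop)), target_mag)
print(len(refined), err_rough, err_refined)
```

Because both runs share the same random seed, `refined` continues the iteration that `rough` starts, so the two errors show how much the phase estimate improves. In practice an FFT-based library routine would replace the naive DFT used here.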
* Run train1.py to train and eval1.py to test Net1.
* Run train2.py to train and eval2.py to test Net2.
* Run convert.py to get result samples.