hub / github.com/babysor/MockingBird / embed_utterance

Function embed_utterance

encoder/inference.py:111–155

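Computes an embedding for a single utterance. wav is a preprocessed (see audio.py) utterance waveform as a numpy array of float32. If using_partials is True, the utterance is split into partial utterances and the embedding is computed from their normalized average; otherwise the entire spectrogram is fed to the network. If return_partials is True, the partial embeddings and the matching wav slices are returned as well.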

(wav, using_partials=True, return_partials=False, **kwargs)

Source from the content-addressed store, hash-verified

def embed_utterance(wav, using_partials=True, return_partials=False, **kwargs):
    """
    Computes an embedding for a single utterance.

    # TODO: handle multiple wavs to benefit from batching on GPU
    :param wav: a preprocessed (see audio.py) utterance waveform as a numpy array of float32
    :param using_partials: if True, then the utterance is split into partial utterances of
    <partial_utterance_n_frames> frames and the utterance embedding is computed from their
    normalized average. If False, the utterance embedding is instead computed by feeding the
    entire spectrogram to the network.
    :param return_partials: if True, the partial embeddings will also be returned along with the
    wav slices that correspond to the partial embeddings.
    :param kwargs: additional arguments to compute_partial_slices()
    :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
    <return_partials> is True, the partial embeddings as a numpy array of float32 of shape
    (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
    returned. If <using_partials> is simultaneously set to False, both these values will be None
    instead.
    """
    # Process the entire utterance if not using partials
    if not using_partials:
        frames = audio.wav_to_mel_spectrogram(wav)
        # Add a batch dimension of size 1, then take the single resulting embedding
        embed = embed_frames_batch(frames[None, ...])[0]
        if return_partials:
            return embed, None, None
        return embed

    # Compute where to split the utterance into partials and pad if necessary
    wave_slices, mel_slices = compute_partial_slices(len(wav), **kwargs)
    max_wave_length = wave_slices[-1].stop
    if max_wave_length >= len(wav):
        wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

    # Split the utterance into partials
    frames = audio.wav_to_mel_spectrogram(wav)
    frames_batch = np.array([frames[s] for s in mel_slices])
    partial_embeds = embed_frames_batch(frames_batch)

    # Compute the utterance embedding from the partial embeddings
    raw_embed = np.mean(partial_embeds, axis=0)
    embed = raw_embed / np.linalg.norm(raw_embed, 2)

    if return_partials:
        return embed, partial_embeds, wave_slices
    return embed


def embed_speaker(wavs, **kwargs):
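
The partials path in the listing above (pad, slice, embed, average, normalize) can be tried without the trained model. The following is a minimal sketch under stated assumptions, not the repository's code: fake_embed_frames_batch is a hypothetical stand-in for embed_frames_batch, and the slice boundaries, frame counts, and array sizes are made-up values rather than the output of compute_partial_slices.

```python
import numpy as np


def fake_embed_frames_batch(frames_batch):
    # Hypothetical stand-in for the real model: averages each partial over
    # the time axis, giving one vector per partial utterance.
    return frames_batch.mean(axis=1)


rng = np.random.default_rng(0)

# Made-up waveform and slices; the real ones come from compute_partial_slices().
wav = np.zeros(20000, dtype=np.float32)
wave_slices = [slice(0, 16000), slice(8000, 24000)]
mel_slices = [slice(0, 160), slice(80, 240)]

# Zero-pad the waveform so the last wave slice fits entirely inside it.
max_wave_length = wave_slices[-1].stop
if max_wave_length >= len(wav):
    wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")

# Made-up mel frames of shape (n_frames, n_mel_channels).
frames = rng.random((240, 40)).astype(np.float32)

# Stack one fixed-length window per partial: shape (n_partials, 160, 40).
frames_batch = np.array([frames[s] for s in mel_slices])
partial_embeds = fake_embed_frames_batch(frames_batch)

# Average the partial embeddings, then scale the result to unit L2 norm.
raw_embed = np.mean(partial_embeds, axis=0)
embed = raw_embed / np.linalg.norm(raw_embed, 2)
```

Because of the final division, the utterance embedding has unit L2 norm regardless of how many partials were averaged.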

Callers

nothing calls this directly

Calls 2

embed_frames_batch · Function · 0.85
compute_partial_slices · Function · 0.85

Tested by

no test coverage detected