MCPcopy
hub / github.com/Audio-AGI/AudioSep / get_audio_features

Function get_audio_features

models/CLAP/training/data.py:451–563  ·  view source on GitHub ↗

Calculate and add audio features to sample. Sample: a dict containing all the data of current sample. audio_data: a tensor of shape (T) containing audio data. max_len: the maximum length of audio data. data_truncating: the method of truncating data. data_filling: the method

(
    sample, audio_data, max_len, data_truncating, data_filling, audio_cfg
)

Source from the content-addressed store, hash-verified

449
450
451def get_audio_features(
452 sample, audio_data, max_len, data_truncating, data_filling, audio_cfg
453):
454 """
455 Calculate and add audio features to sample.
456 Sample: a dict containing all the data of current sample.
457 audio_data: a tensor of shape (T) containing audio data.
458 max_len: the maximum length of audio data.
459 data_truncating: the method of truncating data.
460 data_filling: the method of filling data.
461 audio_cfg: a dict containing audio configuration. Comes from model_cfg['audio_cfg'].
462 """
463 with torch.no_grad():
464 if len(audio_data) > max_len:
465 if data_truncating == "rand_trunc":
466 longer = torch.tensor([True])
467 elif data_truncating == "fusion":
468 # fusion
469 mel = get_mel(audio_data, audio_cfg)
470 # split to three parts
471 chunk_frames = (
472 max_len // audio_cfg["hop_size"] + 1
473 ) # the +1 related to how the spectrogram is computed
474 total_frames = mel.shape[0]
475 if chunk_frames == total_frames:
476 # there is a corner case where the audio length is
477 # larger than max_len but smaller than max_len+hop_size.
478 # In this case, we just use the whole audio.
479 mel_fusion = torch.stack([mel, mel, mel, mel], dim=0)
480 sample["mel_fusion"] = mel_fusion
481 longer = torch.tensor([False])
482 else:
483 ranges = np.array_split(
484 list(range(0, total_frames - chunk_frames + 1)), 3
485 )
486 # print('total_frames-chunk_frames:', total_frames-chunk_frames,
487 # 'len(audio_data):', len(audio_data),
488 # 'chunk_frames:', chunk_frames,
489 # 'total_frames:', total_frames)
490 if len(ranges[1]) == 0:
491 # if the audio is too short, we just use the first chunk
492 ranges[1] = [0]
493 if len(ranges[2]) == 0:
494 # if the audio is too short, we just use the first chunk
495 ranges[2] = [0]
496 # randomly choose index for each part
497 idx_front = np.random.choice(ranges[0])
498 idx_middle = np.random.choice(ranges[1])
499 idx_back = np.random.choice(ranges[2])
500 # select mel
501 mel_chunk_front = mel[idx_front : idx_front + chunk_frames, :]
502 mel_chunk_middle = mel[idx_middle : idx_middle + chunk_frames, :]
503 mel_chunk_back = mel[idx_back : idx_back + chunk_frames, :]
504
505 # shrink the mel
506 mel_shrink = torchvision.transforms.Resize(size=[chunk_frames, 64])(
507 mel[None]
508 )[0]

Callers 3

_get_audio_embedMethod · 0.90
infer_audioFunction · 0.90
preprocessFunction · 0.85

Calls 1

get_melFunction · 0.85

Tested by

no test coverage detected