
How can I invert a MelSpectrogram with torchaudio and get an audio waveform?

I have a MelSpectrogram generated from:

eval_seq_specgram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=256)(eval_audio_data).transpose(1, 2)

So eval_seq_specgram now has a size of torch.Size([1, 128, 499]), where 499 is the number of timesteps and 128 is the n_mels.

I want to invert it with GriffinLim, but I think I first need to undo the mel scaling, so I have:

inverse_mel_pred = torchaudio.transforms.InverseMelScale(sample_rate=sample_rate, n_stft=256)(eval_seq_specgram)

inverse_mel_pred has a size of torch.Size([1, 256, 499])

Then I'm trying to use GriffinLim:

pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)

but I get an error:

Traceback (most recent call last):
  File "evaluate_spect.py", line 63, in <module>
    main()
  File "evaluate_spect.py", line 51, in main
    pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/transforms.py", line 169, in forward
    return F.griffinlim(specgram, self.window, self.n_fft, self.hop_length, self.win_length, self.power,
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/functional.py", line 179, in griffinlim
    inverse = torch.istft(specgram * angles,
RuntimeError: The size of tensor a (256) must match the size of tensor b (129) at non-singleton dimension 1

Not sure what I'm doing wrong or how to resolve this.

Asked by Shamoon, Nov 12 '20 18:11



1 Answer

From the documentation and a quick test on Colab, it seems that:

  1. When you create the MelSpectrogram with n_fft = 256, the underlying STFT produces 256/2 + 1 = 129 frequency bins.
  2. InverseMelScale takes a parameter called n_stft, which is the number of those frequency bins rather than the FFT size, so in your case it should be set to 129.
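
The arithmetic in the two points above can be sketched as follows. This is only an illustration: the helper name onesided_bins is hypothetical and is not part of torchaudio.

```python
def onesided_bins(n_fft: int) -> int:
    """Number of frequency bins in a one-sided STFT of a real signal."""
    # A real-valued signal has a conjugate-symmetric spectrum, so only the
    # bins from 0 up to and including n_fft // 2 are kept.
    return n_fft // 2 + 1


n_fft = 256
n_stft = onesided_bins(n_fft)  # value to pass to InverseMelScale(n_stft=...)
print(n_stft)  # 129
```

Passing n_stft=256 instead produces a tensor with 256 frequency rows, which is what GriffinLim(n_fft=256) rejects, since it expects 129.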

As a side note, I don't understand why you need the transpose call. According to the docs and my tests,

import torchaudio
from torchaudio import transforms

waveform, sample_rate = torchaudio.load('test.wav')
mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)

already returns a (channel, n_mels, time) tensor, and InverseMelScale expects a tensor of shape (…, n_mels, time).

Answered by emmunaf, Sep 20 '22 09:09