How can I invert a MelSpectrogram with torchaudio and get an audio waveform?

I have a MelSpectrogram generated from:

eval_seq_specgram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=256)(eval_audio_data).transpose(1, 2)

So eval_seq_specgram now has a size of torch.Size([1, 128, 499]), where 499 is the number of timesteps and 128 is the n_mels.

I want to invert it with GriffinLim, but I think I first need to invert the mel scale, so I have:

inverse_mel_pred = torchaudio.transforms.InverseMelScale(sample_rate=sample_rate, n_stft=256)(eval_seq_specgram)

inverse_mel_pred has a size of torch.Size([1, 256, 499])

Then I'm trying to use GriffinLim:

pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)

but I get an error:

Traceback (most recent call last):
  File "evaluate_spect.py", line 63, in <module>
    main()
  File "evaluate_spect.py", line 51, in main
    pred_audio = torchaudio.transforms.GriffinLim(n_fft=256)(inverse_mel_pred)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/transforms.py", line 169, in forward
    return F.griffinlim(specgram, self.window, self.n_fft, self.hop_length, self.win_length, self.power,
  File "/home/shamoon/.local/share/virtualenvs/speech-reconstruction-7HMT9fTW/lib/python3.8/site-packages/torchaudio/functional.py", line 179, in griffinlim
    inverse = torch.istft(specgram * angles,
RuntimeError: The size of tensor a (256) must match the size of tensor b (129) at non-singleton dimension 1

Not sure what I'm doing wrong or how to resolve this.

330

asked Nov 12 '20 18:11

Shamoon

1 Answer

From the documentation and a quick test on Colab, the problem is the n_stft value:

When you create the MelSpectrogram with n_fft=256, the underlying STFT produces 256/2 + 1 = 129 frequency bins.
InverseMelScale takes a parameter called n_stft, which is that number of frequency bins, so in your case it should be set to 129 rather than 256. GriffinLim(n_fft=256) also expects 129 bins, which is why the error reports 256 vs. 129.
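To make the arithmetic concrete, here is a minimal sketch in plain Python. The helper name stft_bins is made up for illustration and is not part of torchaudio; it only encodes the n_fft // 2 + 1 rule for a one-sided STFT.

```python
def stft_bins(n_fft):
    """Number of frequency bins in a one-sided STFT of size n_fft."""
    return n_fft // 2 + 1


n_fft = 256
n_stft = stft_bins(n_fft)
print(n_stft)  # 129, the value to pass as n_stft to InverseMelScale
```

Deriving n_stft from n_fft this way, instead of hard-coding it, should keep the two transforms consistent if you later change n_fft.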

As a side note, I don't see why you need the transpose call. According to the docs and my tests,

import torchaudio
from torchaudio import transforms

waveform, sample_rate = torchaudio.load('test.wav')
mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)

already returns a (channel, n_mels, time) tensor, and InverseMelScale expects a tensor of shape (…, n_mels, time).
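Putting both points together, the inversion could look roughly like the sketch below. The function names has_mel_layout and mel_to_waveform are hypothetical helpers, not torchaudio APIs, and n_mels=128 is assumed because it is the MelSpectrogram default. I have not benchmarked this; InverseMelScale only approximates the linear spectrogram and GriffinLim only estimates the phase, so expect the audio to be lossy.

```python
def has_mel_layout(shape, n_mels=128):
    """Return True if the second-to-last dimension equals n_mels."""
    return len(shape) >= 2 and shape[-2] == n_mels


def mel_to_waveform(mel_specgram, sample_rate, n_fft=256, n_mels=128):
    """Sketch: approximate a waveform from a (channel, n_mels, time) mel spectrogram."""
    # Imported lazily so the shape helper above works without torchaudio installed.
    from torchaudio import transforms

    shape = tuple(mel_specgram.shape)
    if not has_mel_layout(shape, n_mels):
        # A transposed (channel, time, n_mels) tensor ends up here.
        raise ValueError(f"expected (..., {n_mels}, time), got {shape}")

    n_stft = n_fft // 2 + 1  # 129 frequency bins for n_fft=256
    inverse_mel = transforms.InverseMelScale(
        n_stft=n_stft, n_mels=n_mels, sample_rate=sample_rate
    )
    griffin_lim = transforms.GriffinLim(n_fft=n_fft)

    linear_specgram = inverse_mel(mel_specgram)  # (..., n_stft, time)
    return griffin_lim(linear_specgram)          # (..., samples)
```

You would call it on the untransposed MelSpectrogram output, for example mel_to_waveform(mel_specgram, sample_rate), using the same n_fft that produced the spectrogram.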

143

answered Sep 20 '22 09:09

emmunaf

Tags: python, pytorch, torchaudio
