Many papers use CNNs to extract audio features. The input is a spectrogram with two dimensions: time and frequency.
When creating a spectrogram, the CNN needs both dimensions to have a fixed size, but they usually do not. The frequency dimension can be fixed through the window size, but what about the time dimension? Audio samples have different lengths, while the input size of a CNN must be fixed.
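To make the shape problem concrete, here is a minimal sketch (assuming librosa and hypothetical file names) showing that the frequency dimension is fixed by the FFT size while the number of time frames depends on the clip length:

```python
import numpy as np
import librosa

# Hypothetical clips of different durations
for path in ["clip_1s.wav", "clip_8s.wav"]:
    y, sr = librosa.load(path, sr=16000)
    # Magnitude spectrogram: frequency axis is 1 + n_fft // 2 = 257 bins,
    # time axis is however many hops fit into the clip.
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))
    print(path, spec.shape)  # (257, n_frames) with n_frames varying per clip
```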
In my datasets, the audio length ranges from 1 s to 8 s, and padding or cutting always affects the results too much.
So I want to know more about this method.
CNNs are computed on a frame-window basis. You take, say, 30 surrounding frames and train the CNN to classify them. In this case you need frame-level labels, which you can get from another speech recognition toolkit.
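A minimal sketch of that idea (PyTorch assumed, not part of the original answer): a variable-length spectrogram is sliced into fixed-size windows of 30 frames, one per center frame, so every CNN input has the same shape regardless of utterance length:

```python
import torch

def frame_windows(spec, context=15):
    """spec: (freq_bins, n_frames) -> (n_frames, 1, freq_bins, 2*context)"""
    # Pad the time axis so edge frames also get a full window.
    padded = torch.nn.functional.pad(spec, (context, context))
    windows = [padded[:, t:t + 2 * context] for t in range(spec.shape[1])]
    return torch.stack(windows).unsqueeze(1)  # add a channel dim for the CNN

spec = torch.randn(257, 120)   # e.g. a short utterance, 120 frames long
batch = frame_windows(spec)    # one fixed-size training example per frame
print(batch.shape)             # torch.Size([120, 1, 257, 30])
```

Each window would then be paired with the label of its center frame, which is why frame-level alignments from another toolkit are needed.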
If you want pure neural network decoding, you are better off training a recurrent neural network (RNN), which accepts arbitrary-length inputs. To increase the accuracy of an RNN you should also add a CTC loss, which lets the network adjust the state alignment itself, without external frame labels.
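Here is a minimal sketch of that setup (PyTorch assumed, with made-up sizes): an RNN over variable-length spectrograms trained with CTC loss, so no frame-level alignment is required:

```python
import torch
import torch.nn as nn

n_mels, hidden, n_symbols = 80, 256, 29           # 28 characters + CTC blank
rnn = nn.LSTM(n_mels, hidden, batch_first=True)
proj = nn.Linear(hidden, n_symbols)
ctc = nn.CTCLoss(blank=0)

# Two utterances of different lengths, zero-padded to the longest one.
feats = torch.randn(2, 800, n_mels)               # (batch, max_frames, mels)
feat_lens = torch.tensor([800, 310])              # true number of frames
targets = torch.randint(1, n_symbols, (2, 40))    # padded label sequences
target_lens = torch.tensor([40, 17])              # true label lengths

out, _ = rnn(feats)
log_probs = proj(out).log_softmax(-1).transpose(0, 1)  # (frames, batch, symbols)
loss = ctc(log_probs, targets, feat_lens, target_lens)
loss.backward()
```

CTC marginalizes over all alignments between the frame sequence and the label sequence, which is what removes the need for the frame labels mentioned above.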
If you are interested in the subject, you can try https://github.com/srvk/eesen, a toolkit designed for end-to-end speech recognition with recurrent neural networks.
Also related: Applying neural network to MFCCs for variable-length speech segments