Many papers use CNNs to extract audio features. The input is a spectrogram with two dimensions: time and frequency.
When creating a spectrogram, the CNN needs both dimensions to have a fixed size, but they usually do not. The frequency dimension can be fixed through the window size, but what about the time dimension? Audio samples have different lengths, while the input size of a CNN must be fixed.
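To make the shape problem concrete, here is a minimal sketch (assuming librosa and hypothetical file names) showing that the frequency dimension is fixed by the FFT size while the number of time frames depends on the clip length:

```python
import numpy as np
import librosa

# Hypothetical clips of different durations
for path in ["clip_1s.wav", "clip_8s.wav"]:
    y, sr = librosa.load(path, sr=16000)
    # Magnitude spectrogram: frequency axis is 1 + n_fft // 2 = 257 bins,
    # time axis is however many hops fit into the clip.
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))
    print(path, spec.shape)  # (257, n_frames) with n_frames varying per clip
```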
In my datasets, the audio length ranges from 1 s to 8 s, and padding or cutting always affects the results too much.
So I want to know more about this method.
CNNs are computed on a frame-window basis. You take, say, 30 surrounding frames and train the CNN to classify them. In this case you need frame-level labels, which you can get from another speech recognition toolkit.
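A minimal sketch of that idea (PyTorch assumed, not part of the original answer): a variable-length spectrogram is sliced into fixed-size windows of 30 frames, one per center frame, so every CNN input has the same shape regardless of utterance length:

```python
import torch

def frame_windows(spec, context=15):
    """spec: (freq_bins, n_frames) -> (n_frames, 1, freq_bins, 2*context)"""
    # Pad the time axis so edge frames also get a full window.
    padded = torch.nn.functional.pad(spec, (context, context))
    windows = [padded[:, t:t + 2 * context] for t in range(spec.shape[1])]
    return torch.stack(windows).unsqueeze(1)  # add a channel dim for the CNN

spec = torch.randn(257, 120)   # e.g. a short utterance, 120 frames long
batch = frame_windows(spec)    # one fixed-size training example per frame
print(batch.shape)             # torch.Size([120, 1, 257, 30])
```

Each window would then be paired with the label of its center frame, which is why frame-level alignments from another toolkit are needed.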
If you want pure neural network decoding, you are better off training a recurrent neural network (RNN), which accepts arbitrary-length inputs. To increase the accuracy of an RNN you should also add a CTC loss, which lets the network adjust the state alignment itself, without external frame labels.
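Here is a minimal sketch of that setup (PyTorch assumed, with made-up sizes): an RNN over variable-length spectrograms trained with CTC loss, so no frame-level alignment is required:

```python
import torch
import torch.nn as nn

n_mels, hidden, n_symbols = 80, 256, 29           # 28 characters + CTC blank
rnn = nn.LSTM(n_mels, hidden, batch_first=True)
proj = nn.Linear(hidden, n_symbols)
ctc = nn.CTCLoss(blank=0)

# Two utterances of different lengths, zero-padded to the longest one.
feats = torch.randn(2, 800, n_mels)               # (batch, max_frames, mels)
feat_lens = torch.tensor([800, 310])              # true number of frames
targets = torch.randint(1, n_symbols, (2, 40))    # padded label sequences
target_lens = torch.tensor([40, 17])              # true label lengths

out, _ = rnn(feats)
log_probs = proj(out).log_softmax(-1).transpose(0, 1)  # (frames, batch, symbols)
loss = ctc(log_probs, targets, feat_lens, target_lens)
loss.backward()
```

CTC marginalizes over all alignments between the frame sequence and the label sequence, which is what removes the need for the frame labels mentioned above.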
If you are interested in the subject, you can try https://github.com/srvk/eesen, a toolkit designed for end-to-end speech recognition with recurrent neural networks.
Also related: Applying neural network to MFCCs for variable-length speech segments