I am planning to build software that can classify a piece of music as good or bad using artificial neural networks. For this, I need to convert the audio into numerical values to feed to the NN as input. To build a training set, I downloaded the Billboard Hot 100 songs (which I believe should count as good music) and some noise audio files (which will count as bad music). I then converted them all to .wav format and split each file into multiple 2-second .wav clips. I was planning to use the fast Fourier transform (FFT) to convert these clips into frequency-amplitude pairs, but the problem is that even for a 2-second clip, the FFT produces an array of about 100,000 such pairs. Doing this for thousands of audio files would produce a dataset that is far too large, with far too many features.
I wanted to know: is there any way to shorten this dataset while keeping the 'essence of the music' in it, so that better predictions can be made? Or should I use some other algorithm or process?
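To make the dimensionality problem concrete, here is a small NumPy sketch of the FFT size for one clip. The 44.1 kHz sample rate is an assumption (CD-quality audio); substitute the actual rate of your .wav files.

```python
import numpy as np

SAMPLE_RATE = 44_100      # assumed CD-quality rate; use your files' actual rate
CLIP_SECONDS = 2

n_samples = SAMPLE_RATE * CLIP_SECONDS     # 88,200 samples per 2-second clip
clip = np.random.randn(n_samples)          # stand-in for a real audio clip

# One-sided FFT of a real signal: n_samples // 2 + 1 frequency bins
spectrum = np.fft.rfft(clip)
print(len(spectrum))                       # 44,101 bins per clip
```

So even the one-sided spectrum of a single 2-second clip has tens of thousands of values, which is why raw FFT output is rarely fed to a network directly.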
One commonly used approach is a CNN (Convolutional Neural Network) plus RNN (Recurrent Neural Network) architecture that uses the CTC loss to demarcate each character of the words in the speech, e.g. Baidu's Deep Speech model.
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio.
The input layer of a neural network is composed of artificial input neurons; it is the very beginning of the workflow and brings the initial data into the system for processing by the subsequent layers.
Adding noise to the inputs means the network is less able to memorize training samples, because they change slightly every time they are seen; this results in smaller network weights and a more robust network with lower generalization error.
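A minimal sketch of this kind of input-noise augmentation, assuming your inputs are already NumPy feature arrays; the noise level `sigma` is a hypothetical value you would tune.

```python
import numpy as np

def add_noise(batch, sigma=0.01, rng=None):
    """Return a copy of `batch` with small Gaussian noise added.

    Applied afresh each epoch, so the network never sees the exact
    same training sample twice.
    """
    rng = rng or np.random.default_rng(0)
    return batch + rng.normal(0.0, sigma, size=batch.shape)

features = np.zeros((4, 13))      # e.g. 4 frames of 13 features each
noisy = add_noise(features)       # same shape, slightly perturbed
```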
First, you can extract various audio features, such as:
1) Compactness.
2) Magnitude spectrum.
3) Mel-frequency cepstral coefficients.
4) Pitch.
5) Power Spectrum.
6) RMS.
7) Rhythm.
8) Spectral Centroid.
9) Spectral Flux.
10) Spectral RollOff Point.
11) Spectral Variability.
12) Zero Crossings.
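A few of these features are simple enough to sketch with plain NumPy; this illustrative helper computes RMS (6), zero crossings (12), and the spectral centroid (8) for one audio frame. In practice a library such as librosa provides ready-made implementations of most of the list.

```python
import numpy as np

def frame_features(frame, sr):
    """Compute a few of the listed features for one audio frame.

    frame: 1-D array of audio samples; sr: sample rate in Hz.
    """
    # 6) RMS: root mean square of the samples
    rms = np.sqrt(np.mean(frame ** 2))

    # 12) Zero crossings: count sign changes in the waveform
    zero_crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2

    # 8) Spectral centroid: magnitude-weighted mean frequency
    mags = np.abs(np.fft.rfft(frame))            # 2) magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mags) / (np.sum(mags) + 1e-12)

    return rms, zero_crossings, centroid

# Usage: a 1-second 440 Hz sine at 8 kHz should give an RMS near
# 1/sqrt(2), roughly 880 zero crossings, and a centroid near 440 Hz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
rms, zc, centroid = frame_features(tone, sr)
```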
After generating the feature set you have two options:
A) Aggregate each feature over a song's frames by taking the mean [and/or variance], concatenate all the aggregated features into one fixed-length vector per song, then feed that into an Artificial Neural Network and perform the classification task.
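Option A can be sketched in a few lines of NumPy: per-frame features become one fixed-length vector per song by concatenating the per-feature mean and variance, regardless of how many frames the song has.

```python
import numpy as np

def song_vector(frame_feats):
    """Collapse per-frame features into one fixed-length song vector.

    frame_feats: (n_frames, n_features) array.
    Returns a vector of length 2 * n_features: per-feature means
    followed by per-feature variances.
    """
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.var(axis=0)])

feats = np.arange(12, dtype=float).reshape(4, 3)  # 4 frames, 3 features
vec = song_vector(feats)                          # length-6 vector
```

Because the output length depends only on the number of features, songs of any duration map to the same input size, which is what a plain feed-forward ANN requires.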
B) Feed the frame-by-frame feature sequences directly into a Recurrent Neural Network for the classification task.