I am working on a project where I have to extract the human voice from an audio .wav file using Java.
The audio .wav file may contain 3 to 4 sounds, such as a dog, a cat, music and a human. I have to identify the human sound and then extract that part from the audio .wav file.
I am using FFT.java and Complex.java.
I have written an AudioFileReader class that reads the audio .wav file from the hard drive and converts it to a byte array. I then pass that array to FFT.fft(bytesArray), using the FFT.java and Complex.java mentioned above, which returns a Complex array.
Now the problem is how to extract the human sound byte pattern from the returned Complex array... does anyone know how I might be able to achieve this?
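One thing worth checking before the FFT step: the bytes in a .wav file are not samples by themselves. A minimal sketch of the conversion follows, assuming the data is 16-bit signed little-endian mono PCM with the header already stripped (the standard `javax.sound.sampled.AudioSystem.getAudioInputStream(File)` can handle the header). The class and method names here are hypothetical, not part of FFT.java or Complex.java.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmDecoder {
    /**
     * Converts raw PCM bytes to samples in the range [-1, 1).
     * Assumes 16-bit signed little-endian mono data with no WAV header.
     */
    static double[] toSamples(byte[] pcm) {
        int n = pcm.length / 2; // two bytes per sample; a trailing odd byte is ignored
        double[] samples = new double[n];
        ByteBuffer buf = ByteBuffer.wrap(pcm).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < n; i++) {
            samples[i] = buf.getShort() / 32768.0;
        }
        return samples;
    }
}
```

The resulting `double[]` would then be wrapped as the real parts of the Complex array handed to the FFT, rather than passing the bytes directly.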
I think the standard way to handle problems like this is to convert the input signal into a Cepstrum or Mel-Cepstrum representation and then use the coefficients as the feature space for input to a classifier. There are many research papers that discuss solutions to these sorts of problems based on this basic approach, for example:
http://www.ics.forth.gr/netlab/data/J17.pdf
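To make the Cepstrum idea concrete, here is a rough sketch of the plain real cepstrum of one frame: the inverse transform of the log magnitude spectrum. It uses a naive O(n²) DFT so that it stays self-contained; in practice you would use your FFT class and usually apply a window to each frame first. The class and method names are hypothetical, this is not the Mel-Cepstrum variant, and the classifier that would consume the coefficients is not shown.

```java
final class CepstrumSketch {
    /** Real cepstrum of one frame: IDFT(log|DFT(frame)|), via a naive DFT. */
    static double[] realCepstrum(double[] frame) {
        int n = frame.length;
        double[] logMag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            // Small offset avoids log(0) for empty bins.
            logMag[k] = Math.log(Math.hypot(re, im) + 1e-12);
        }
        // The log magnitude of a real signal is real and even,
        // so its inverse DFT reduces to a cosine sum.
        double[] cepstrum = new double[n];
        for (int q = 0; q < n; q++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += logMag[k] * Math.cos(2.0 * Math.PI * k * q / n);
            }
            cepstrum[q] = sum / n;
        }
        return cepstrum;
    }
}
```

You would typically compute this for short overlapping frames (a few tens of milliseconds each) and keep only the first dozen or so coefficients per frame as the feature vector.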
One possible shortcut you might try would be to put the input signal through a low bit-rate vocoder such as AMBE, then decode it, and compare the quality of the original signal to the encoded/decoded signal. These vocoders are designed to highly compress human speech with fair to good quality, at the expense of not being able to adequately represent non-speech sounds, so segments that survive the round trip well are more likely to be speech.
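For the comparison step, one option is a per-frame log-spectral distance, since a vocoder does not preserve the exact waveform and a sample-by-sample difference would look bad even for speech. The sketch below assumes you already have the original frame and the decoded frame as time-aligned sample arrays of equal length; the vocoder round trip itself is not shown, and the names and the choice of metric are my own assumptions, not something a particular vocoder library prescribes.

```java
final class SpectralCompare {
    private static final double EPS = 1e-12; // keeps the log ratio finite for empty bins

    /** Power spectrum for bins 0..n/2 of a real frame, via a naive DFT. */
    static double[] powerSpectrum(double[] frame) {
        int n = frame.length;
        double[] power = new double[n / 2 + 1];
        for (int k = 0; k < power.length; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            power[k] = re * re + im * im;
        }
        return power;
    }

    /** RMS difference, in dB, between the log power spectra of two frames. */
    static double logSpectralDistanceDb(double[] original, double[] decoded) {
        if (original.length != decoded.length) {
            throw new IllegalArgumentException("frames must have the same length");
        }
        double[] a = powerSpectrum(original);
        double[] b = powerSpectrum(decoded);
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diffDb = 10.0 * Math.log10((a[k] + EPS) / (b[k] + EPS));
            sum += diffDb * diffDb;
        }
        return Math.sqrt(sum / a.length);
    }
}
```

Frames with a small distance would be candidates for speech; the threshold is something you would have to tune on labelled examples.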
This can be achieved with AI, and with little short of that. You might investigate APIs for speech recognition, but I doubt they can cope with signals that have noise in the background.