Currently, I am parsing WAV files and storing the samples in a std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.
The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.
I have searched a lot but could not find proper documentation for WebRTC's VAD functions.
From what I have found, the function that I need to use is WebRtcVad_Process(). Its prototype is written below:
int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
size_t frame_length)
From what I found here: https://stackoverflow.com/a/36826564/6487831
Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long. Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:
int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);
It makes sense:
1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
For 10 ms, number of samples = 160
So, based on that, I have implemented this:
const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
    std::cout << ms << " ms : " << isActive << std::endl;
    temp = temp + 160; // processed 160 samples
}
Now, I am not really sure if this is correct, and I am also unsure whether it gives me the correct output.
So, am I doing this correctly, and how can I verify that the results make sense?
I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:
One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.
That said, I'll try to answer your other questions.
You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 kHz, 16-bit PCM in the format that WebRTC VAD expects:
sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
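Once sox has produced a headerless .raw file, loading it into the std::vector<int16_t> sample from the question is just a block read. Here is a rough sketch; the file name and helper name are hypothetical, and note that the -B flag above writes big-endian samples, so on a little-endian machine you would either drop that flag or byte-swap each sample after reading:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read a headerless file of 16-bit PCM samples into a vector.
std::vector<int16_t> ReadRawPcm(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    const std::streamsize bytes = in.tellg();
    in.seekg(0, std::ios::beg);

    std::vector<int16_t> samples(static_cast<size_t>(bytes) / sizeof(int16_t));
    in.read(reinterpret_cast<char*>(samples.data()),
            samples.size() * sizeof(int16_t));
    return samples;
}

// Usage: std::vector<int16_t> sample = ReadRawPcm("my_file.raw");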
It looks like you are using the right function. To be more specific, you should be doing this:
#include "webrtc/common_audio/vad/include/webrtc_vad.h"
// ...
VadInst *vad;
WebRtcVad_Create(&vad);
WebRtcVad_Init(vad);
const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms += 10)
{
int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
std::cout << ms << " ms : " << isActive << std::endl;
temp = temp + 160; // processed 160 samples (320 bytes)
}
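Since what you ultimately want is the start and end of voiced regions, one straightforward approach is to group consecutive voiced frames. Below is a minimal sketch of that idea; the FindVoicedRegions helper and the VoicedRegion struct are just names I made up, and the grouping logic (no smoothing, no hangover) is my own simplification rather than anything provided by WebRTC:

#include <cstdint>
#include <vector>

#include "webrtc/common_audio/vad/include/webrtc_vad.h"

struct VoicedRegion {
    size_t start_ms;
    size_t end_ms;
};

// Run the VAD over 10 ms frames and merge consecutive voiced frames into regions.
std::vector<VoicedRegion> FindVoicedRegions(const std::vector<int16_t>& sample)
{
    const size_t kFrameSamples = 160; // 10 ms at 16 kHz
    std::vector<VoicedRegion> regions;

    VadInst *vad = nullptr;
    WebRtcVad_Create(&vad);
    WebRtcVad_Init(vad);

    bool in_region = false;
    size_t region_start_ms = 0;
    size_t ms = 0;
    for (size_t i = 0; i + kFrameSamples <= sample.size(); i += kFrameSamples, ms += 10) {
        int is_voiced = WebRtcVad_Process(vad, 16000, sample.data() + i, kFrameSamples);
        if (is_voiced == 1 && !in_region) {
            in_region = true;
            region_start_ms = ms;
        } else if (is_voiced != 1 && in_region) {
            in_region = false;
            regions.push_back({region_start_ms, ms});
        }
    }
    if (in_region) {
        regions.push_back({region_start_ms, ms}); // region runs to the end of the audio
    }

    WebRtcVad_Free(vad);
    return regions;
}

In practice you would probably want to ignore single-frame blips in both directions (a lone voiced frame inside silence, or a lone unvoiced frame inside speech) before treating a region boundary as real.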
To see if it's working, you can run known files and see if you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely--this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare to an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function.
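As a quick sanity check along those lines, you can feed the VAD pure silence and confirm that essentially nothing comes back as voiced. This snippet reuses the hypothetical FindVoicedRegions helper sketched above:

// One second of digital silence at 16 kHz should yield no (or almost no) voiced regions.
std::vector<int16_t> silence(16000, 0);
std::vector<VoicedRegion> regions = FindVoicedRegions(silence);
std::cout << "voiced regions found in silence: " << regions.size() << std::endl;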
To do word-level segmentation, you will probably need to find a speech recognition library that does it or gives you access to the information that you need to do it. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.