WebRTC: How to apply WebRTC's VAD to audio samples obtained from a WAV file

Currently, I am parsing WAV files and storing the samples in a std::vector<int16_t> sample. Now I want to apply VAD (Voice Activity Detection) to this data to find the "regions" of voice, and more specifically the start and end of words.

The parsed WAV files are 16 kHz, 16-bit PCM, mono. My code is in C++.

I have searched a lot but could not find proper documentation for WebRTC's VAD functions.

From what I have found, the function I need to use is WebRtcVad_Process(). Its prototype is:

int WebRtcVad_Process(VadInst* handle, int fs, const int16_t* audio_frame,
                      size_t frame_length)

From what I found here: https://stackoverflow.com/a/36826564/6487831

Each frame of audio that you send to the VAD must be 10, 20 or 30 milliseconds long. Here's an outline of an example that assumes audio_frame is 10 ms (320 bytes) of audio at 16000 Hz:

int is_voiced = WebRtcVad_Process(vad, 16000, audio_frame, 160);

It makes sense:

1 sample = 2 bytes = 16 bits
Sample rate = 16000 samples/sec = 16 samples/ms
Number of samples in 10 ms = 160

So, based on that, I have implemented this:

const int16_t * temp = sample.data();
for(int i = 0, ms = 0; i < sample.size(); i += 160, ms++)
{
    int isActive = WebRtcVad_Process(vad, 16000, temp, 160); //10 ms window
    std::cout<<ms<<" ms : "<<isActive<<std::endl;
    temp = temp + 160; // processed 160 samples
}

Now, I am not really sure whether this is correct, or whether it gives me the right output.

So,

  • Is it possible to use the samples parsed directly from the WAV files, or do they need some processing?
  • Am I looking at the correct function to do the job?
  • How do I use the function to properly perform VAD on the audio stream?
  • Is it possible to distinguish between spoken words?
  • What is the best way to check whether the output I am getting is correct?
  • If not, what is the best way to do this task?
asked Jun 09 '17 by Saurabh Shrivastava

1 Answer

I'll start by saying that no, I don't think you will be able to segment an utterance into individual words using VAD. From the article on speech segmentation in Wikipedia:

One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word.

That said, I'll try to answer your other questions.

  1. You need to decode the WAV file, which could be compressed, into raw PCM audio data before running VAD. See e.g. Reading and processing WAV file data in C/C++. Alternatively, you could use something like sox to convert the WAV file to raw audio before running your code. This command will convert a WAV file of any format to 16 kHz, 16-bit mono PCM in the format that the WebRTC VAD expects:

    sox my_file.wav -r 16000 -b 16 -c 1 -e signed-integer -B my_file.raw
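
     If you want to stay in C++, here is a minimal sketch of reading such a file directly. It assumes a canonical 44-byte header, 16-bit mono PCM data, and a little-endian machine; real-world WAV files can carry extra chunks, so prefer a proper parser or the sox conversion above for anything robust.

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Minimal loader: skips the canonical 44-byte RIFF/fmt/data headers
    // and reads the rest of the file as raw 16-bit samples.
    std::vector<int16_t> load_pcm16_wav(const std::string &path)
    {
        std::ifstream in(path, std::ios::binary);
        in.seekg(44); // assumes no extra chunks before the data chunk
        std::vector<int16_t> samples;
        int16_t s;
        while (in.read(reinterpret_cast<char *>(&s), sizeof s))
            samples.push_back(s);
        return samples;
    }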
    
  2. It looks like you are using the right function. To be more specific, you should be doing this:

    #include "webrtc/common_audio/vad/include/webrtc_vad.h"
    #include <cstdint>
    #include <iostream>
    #include <vector>
    // ...
    VadInst *vad = NULL;
    if (WebRtcVad_Create(&vad) != 0 || WebRtcVad_Init(vad) != 0) {
        // handle the error: the instance could not be created or initialized
    }
    const int16_t *temp = sample.data();
    // The bound i + 160 <= size() ensures a trailing partial frame is never
    // passed in: the VAD only accepts whole 10/20/30 ms frames.
    for (size_t i = 0, ms = 0; i + 160 <= sample.size(); i += 160, ms += 10)
    {
      int isActive = WebRtcVad_Process(vad, 16000, temp, 160); // 10 ms window
      std::cout << ms << " ms : " << isActive << std::endl;
      temp += 160; // processed 160 samples (320 bytes)
    }
    
  3. To see if it's working, you can run it on known files and check that you get the results you expect. For example, you could start by processing silence and confirm that you never (or rarely; this algorithm is not perfect) see a voiced result come back from WebRtcVad_Process. Then you could try a file that is all silence except for one short utterance in the middle, etc. If you want to compare against an existing test, the py-webrtcvad module has a unit test that does this; see the test_process_file function. A rough sketch of such a check follows.
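
     The sketch below assumes the hypothetical load_pcm16_wav helper from point 1 and an initialized vad instance; the file name and the "mostly unvoiced" expectation are only illustrations.

    // Run the VAD over a file that should be silent and report
    // the fraction of frames classified as voiced.
    std::vector<int16_t> samples = load_pcm16_wav("silence.wav");
    size_t voiced = 0, total = 0;
    const int16_t *frame = samples.data();
    for (size_t i = 0; i + 160 <= samples.size(); i += 160, frame += 160, ++total)
    {
        if (WebRtcVad_Process(vad, 16000, frame, 160) == 1)
            ++voiced;
    }
    std::cout << voiced << " of " << total
              << " frames voiced (expect close to 0 for silence)" << std::endl;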

  4. To do word-level segmentation, you will probably need to find a speech recognition library that does it, or one that gives you access to the information you need to do it yourself. E.g. this thread on the Kaldi mailing list seems to talk about how to segment by words.

answered Sep 28 '22 by John Wiseman