
Is there a fast way to find (not necessarily recognize) human speech in an audio file?

I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically find human speech and adjust the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) work with servers (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.

In other words, I want to give it the audio file and get something like this:

[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]

Is there a solution (preferably in python) that only finds human speech and runs on a local machine?

asked Sep 15 '15 by Ilya Peterov

1 Answer

webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation--it does the best job of any VAD I've used at correctly classifying human speech, even with noisy audio.

To use it for your purpose, you would do something like this:

  1. Convert the file to 8 kHz or 16 kHz, 16-bit, mono format. This is required by the WebRTC code.
  2. Create a VAD object: vad = webrtcvad.Vad()
  3. Split the audio into 30 millisecond chunks.
  4. Check each chunk to see if it contains speech: vad.is_speech(chunk, sample_rate) (see the sketch right after this list).
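
Here is a minimal sketch of those steps, assuming the file has already been converted (e.g. with ffmpeg) to 16 kHz, 16-bit, mono WAV; the filename audio.wav and the aggressiveness setting are just placeholders:

    import wave

    import webrtcvad

    # Assumes "audio.wav" is already 16 kHz, 16-bit, mono PCM (step 1).
    with wave.open("audio.wav", "rb") as wf:
        sample_rate = wf.getframerate()       # should be 8000 or 16000 here
        pcm = wf.readframes(wf.getnframes())  # raw 16-bit mono samples as bytes

    vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most aggressive)

    frame_ms = 30
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample

    # Step 3: split the PCM data into 30 ms chunks.
    chunks = [pcm[i:i + frame_bytes]
              for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]

    # Step 4: classify each chunk.
    for n, chunk in enumerate(chunks):
        if vad.is_speech(chunk, sample_rate):
            print("speech at %.2f s" % (n * frame_ms / 1000.0))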

The VAD output may be "noisy", and if it classifies a single 30 millisecond chunk of audio as speech you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30 millisecond chunks in that period are classified as speech. If they are, output the start time of that 0.3 second period as the beginning of speech. Then do something similar to detect when the speech ends: wait for a 0.3 second period of audio where the majority of 30 millisecond chunks are not classified as speech by the VAD, and when that happens, output the end time as the end of speech.

You may have to tweak the timing a little bit to get good results for your purposes--maybe you decide that you need 0.2 seconds of audio where more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 seconds of audio with more than 50% of chunks classified as non-speech before you de-trigger.

A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
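
A hedged sketch of that triggering logic, reusing the chunks list and vad object from the sketch above; the 300 ms window and 90% threshold are only illustrative values that you would tune as described:

    import collections

    def speech_intervals(chunks, vad, sample_rate, frame_ms=30, window_ms=300, ratio=0.9):
        """Yield (start_sec, end_sec) pairs where the VAD thinks someone is speaking."""
        window = collections.deque(maxlen=window_ms // frame_ms)  # ring buffer of recent chunks
        triggered = False
        start = 0.0
        t = 0.0
        for n, chunk in enumerate(chunks):
            t = n * frame_ms / 1000.0                              # start time of this chunk
            window.append((t, vad.is_speech(chunk, sample_rate)))
            voiced = sum(1 for _, is_speech in window if is_speech)
            if not triggered and voiced >= ratio * window.maxlen:
                triggered = True
                start = window[0][0]         # speech began roughly where this window began
            elif triggered and (len(window) - voiced) >= ratio * window.maxlen:
                triggered = False
                yield (start, window[0][0])  # speech ended roughly where the quiet window began
        if triggered:
            yield (start, t + frame_ms / 1000.0)  # still speaking at end of file

Calling list(speech_intervals(chunks, vad, sample_rate)) would then give you a list of (start, end) pairs in seconds, close to the format in the question.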

answered Nov 03 '22 by John Wiseman