I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically find human speech and adjust the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) work with servers (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.
In other words, I want to give it the audio file and get something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in Python) that only finds human speech and runs on a local machine?
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation--it does the best job of any VAD I've used as far as correctly classifying human speech, even with noisy audio.
To use it for your purpose, you would do something like this:
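import webrtcvad
vad = webrtcvad.Vad()
# chunk: 10, 20, or 30 ms of 16-bit mono PCM bytes; sample_rate: 8000, 16000, 32000, or 48000 Hz
is_speech = vad.is_speech(chunk, sample_rate)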
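The VAD expects fixed-size chunks rather than a whole file, so the audio has to be sliced into frames first. Here is a minimal sketch of that step, assuming the audio has already been decoded to raw 16-bit mono PCM bytes; the name frame_generator is just an illustrative helper, not part of webrtcvad:

```python
def frame_generator(pcm: bytes, sample_rate: int, frame_ms: int = 30):
    """Yield (start_seconds, frame_bytes) for consecutive frames of 16-bit mono PCM.

    Hypothetical helper: a trailing partial frame is dropped, since the VAD
    only accepts frames of an exact duration.
    """
    bytes_per_sample = 2  # 16-bit samples
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * bytes_per_sample
    for offset in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        start = offset / (sample_rate * bytes_per_sample)
        yield start, pcm[offset:offset + bytes_per_frame]
```

Each yielded frame can then be passed to vad.is_speech(frame, sample_rate), giving one True/False flag per 30 ms of audio.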
The VAD output may be "noisy", and if it classifies a single 30 millisecond chunk of audio as speech you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30 millisecond chunks in that period are classified as speech. If they are, then you output the start time of that 0.3 second period as the beginning of speech. Then you do something similar to detect when the speech ends: Wait for a 0.3 second period of audio where the majority of 30 millisecond chunks are not classified as speech by the VAD--when that happens, output the end time as the end of speech.
You may have to tweak the timing a little bit to get good results for your purposes--maybe you decide that you need 0.2 seconds of audio where more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 seconds of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
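The smoothing described above could be sketched roughly like this. It takes the per-chunk True/False results from the VAD and returns (start, end) pairs in seconds. The function name, the parameter names, and the choice to use a single window length for both triggering and de-triggering are my own simplifications, so treat it as a starting point rather than a finished implementation:

```python
from collections import deque


def speech_segments(flags, frame_ms=30, window_ms=300,
                    start_ratio=0.5, end_ratio=0.5):
    """Turn per-frame speech flags into a list of (start_s, end_s) segments.

    Illustrative sketch: all names and default thresholds are assumptions.
    """
    frame_s = frame_ms / 1000
    # Ring buffer holding (frame_index, is_speech) for the most recent frames.
    window = deque(maxlen=max(1, window_ms // frame_ms))
    segments = []
    triggered = False
    start = 0.0
    for i, flag in enumerate(flags):
        window.append((i, flag))
        if len(window) < window.maxlen:
            continue  # wait until a full window of history is available
        voiced = sum(1 for _, is_speech in window if is_speech)
        unvoiced = len(window) - voiced
        if not triggered and voiced > start_ratio * len(window):
            # Speech begins at the start of the window that triggered.
            triggered = True
            start = window[0][0] * frame_s
            window.clear()
        elif triggered and unvoiced > end_ratio * len(window):
            # Speech ends at the end of the window that de-triggered.
            triggered = False
            segments.append((start, (i + 1) * frame_s))
            window.clear()
    if triggered:
        # Audio ended while still inside a speech segment.
        segments.append((start, len(flags) * frame_s))
    return segments
```

With these defaults, a single chunk classified as speech in the middle of silence produces no segment at all, which is the point of the smoothing. Clearing the buffer after each state change keeps one window from both starting and ending a segment, which matters once start_ratio and end_ratio differ.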