I have a set of audio files that are uploaded by users, and there is no knowing what they contain.
I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.
(I'm targeting a Linux environment, and developing on a Mac)
I've found SoX, which looks promising, and it has a 'vad' effect (Voice Activity Detection). However, this appears to find the first instance of speech and strip audio up to that point, so it's close, but not quite right.
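For reference, a typical invocation looks something like this (the second form uses the reverse trick from the SoX documentation to also strip trailing non-speech, but neither form splits the file at every pause):

sox input.wav output.wav vad
sox input.wav output.wav vad reverse vad reverse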
I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.
Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?
For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, based on the ALIZE library.
It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
It will extract 19 MFCCs plus the log-energy coefficient, plus first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File ***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
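Once you have the label file, cutting the original audio into per-segment files is straightforward. Below is a minimal sketch (not part of the toolkit) that assumes each line of the .lbl file holds a 'start end label' triple with start/end given as frame indices, converted to seconds via frameLength; check the actual output of your build, as some versions write seconds directly:

import wave

FRAME_LENGTH = 0.01  # must match frameLength in EnergyDetector.cfg

def split_wav(wav_path, lbl_path, out_prefix):
    # Read the whole source file once.
    with wave.open(wav_path, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        bytes_per_sample = w.getsampwidth() * w.getnchannels()
        audio = w.readframes(w.getnframes())

    with open(lbl_path) as f:
        for i, line in enumerate(f):
            start, end, label = line.split()
            if label != "speech":  # keep only the segments labelled as speech
                continue
            # Convert frame indices to byte offsets into the sample data.
            a = int(float(start) * FRAME_LENGTH * rate) * bytes_per_sample
            b = int(float(end) * FRAME_LENGTH * rate) * bytes_per_sample
            with wave.open("%s_%03d.wav" % (out_prefix, i), "wb") as out:
                out.setparams(params)
                out.writeframes(audio[a:b])

split_wav("input.wav", "lbl/output.lbl", "speech_segment")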
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See 'Using PocketSphinx with GStreamer and Python', in particular the 'vader' element.
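As a sketch based on that tutorial (GStreamer 0.10 era; treat the exact element names and properties as assumptions to check against your installed plugin), a pipeline using the vader element could look like:

gst-launch-0.10 filesrc location=input.wav ! decodebin ! audioconvert ! audioresample ! vader name=vad auto-threshold=true ! pocketsphinx ! fakesink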
I have also been using a modified version of the AMR1 Codec that outputs a file with speech/non speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
import webrtcvad

vad = webrtcvad.Vad()
# sample must be 16-bit mono PCM audio data at 8, 16 or 32 kHz,
# in frames 10, 20, or 30 milliseconds long.
print(vad.is_speech(sample, sample_rate))
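If example.py doesn't fit your pipeline, a minimal sketch of the "started speaking" / "stopped speaking" detection the question asks for could look like this (assuming a mono, 16-bit input.wav at one of the supported sample rates; the file name, frame size and aggressiveness are arbitrary choices):

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

with wave.open("input.wav", "rb") as w:
    assert w.getnchannels() == 1 and w.getsampwidth() == 2
    rate = w.getframerate()
    audio = w.readframes(w.getnframes())

frame_bytes = int(rate * 0.03) * 2  # 30 ms frames of 16-bit samples
speaking = False
for i in range(0, len(audio) - frame_bytes + 1, frame_bytes):
    voiced = vad.is_speech(audio[i:i + frame_bytes], rate)
    if voiced != speaking:  # report each transition
        print("%s speaking at %.2f s" %
              ("started" if voiced else "stopped", i / 2.0 / rate))
        speaking = voiced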
pyAudioAnalysis has a silence removal functionality.
In this library, silence removal can be as simple as that:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Read the recording, then detect the non-silent segments using
# 20 ms short-term windows and a 1 s smoothing window.
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020,
                             smoothWindow=1.0, Weight=0.3, plot=True)
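In the versions I have used, segments is a list of [start, end] pairs in seconds (verify this against your version of the library), so writing each detected segment to its own file could look like:

from scipy.io import wavfile

# x and Fs come from the snippet above; the [start, end] layout of
# segments is an assumption to check against your pyAudioAnalysis version.
for i, (s, e) in enumerate(segments):
    wavfile.write("segment_%03d.wav" % i, Fs, x[int(s * Fs):int(e * Fs)])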
silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670
Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, 10% of the highest-energy frames along with 10% of the lowest ones are used. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
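To make that concrete, here is a rough self-contained illustration of the idea (this is not the library's actual code; the 10% split is from the description above, but the thresholding rule here is a simplified stand-in):

import numpy as np
from sklearn.svm import SVC

def simple_energy_vad(frame_energies, pct=0.10, weight=0.3):
    e = np.asarray(frame_energies, dtype=float).reshape(-1, 1)
    n = max(1, int(len(e) * pct))
    order = np.argsort(e[:, 0])
    # Train on the lowest-energy (label 0) and highest-energy (label 1) frames.
    X = np.vstack([e[order[:n]], e[order[-n:]]])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    svm = SVC(probability=True).fit(X, y)
    # Score every frame, then threshold the speech probability.
    p = svm.predict_proba(e)[:, 1]
    threshold = weight * p.max() + (1 - weight) * p.min()  # simplified stand-in
    return p > threshold  # boolean mask of active (speech) frames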
Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610