 

What is a good approach for extracting portions of speech from an arbitrary audio file?

I have a set of audio files that are uploaded by users, and there is no knowing what they contain.

I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.

(I'm targeting a Linux environment, and developing on a Mac)

I've found SoX, which looks promising: it has a 'vad' effect (Voice Activity Detection). However, this appears to find only the first instance of speech and strip the audio up to that point, so it's close, but not quite right.
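For what it's worth, sox's silence effect (as opposed to vad), combined with the newfile and restart pseudo-effects, can split a file at silences into numbered output files; the threshold and duration values below are placeholders that need tuning per recording:

```shell
# Start a new output file each time ~0.5 s of near-silence is detected.
# Produces output001.wav, output002.wav, ... one per speech burst.
sox input.wav output.wav silence 1 0.1 1% 1 0.5 1% : newfile : restart
```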

I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.

Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?

asked Mar 31 '11 by stef


3 Answers

EnergyDetector

For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, based on the ALIZE library.

It works with feature files, not audio files, so you'll need to extract the energy of the signal first. I usually extract cepstral features (MFCC) with the log-energy parameter, and use that parameter for VAD. You can use `sfbcep`, a utility that is part of the SPro signal processing toolkit, in the following way:

sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm

This extracts 19 MFCCs + the log-energy coefficient + first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.
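What EnergyDetector ultimately models is that per-frame log-energy stream; the quantity itself is simple — a sketch of the idea (not SPro's exact windowing or normalization):

```python
import math

def frame_log_energy(samples):
    """Log of a frame's total energy. SPro-style front ends compute this
    once per 10 ms analysis frame and append it to the feature vector;
    the VAD then thresholds this single dimension."""
    energy = sum(s * s for s in samples)
    return math.log(max(energy, 1e-10))  # floor avoids log(0) on digital silence
```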

You will then run EnergyDetector in this way:

EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output 

If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.

As a reference, I attach my EnergyDetector configuration file:

*** EnergyDetector Config File
***

loadFeatureFileExtension        .prm
minLLK                          -200
maxLLK                          1000
bigEndian                       false
loadFeatureFileFormat           SPRO4
saveFeatureFileFormat           SPRO4
saveFeatureFileSPro3DataKind    FBCEPSTRA
featureServerBufferSize         ALL_FEATURES
featureServerMemAlloc           50000000
featureFilesPath                prm/
mixtureFilesPath                gmm/
lstPath                         lst/
labelOutputFrames               speech
labelSelectedFrames             all
addDefaultLabel                 true
defaultLabel                    all
saveLabelFileExtension          .lbl
labelFilesPath                  lbl/    
frameLength                     0.01
segmentalMode                   file
nbTrainIt                       8       
varianceFlooring                0.0001
varianceCeiling                 1.5     
alpha                           0.25
mixtureDistribCount             3
featureServerMask               19      
vectSize                        1
baggedFrameProbabilityInit      0.1
thresholdMode                   weight

CMU Sphinx

The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.

A very recent addition is GStreamer support, which means that you can use its VAD in a GStreamer media pipeline: see Using PocketSphinx with GStreamer and Python -> the 'vader' element.
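If I remember the element name correctly, a minimal pipeline would look something like this (gst-launch syntax for the old GStreamer 0.10 plugin; the property names may differ between versions, so treat this as a hint rather than a recipe):

```shell
gst-launch-0.10 filesrc location=input.wav ! decodebin \
    ! audioconvert ! audioresample \
    ! vader name=vad auto-threshold=true ! fakesink
```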

Other VADs

I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.

answered Oct 11 '22 by Andrea Spadaccini


webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.

It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.

The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:

import webrtcvad

vad = webrtcvad.Vad()
# sample must be 16-bit mono PCM audio data at 8, 16 or 32 kHz,
# and 10, 20, or 30 milliseconds long.
print(vad.is_speech(sample, sample_rate))
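Because webrtcvad only accepts those fixed-size frames, the caller has to chop the raw PCM stream up itself; a sketch of that chunking (the helper name here is mine, not the library's):

```python
def frames(pcm, sample_rate, frame_ms=30, sample_width=2):
    """Split raw mono PCM bytes into equal frames of frame_ms milliseconds,
    dropping any trailing partial frame (webrtcvad rejects short frames)."""
    nbytes = sample_rate * frame_ms // 1000 * sample_width
    return [pcm[i:i + nbytes] for i in range(0, len(pcm) - nbytes + 1, nbytes)]
```

Each returned chunk can then be passed to the VAD in turn, keeping track of which chunks are voiced to recover start/stop times.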

answered Oct 11 '22 by John Wiseman


pyAudioAnalysis has silence-removal functionality.

In this library, silence removal can be as simple as that:

from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, 
                             Fs, 
                             0.020, 
                             0.020, 
                             smoothWindow=1.0, 
                             Weight=0.3, 
                             plot=True)

silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames, using the 10% highest-energy frames and the 10% lowest-energy frames as training data. The SVM is then applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
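The thresholding-and-merging step in that description can be illustrated with a toy version (an illustration of the idea only, not pyAudioAnalysis's actual code; the blend formula and the name prob_to_segments are mine):

```python
def prob_to_segments(probs, frame_step, weight=0.3):
    """Toy dynamic threshold: blend the mean of the lower and upper halves
    of the sorted per-frame speech probabilities, then merge consecutive
    above-threshold frames into (start_sec, end_sec) segments."""
    p = sorted(probs)
    half = len(p) // 2
    low = sum(p[:half]) / half
    high = sum(p[half:]) / (len(p) - half)
    threshold = weight * low + (1 - weight) * high
    segments, start = [], None
    for i, pr in enumerate(probs):
        if pr >= threshold and start is None:
            start = i
        elif pr < threshold and start is not None:
            segments.append((start * frame_step, i * frame_step))
            start = None
    if start is not None:
        segments.append((start * frame_step, len(probs) * frame_step))
    return segments
```

A lower weight pulls the threshold toward the high-probability cluster, making the detector stricter; that is the knob the Weight argument above exposes.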

Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610

answered Oct 11 '22 by Theodore Giannakopoulos