I have a set of audio files that are uploaded by users, and there is no knowing what they contain.
I would like to take an arbitrary audio file, and extract each of the instances where someone is speaking into separate audio files. I don't want to detect the actual words, just the "started speaking", "stopped speaking" points and generate new files at these points.
(I'm targeting a Linux environment, and developing on a Mac)
I've found SoX, which looks promising, and it has a 'vad' effect (Voice Activity Detection). However, this appears to find the first instance of speech and strip audio up to that point, so it's close, but not quite right.
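For reference, a typical invocation looks something like this (the second form uses the reverse trick from the SoX documentation to also strip trailing non-speech, but neither form splits the file at every pause):

sox input.wav output.wav vad
sox input.wav output.wav vad reverse vad reverse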
I've also looked at Python's 'wave' library, but then I'd need to write my own implementation of Sox's 'vad'.
Are there any command line tools that would do what I want off the shelf? If not, any good Python or Ruby approaches?
For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, based on the ALIZE library.
It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
It will extract 19 MFCCs plus the log-energy coefficient, plus first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File ***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
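Once you have the label file, cutting the original audio into per-segment files is straightforward. Below is a minimal sketch (not part of the toolkit) that assumes each line of the .lbl file holds a 'start end label' triple with start/end given as frame indices, converted to seconds via frameLength; check the actual output of your build, as some versions write seconds directly:

import wave

FRAME_LENGTH = 0.01  # must match frameLength in EnergyDetector.cfg

def split_wav(wav_path, lbl_path, out_prefix):
    # Read the whole source file once.
    with wave.open(wav_path, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        bytes_per_sample = w.getsampwidth() * w.getnchannels()
        audio = w.readframes(w.getnframes())

    with open(lbl_path) as f:
        for i, line in enumerate(f):
            start, end, label = line.split()
            if label != "speech":  # keep only the segments labelled as speech
                continue
            # Convert frame indices to byte offsets into the sample data.
            a = int(float(start) * FRAME_LENGTH * rate) * bytes_per_sample
            b = int(float(end) * FRAME_LENGTH * rate) * bytes_per_sample
            with wave.open("%s_%03d.wav" % (out_prefix, i), "wb") as out:
                out.setparams(params)
                out.writeframes(audio[a:b])

split_wav("input.wav", "lbl/output.lbl", "speech_segment")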
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See 'Using PocketSphinx with GStreamer and Python', in particular the 'vader' element.
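As a sketch based on that tutorial (GStreamer 0.10 era; treat the exact element names and properties as assumptions to check against your installed plugin), a pipeline using the vader element could look like:

gst-launch-0.10 filesrc location=input.wav ! decodebin ! audioconvert ! audioresample ! vader name=vad auto-threshold=true ! pocketsphinx ! fakesink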
I have also been using a modified version of the AMR1 Codec that outputs a file with speech/non speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
import webrtcvad

vad = webrtcvad.Vad()
# sample must be 16-bit mono PCM audio data at 8, 16 or 32 kHz,
# in frames 10, 20, or 30 milliseconds long.
print(vad.is_speech(sample, sample_rate))
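If example.py doesn't fit your pipeline, a minimal sketch of the "started speaking" / "stopped speaking" detection the question asks for could look like this (assuming a mono, 16-bit input.wav at one of the supported sample rates; the file name, frame size and aggressiveness are arbitrary choices):

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)

with wave.open("input.wav", "rb") as w:
    assert w.getnchannels() == 1 and w.getsampwidth() == 2
    rate = w.getframerate()
    audio = w.readframes(w.getnframes())

frame_bytes = int(rate * 0.03) * 2  # 30 ms frames of 16-bit samples
speaking = False
for i in range(0, len(audio) - frame_bytes + 1, frame_bytes):
    voiced = vad.is_speech(audio[i:i + frame_bytes], rate)
    if voiced != speaking:  # report each transition
        print("%s speaking at %.2f s" %
              ("started" if voiced else "stopped", i / 2.0 / rate))
        speaking = voiced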
pyAudioAnalysis has a silence removal functionality.
In this library, silence removal can be as simple as that:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Read the recording, then detect the non-silent segments using
# 20 ms short-term windows and a 1 s smoothing window.
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020,
                             smoothWindow=1.0, Weight=0.3, plot=True)
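In the versions I have used, segments is a list of [start, end] pairs in seconds (verify this against your version of the library), so writing each detected segment to its own file could look like:

from scipy.io import wavfile

# x and Fs come from the snippet above; the [start, end] layout of
# segments is an assumption to check against your pyAudioAnalysis version.
for i, (s, e) in enumerate(segments):
    wavfile.write("segment_%03d.wav" % i, Fs, x[int(s * Fs):int(e * Fs)])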
silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670
Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. Towards this end, 10% of the highest-energy frames along with 10% of the lowest ones are used. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
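To make that concrete, here is a rough self-contained illustration of the idea (this is not the library's actual code; the 10% split is from the description above, but the thresholding rule here is a simplified stand-in):

import numpy as np
from sklearn.svm import SVC

def simple_energy_vad(frame_energies, pct=0.10, weight=0.3):
    e = np.asarray(frame_energies, dtype=float).reshape(-1, 1)
    n = max(1, int(len(e) * pct))
    order = np.argsort(e[:, 0])
    # Train on the lowest-energy (label 0) and highest-energy (label 1) frames.
    X = np.vstack([e[order[:n]], e[order[-n:]]])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    svm = SVC(probability=True).fit(X, y)
    # Score every frame, then threshold the speech probability.
    p = svm.predict_proba(e)[:, 1]
    threshold = weight * p.max() + (1 - weight) * p.min()  # simplified stand-in
    return p > threshold  # boolean mask of active (speech) frames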
Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610