
Identifying segments when a person is speaking?

Does anyone know a (preferably C# .Net) library that would allow me to locate, in voice recordings, those segments in which a specific person is speaking?

Asked by Avi, Nov 27 '11

People also ask

How are speeches segmented?

Speech segmentation is the process by which the brain determines where one meaningful unit (e.g., word or morpheme) ends and the next begins in continuous speech, and it is critical for auditory language processing.

What is segmentation in language?

Language segmentation is the task of finding the boundaries where one language ends and another begins in a text written in more than one language. This is important for all natural language processing tasks.

What is speech segmentation and why is it a problem?

The segmentation and word discovery problem arises because speech does not contain any reliable acoustic analog of the blank spaces between words of printed English. As a result, children must segment the utterances they hear in order to discover the sound patterns of individual words in their language.

How does Speaker Diarization work?

Speaker diarization involves chopping an audio recording into shorter, single-speaker segments and embedding those segments of speech into a space that represents each individual speaker's unique characteristics. Those segments are then clustered and prepared for labeling.


2 Answers

It's possible with the toolkit SHoUT: http://shout-toolkit.sourceforge.net/index.html

It's written in C++ and tested on Linux, but it should also run on Windows or OS X.

The toolkit was a by-product of my PhD research on automatic speech recognition (ASR). Using it for ASR itself is perhaps not that straightforward, but for Speech Activity Detection (SAD) and diarization (finding all speech of one specific person) it is quite easy to use. Here is an example:

  1. Create a headerless PCM audio file: 16 kHz, 16-bit, little-endian, mono. I use ffmpeg to create the raw files:

     ffmpeg -i [INPUT_FILE] -vn -acodec pcm_s16le -ar 16000 -ac 1 -f s16le [RAW_FILE]

     Prefix the headerless data with the little-endian encoded file size (4 bytes). Make sure the file has a .raw extension, as shout_cluster detects the file type from the extension.

  2. Perform speech/non-speech segmentation:

     ./shout_segment -a [RAW_FILE] -ams [SHOUT_SAD_MODEL] -mo [SAD_OUTPUT]

     The output file gives you the segments in which someone is speaking (labeled "SPEECH"; of course, because it is all done automatically, the system may make mistakes), in which there is sound that is not speech ("SOUND"), or silence ("SILENCE").

  3. Perform diarization:

     ./shout_cluster -a [RAW_FILE] -mo [DIARIZATION_OUTPUT] -mi [SAD_OUTPUT]

     Using the output of shout_segment, it will try to determine how many speakers were active in the recording, label each speaker ("SPK01", "SPK02", etc.), and then find all speech segments of each speaker.
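The three steps above can be sketched as a small Python driver. This is a sketch under assumptions, not part of SHoUT itself: it assumes ffmpeg, shout_segment, and shout_cluster are on the PATH, that the SAD model path is supplied by the caller, and that the 4-byte prefix is the size of the PCM payload alone (the answer leaves that detail slightly ambiguous).

```python
# Sketch of the pipeline above; assumes ffmpeg, shout_segment, and
# shout_cluster are on the PATH. File names and the sad_model argument
# are placeholders to be filled in by the caller.
import struct
import subprocess
from pathlib import Path


def to_raw_pcm(input_file: str, raw_file: str) -> None:
    """Step 1a: decode to headerless 16 kHz, 16-bit, little-endian mono PCM."""
    subprocess.run(
        ["ffmpeg", "-i", input_file, "-vn", "-acodec", "pcm_s16le",
         "-ar", "16000", "-ac", "1", "-f", "s16le", raw_file],
        check=True,
    )


def prefix_size(raw_file: str) -> None:
    """Step 1b: prepend the payload size as a 4-byte little-endian integer.

    Assumption: the prefix counts only the PCM payload, not the 4 bytes
    of the prefix itself.
    """
    path = Path(raw_file)
    data = path.read_bytes()
    path.write_bytes(struct.pack("<I", len(data)) + data)


def segment_and_diarize(raw_file: str, sad_model: str,
                        sad_out: str, diar_out: str) -> None:
    """Steps 2 and 3: speech/non-speech segmentation, then diarization."""
    subprocess.run(["shout_segment", "-a", raw_file, "-ams", sad_model,
                    "-mo", sad_out], check=True)
    subprocess.run(["shout_cluster", "-a", raw_file, "-mo", diar_out,
                    "-mi", sad_out], check=True)
```

The .raw extension matters (shout_cluster infers the file type from it), so keep it when naming the intermediate file.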

I hope this will help!

Answered by Marijn Huijbregts, Nov 15 '22


While the above answer is accurate, I have an update for an installation issue that occurred to me on Linux while installing SHoUT: undefined reference to pthread_join. The solution I found was to open configure-make.sh from the SHoUT installation zip and change the line

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-lpthread" ../configure

to

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-pthread" ../configure

NOTE: -lpthread is changed to -pthread in LDFLAGS on Linux systems.

OS: Linux Mint 18; SHoUT version: release-2010-version-0-3

Answered by Muhammad Ahmad Mujtaba, Nov 15 '22