
Identifying segments when a person is speaking?

Does anyone know a (preferably C# .Net) library that would allow me to locate, in voice recordings, those segments in which a specific person is speaking?

Asked by Avi, Nov 27 '11

People also ask

How are speeches segmented?

Speech segmentation is the process by which the brain determines where one meaningful unit (e.g., word or morpheme) ends and the next begins in continuous speech, and it is critical for auditory language processing.

What is segmentation in language?

Language segmentation is the task of finding the boundaries where one language ends and another begins in a text written in more than one language. This is important for all natural language processing tasks.

What is speech segmentation and why is it a problem?

The segmentation and word discovery problem arises because speech does not contain any reliable acoustic analog of the blank spaces between words of printed English. As a result, children must segment the utterances they hear in order to discover the sound patterns of individual words in their language.

How does Speaker Diarization work?

Speaker diarization involves chopping an audio recording into shorter, single-speaker segments and embedding those segments of speech into a space that represents each individual speaker's unique characteristics. Those segments are then clustered and prepared for labeling.


2 Answers

It's possible with the toolkit SHoUT: http://shout-toolkit.sourceforge.net/index.html

It's written in C++ and tested on Linux, but it should also run on Windows or OS X.

The toolkit was a by-product of my PhD research on automatic speech recognition (ASR). Using it for ASR itself is perhaps not that straightforward, but for Speech Activity Detection (SAD) and diarization (finding all speech of one specific person) it is quite easy to use. Here is an example:

  1. Create a headerless PCM audio file: 16 kHz, 16-bit, little-endian, mono. I use ffmpeg to create the raw files:

     ffmpeg -i [INPUT_FILE] -vn -acodec pcm_s16le -ar 16000 -ac 1 -f s16le [RAW_FILE]

     Prefix the headerless data with the little-endian encoded file size (4 bytes). Make sure the file has a .raw extension, as shout_cluster detects the file type from the extension.

  2. Perform speech/non-speech segmentation:

     ./shout_segment -a [RAW_FILE] -ams [SHOUT_SAD_MODEL] -mo [SAD_OUTPUT]

     The output file gives you the segments in which someone is speaking (labeled "SPEECH"; of course, because it is all done automatically, the system may make mistakes), in which there is sound that is not speech ("SOUND"), or silence ("SILENCE").

  3. Perform diarization:

     ./shout_cluster -a [RAW_FILE] -mo [DIARIZATION_OUTPUT] -mi [SAD_OUTPUT]

     Using the output of shout_segment, it will try to determine how many speakers were active in the recording, label each speaker ("SPK01", "SPK02", etc.), and then find all speech segments of each speaker.
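The three steps above can be sketched as a small Python driver. This is a sketch under assumptions, not part of SHoUT itself: it assumes ffmpeg, shout_segment, and shout_cluster are on the PATH, that the SAD model path is supplied by the caller, and that the 4-byte prefix is the size of the PCM payload alone (the answer leaves that detail slightly ambiguous).

```python
# Sketch of the pipeline above; assumes ffmpeg, shout_segment, and
# shout_cluster are on the PATH. File names and the sad_model argument
# are placeholders to be filled in by the caller.
import struct
import subprocess
from pathlib import Path


def to_raw_pcm(input_file: str, raw_file: str) -> None:
    """Step 1a: decode to headerless 16 kHz, 16-bit, little-endian mono PCM."""
    subprocess.run(
        ["ffmpeg", "-i", input_file, "-vn", "-acodec", "pcm_s16le",
         "-ar", "16000", "-ac", "1", "-f", "s16le", raw_file],
        check=True,
    )


def prefix_size(raw_file: str) -> None:
    """Step 1b: prepend the payload size as a 4-byte little-endian integer.

    Assumption: the prefix counts only the PCM payload, not the 4 bytes
    of the prefix itself.
    """
    path = Path(raw_file)
    data = path.read_bytes()
    path.write_bytes(struct.pack("<I", len(data)) + data)


def segment_and_diarize(raw_file: str, sad_model: str,
                        sad_out: str, diar_out: str) -> None:
    """Steps 2 and 3: speech/non-speech segmentation, then diarization."""
    subprocess.run(["shout_segment", "-a", raw_file, "-ams", sad_model,
                    "-mo", sad_out], check=True)
    subprocess.run(["shout_cluster", "-a", raw_file, "-mo", diar_out,
                    "-mi", sad_out], check=True)
```

The .raw extension matters (shout_cluster infers the file type from it), so keep it when naming the intermediate file.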

I hope this will help!

Answered by Marijn Huijbregts, Nov 15 '22


While the above answer is accurate, I have an update for an installation issue that occurred to me on Linux while installing SHoUT: undefined reference to pthread_join. The solution I found was to open configure-make.sh from the SHoUT installation zip and change the line

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-lpthread" ../configure

to

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-pthread" ../configure

NOTE: -lpthread is changed to -pthread in LDFLAGS on Linux systems.

OS: Linux Mint 18; SHoUT version: release-2010-version-0-3

Answered by Muhammad Ahmad Mujtaba, Nov 15 '22