I am using Microsoft's cognitive services. I have an audio input and need to identify multiple speakers and their individual text.
As I understand it, the Speaker Recognition API can identify different individuals, and the Bing Speech API can convert speech to text. But to do both on the same recording, it seems I would have to manually split the audio file into pieces (based on pauses/silence) and then send each segment to the individual services. Is there a better way to do this? Should I switch to another ecosystem, such as AWS (Lex/Polly) or Google's offerings?
You should try the IBM Watson Speech to Text API. It has a feature called speaker diarization that covers exactly this use case: it labels which speaker said which words in a single transcription pass, so you don't need to split the audio yourself.
More details here: https://www.ibm.com/blogs/watson/2016/12/look-whos-talking-ibm-debuts-watson-speech-text-speaker-diarization-beta/
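With diarization enabled (the `speaker_labels=true` option on a recognize request), the service returns word timestamps in `results` plus a parallel `speaker_labels` array keyed by the same `from`/`to` times. A minimal sketch of joining the two is below; the actual API call needs credentials, so the sample response here is fabricated for illustration (the speaker ids and timings are made up, but the shape follows the documented format):

```python
def label_words(response):
    """Attach a speaker id to each timestamped word in the transcript."""
    # speaker_labels entries carry (from, to, speaker) for each word slot
    speakers = {(s["from"], s["to"]): s["speaker"]
                for s in response.get("speaker_labels", [])}
    labeled = []
    for result in response["results"]:
        # each timestamp entry is [word, start_time, end_time]
        for word, start, end in result["alternatives"][0]["timestamps"]:
            labeled.append((speakers.get((start, end)), word))
    return labeled

# Fabricated sample mimicking the diarization response shape
sample = {
    "results": [{
        "alternatives": [{
            "transcript": "hello how are you ",
            "timestamps": [["hello", 0.0, 0.5], ["how", 1.1, 1.3],
                           ["are", 1.3, 1.5], ["you", 1.5, 1.7]],
        }],
    }],
    "speaker_labels": [
        {"from": 0.0, "to": 0.5, "speaker": 0, "confidence": 0.68},
        {"from": 1.1, "to": 1.3, "speaker": 1, "confidence": 0.61},
        {"from": 1.3, "to": 1.5, "speaker": 1, "confidence": 0.61},
        {"from": 1.5, "to": 1.7, "speaker": 1, "confidence": 0.61},
    ],
}

for speaker, word in label_words(sample):
    print(f"Speaker {speaker}: {word}")
```

The point is that one request gives you both the text and the speaker turns, keyed by time, so segmenting the audio on silence yourself becomes unnecessary.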