
Transcribe MP3 audio file with Bing Speech API (speech to text)

I have a long recording (roughly an hour) in MP3 format. This is the information I managed to get from FFmpeg about the audio file:

[mp3 @ 000001fe666da320] Skipping 0 bytes of junk at 58650.
[mjpeg @ 000001fe666effe0] Changing bps to 8
[mp3 @ 000001fe666da320] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from '1.mp3':
Duration: 00:57:18.52, start: 0.000000, bitrate: 192 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, s16p, 192 kb/s
    Stream #0:1: Video: mjpeg, yuvj420p(pc, bt470bg/unknown/unknown), 1300x1370, 90k tbr, 90k tbn, 90k tbc

I would like to use the Bing Speech API (Microsoft Oxford - Cognitive Services - Speech API) to transcribe this file (speech to text).

I believe that this is achievable by using something like the code below.

Option 1: Before sending up any audio data, you must first send up a SpeechAudioFormat descriptor to describe the layout and format of your raw audio data via DataRecognitionClient's sendAudioFormat() method. Can you provide a code sample for this option?

Option 2: Convert the file to a format the service accepts. I have done that with FFmpeg, and this is what I got:

Duration: 00:57:23.67, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

As I understand from the documentation, this should be acceptable: The audio must be PCM, mono, 16-bit sample, with sample rate of 8000 Hz or 16000 Hz.
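One way to double-check the converted file against that requirement is to read the "fmt " chunk of the WAV header directly. The following is only a minimal sketch: the class and method names (WavFormatCheck, ReadFormat, MeetsRequirement) are made up for illustration and are not part of the Speech SDK, and it assumes a standard RIFF/WAVE file on a seekable stream.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the Bing Speech SDK.
public static class WavFormatCheck
{
    // The fields of the WAV "fmt " chunk that matter for the requirement.
    public sealed class WavFormat
    {
        public int AudioFormat;   // 1 means uncompressed PCM
        public int Channels;
        public int SampleRate;
        public int BitsPerSample;
    }

    // Walks the RIFF chunks until the "fmt " chunk is found and returns its fields.
    public static WavFormat ReadFormat(Stream stream)
    {
        using (var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true))
        {
            string riff = new string(reader.ReadChars(4));
            reader.ReadUInt32(); // overall RIFF size, not needed here
            string wave = new string(reader.ReadChars(4));
            if (riff != "RIFF" || wave != "WAVE")
                throw new InvalidDataException("Not a RIFF/WAVE file.");

            while (stream.Position + 8 <= stream.Length)
            {
                string chunkId = new string(reader.ReadChars(4));
                uint chunkSize = reader.ReadUInt32();
                if (chunkId == "fmt ")
                {
                    var format = new WavFormat();
                    format.AudioFormat = reader.ReadUInt16();
                    format.Channels = reader.ReadUInt16();
                    format.SampleRate = (int)reader.ReadUInt32();
                    reader.ReadUInt32(); // byte rate, derivable from the other fields
                    reader.ReadUInt16(); // block align
                    format.BitsPerSample = reader.ReadUInt16();
                    return format;
                }
                // Skip any other chunk; chunks are padded to an even length.
                stream.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
            }
            throw new InvalidDataException("No fmt chunk found.");
        }
    }

    // PCM, mono, 16-bit, 8000 Hz or 16000 Hz, as quoted from the documentation.
    public static bool MeetsRequirement(WavFormat f)
    {
        return f.AudioFormat == 1
            && f.Channels == 1
            && f.BitsPerSample == 16
            && (f.SampleRate == 8000 || f.SampleRate == 16000);
    }
}
```

Calling ReadFormat on a FileStream opened over the converted file and passing the result to MeetsRequirement should return true for the FFmpeg output shown above (pcm_s16le, 16000 Hz, 1 channel), provided the header really matches what FFmpeg reports.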

I tried to send the audio to the server but did not get any reply. Am I on the right track? What is the maximum buffer size?

Do you see another, maybe easier, option to get my audio file transcribed?

private void SendAudioHelper(string wavFileName)
{
    using (FileStream fileStream = new FileStream(wavFileName, FileMode.Open, FileAccess.Read))
    {
        byte[] buffer = new byte[1024];
        int bytesRead;

        try
        {
            // Read the file in buffer-sized pieces and stop at end of file,
            // so that no zero-length buffer is sent to the service.
            while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Send this piece of audio data to the service.
                this.dataClient.SendAudio(buffer, bytesRead);
            }
        }
        finally
        {
            // We are done sending audio. Final recognition results will arrive in the OnResponseReceived event call.
            this.dataClient.EndAudio();
        }
    }
}
Rotem Varon asked Dec 15 '25 03:12



1 Answer

There is a limit of 15 seconds of audio when you use the REST implementation. The SDK has a limit of 2 minutes.

Bing Speech team
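To put those limits in terms of the converted file from the question (PCM, 16000 Hz, 16-bit, mono), here is a rough sketch of the arithmetic. The class and method names are hypothetical and not part of the SDK, and the sketch only computes sizes. How each segment is actually submitted (a separate request or session per segment) is an assumption that would need to be checked against the service documentation.

```csharp
using System;

// Hypothetical helper for sizing audio segments, not part of the Bing Speech SDK.
public static class SegmentMath
{
    // Raw PCM data rate: samples per second * bytes per sample * channels.
    public static int BytesPerSecond(int sampleRate, int bitsPerSample, int channels)
    {
        return sampleRate * (bitsPerSample / 8) * channels;
    }

    // Largest amount of PCM data that still fits inside a time limit.
    public static long MaxSegmentBytes(int bytesPerSecond, int limitSeconds)
    {
        return (long)bytesPerSecond * limitSeconds;
    }

    // Number of segments needed to cover the whole recording (rounded up).
    public static long SegmentCount(long totalPcmBytes, long segmentBytes)
    {
        return (totalPcmBytes + segmentBytes - 1) / segmentBytes;
    }
}
```

At 16000 Hz, 16-bit, mono the data rate is 32,000 bytes per second, which matches the 256 kb/s that FFmpeg reports. A 15-second segment is therefore at most 480,000 bytes and a 2-minute segment at most 3,840,000 bytes, so a recording of about 57 minutes needs roughly 230 segments under the REST limit or 29 under the SDK limit. Cutting at fixed byte offsets can split a word in half, so cutting at silences would probably give better results.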

user7078407 answered Dec 16 '25 22:12



