It's possible to use Google's Speech recognition API to get a transcription of an audio file (WAV, MP3, etc.) by sending a request to http://www.google.com/speech-api/v2/recognize?...
Example: I said "one two three four five" in a WAV file. The Google API gives me this:
{
    u'alternative':
    [
        {u'transcript': u'12345'},
        {u'transcript': u'1 2 3 4 5'},
        {u'transcript': u'one two three four five'}
    ],
    u'final': True
}
Question: is it possible to get the start and end time (in seconds) at which each word was said?
With my example:
['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.
i.e. the word "one" was said between 00:00:00.23 and 00:00:00.80,
and the word "two" was said between 00:00:01.03 and 00:00:01.45.
PS: I am looking for an API that supports languages other than English, especially French.
Google ranked second, with a transcript accuracy rate of 84 percent (an error rate of 16 percent).
A Speech-to-Text API synchronous recognition request is the simplest method for performing recognition on speech audio data. Speech-to-Text can process up to 1 minute of speech audio data sent in a synchronous request. After Speech-to-Text processes and recognizes all of the audio, it returns a response.
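Given the 1-minute limit quoted above, it can help to check a file's duration before choosing a synchronous request. The following is a minimal sketch using only Python's standard `wave` module; the helper names and the `SYNC_LIMIT_SECONDS` constant are illustrative assumptions, not part of any Google library, and it only handles uncompressed WAV files.

```python
import wave

# Assumption: the 1-minute synchronous limit quoted in the text above.
SYNC_LIMIT_SECONDS = 60.0


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def fits_synchronous_request(path, limit=SYNC_LIMIT_SECONDS):
    """Return True if the audio is short enough for a synchronous request."""
    return wav_duration_seconds(path) <= limit
```

Longer files would need the asynchronous (long-running) request instead.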
I believe the other answer is now out of date. This is now possible with the Google Cloud Speech-to-Text API, which can return word time offsets: https://cloud.google.com/speech/docs/async-time-offsets
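As a rough illustration of the word time offsets feature linked above, here is a sketch that produces the `[word, start, end]` lists asked for in the question. It assumes the `google-cloud-speech` Python client (2.x style, where offsets come back as `datetime.timedelta` values), configured credentials, and a 16 kHz mono LINEAR16 WAV file; the function names and the `fr-FR` default are my own choices, and older client versions expose the offsets differently.

```python
def words_with_times(response):
    """Flatten a recognize() response into [word, start_s, end_s] lists."""
    timings = []
    for result in response.results:
        if not result.alternatives:
            continue
        best = result.alternatives[0]  # first alternative is the most likely
        for info in best.words:
            timings.append([
                info.word,
                info.start_time.total_seconds(),
                info.end_time.total_seconds(),
            ])
    return timings


def transcribe_with_offsets(wav_path, language_code="fr-FR", sample_rate_hertz=16000):
    """Send a short WAV file for recognition and return word timings."""
    # Imported here so the helper above works without the library installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    with open(wav_path, "rb") as audio_file:
        audio = speech.RecognitionAudio(content=audio_file.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate_hertz,  # assumed to match the file
        language_code=language_code,
        enable_word_time_offsets=True,  # ask for per-word start/end times
    )
    response = client.recognize(config=config, audio=audio)
    return words_with_times(response)
```

Calling `transcribe_with_offsets("numbers.wav", language_code="en-US")` (a hypothetical file name) would then return a list in the same shape as the example in the question.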
EDIT 2020: This is now possible, see the other answers.
When this answer was written, it was not possible with the Google API.
If you want word timestamps, you can use other APIs, for example:
Vosk-API - free offline speech recognition API (disclosure: I am the primary author of Vosk).
SpeechMatics SaaS speech recognition API
Speech Recognition API from IBM
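For the Vosk option, a minimal sketch of getting word timestamps might look like the following. It assumes the `vosk` Python package, a downloaded model directory (models exist for French as well as English), and a mono 16-bit PCM WAV file; the function names and the 4000-frame chunk size are my own choices.

```python
import json


def vosk_word_timings(result_json):
    """Convert one Vosk result JSON string into [word, start_s, end_s] lists."""
    data = json.loads(result_json)
    # The "result" key is only present when word output is enabled
    # and at least one word was recognized.
    return [[w["word"], w["start"], w["end"]] for w in data.get("result", [])]


def transcribe_wav(wav_path, model_dir):
    """Run Vosk over a mono 16-bit PCM WAV file and collect word timings."""
    # Imported here so the parser above works without Vosk installed.
    import wave
    from vosk import KaldiRecognizer, Model

    timings = []
    with wave.open(wav_path, "rb") as wav_file:
        recognizer = KaldiRecognizer(Model(model_dir), wav_file.getframerate())
        recognizer.SetWords(True)  # include per-word start/end times
        while True:
            chunk = wav_file.readframes(4000)
            if not chunk:
                break
            if recognizer.AcceptWaveform(chunk):
                timings.extend(vosk_word_timings(recognizer.Result()))
        timings.extend(vosk_word_timings(recognizer.FinalResult()))
    return timings
```

Because recognition runs offline, no audio leaves the machine and there is no per-request length limit to work around.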