I am using this Google API:
https://www.google.com/speech-api/v2/recognize?output=json&lang="+ language_code+"&key="My key"
for speech recognition and it's working very well.
The issue is with numbers. If I say "one two three four", the result is 1234, and if I say "one thousand two hundred thirty four", the result is still 1234.
Another issue is with other languages. For example, the German word "elf" means eleven. If you say "elf", the result is 11 instead of elf.
I know we have no control over the API, but are there any parameters or hacks we can add to force it to return only words?
The response sometimes contains the correct result, but not always.
These are sample responses:
1) When I say "one two three four"
{"result":[{"alternative":[{"transcript":"1234","confidence":0.47215959},{"transcript":"1 2 3 4","confidence":0.25},{"transcript":"one two three four","confidence":0.25},{"transcript":"1 2 34","confidence":0.33333334},{"transcript":"1 to 34","confidence":1}],"final":true}],"result_index":0}
2) When I say "one thousand two hundred thirty four"
{"result":[{"alternative":[{"transcript":"1234","confidence":0.94247383},{"transcript":"1.254","confidence":1},{"transcript":"1284","confidence":1},{"transcript":"1244","confidence":1},{"transcript":"1230 4","confidence":1}],"final":true}],"result_index":0}
What I have done:
I check whether the result contains a digit. If it does, I split it into individual characters separated by spaces and check whether the same sequence appears among the alternatives in the response. For example, the result 1234 becomes 1 2 3 4; if a matching alternative exists, I use it and then convert it to words. In the second case there is no 1 2 3 4 alternative, so I stick with the original result.
This is the code:
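    // Needs java.util.regex.Pattern; jsonArray2 is the "alternative" array from the response.
    Pattern digitPattern = Pattern.compile("[0-9]");
    if (digitPattern.matcher(output).find()) {
        // Space-separate every character, e.g. "1234" -> "1 2 3 4".
        StringBuilder spaced = new StringBuilder();
        for (char c : output.toCharArray()) {
            spaced.append(c).append(' ');
        }
        String digits = spaced.toString().trim();
        // Start at index 1: look for another alternative spelled out digit by digit.
        for (int i = 1; i < jsonArray2.length(); i++) {
            String value = jsonArray2.getJSONObject(i).getString("transcript");
            if (digits.equals(value.trim())) {
                output = digits;
                break;
            }
        }
    }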
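The final step described above, converting the spaced digits to words, is not shown in the snippet. A minimal sketch of that step, assuming English digit names and a hypothetical helper class `DigitWords` (not part of any Google library), could look like this:

```java
class DigitWords {
    private static final String[] WORDS = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };

    /**
     * Converts a space-separated digit string such as "1 2 3 4"
     * to "one two three four". Tokens that are not a single digit
     * are kept unchanged.
     */
    static String toWords(String spacedDigits) {
        StringBuilder sb = new StringBuilder();
        for (String token : spacedDigits.trim().split("\\s+")) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            boolean singleDigit = token.length() == 1
                    && token.charAt(0) >= '0' && token.charAt(0) <= '9';
            sb.append(singleDigit ? WORDS[token.charAt(0) - '0'] : token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toWords("1 2 3 4")); // prints "one two three four"
    }
}
```

Multi-digit tokens such as 34 are deliberately left alone here, since the digits alone do not say how they were spoken.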
The issue is that when I say "thirteen four eight", this method splits 13 into "one three", so it is not a reliable solution.
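Since the first sample response does contain a words-only alternative, another option is to prefer any alternative that has no digits at all, and fall back to the top result otherwise. This is only a sketch: `WordsOnlyPicker` is a hypothetical helper, and it assumes the transcripts have already been pulled out of the JSON response into a list in their original order.

```java
import java.util.Arrays;
import java.util.List;

class WordsOnlyPicker {
    /**
     * Returns the first non-empty transcript that contains no digit
     * characters. If every transcript contains a digit, returns the
     * first (top) transcript, or "" for an empty list.
     */
    static String pickWordsOnly(List<String> transcripts) {
        for (String t : transcripts) {
            if (!t.isEmpty() && t.chars().noneMatch(Character::isDigit)) {
                return t;
            }
        }
        return transcripts.isEmpty() ? "" : transcripts.get(0);
    }

    public static void main(String[] args) {
        // Transcripts taken from the first sample response above.
        List<String> alternatives = Arrays.asList(
                "1234", "1 2 3 4", "one two three four", "1 2 34", "1 to 34");
        System.out.println(pickWordsOnly(alternatives)); // prints "one two three four"
    }
}
```

This does not help with the second sample response, where every alternative contains digits, so it would still return 1234 there.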
Update
I tried the new Cloud Speech API (https://cloud.google.com/speech/), and it is a little better than v2. The result for "one two three four" now comes back in words, and my workaround still works for it. But when I say "thirteen four eight", I get the same result as in v2.
Also, elf is still 11 in German.
I also tried speech_context, but that did not work either.
Take a look at this question and answer.
You can give the API "speech context" hints, like this:
"speech_context": {
"phrases":["zero", "one", "two", ... "nine", "ten", "eleven", ... "twenty", "thirty", ..., "ninety"]
}
I imagine this could work for other languages too, like German.
"speech_context": {
"phrases":["eins", "zwei", "drei", ..., "elf", "zwölf" ... ]
}
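To avoid typing the phrase list by hand, the fragment above could be generated from a list of number words. The following is a rough sketch: `SpeechContextBuilder` is a hypothetical helper, the field names `speech_context` and `phrases` are copied from the snippet above and may differ between API versions, and the result is only a fragment that still has to be placed inside the full request body.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SpeechContextBuilder {
    /** Builds the "speech_context" JSON fragment for the given hint phrases. */
    static String buildSpeechContext(List<String> phrases) {
        String joined = phrases.stream()
                // Escape backslashes and double quotes so each phrase is a valid JSON string.
                .map(p -> "\"" + p.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(", "));
        return "\"speech_context\": {\"phrases\": [" + joined + "]}";
    }

    public static void main(String[] args) {
        // A few German number words as hints; extend the list as needed.
        List<String> german = Arrays.asList("eins", "zwei", "drei", "elf");
        System.out.println(buildSpeechContext(german));
    }
}
```

Note that the question reports trying speech_context without success, so hints like these may not change how numbers are transcribed.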