I have MP3 audio files that contain voicemails that are left by a computer.
The message content is always in the same format and left by the same computer voice, with only a slight variation in content:
"You sold 4 cars today" (where the 4 can be anything from 0 to 9).
I have been trying to set up Sphinx, but the out-of-the-box models did not work very well.
I then tried to write my own acoustic model and haven't had much better success yet (30% unrecognized is my best).
I am wondering if voice recognition might be overkill for this task since I have exactly ONE voice, an expected audio pattern and a very limited dictionary that would need to be recognized.
I have access to each of the ten sounds (spoken numbers) that I would need to search for in the message.
Is there a non-VR approach to finding sounds in an audio file? (I can convert MP3 to another format if necessary.)
Update: My solution to this task follows
After working with Nikolay directly, I learned that the answer to my original question is irrelevant since the desired results may be achieved (with 100% accuracy) using Sphinx4 and a JSGF grammar.
1: Since the speech in my audio files is very limited, I created a JSGF grammar (salesreport.gram) to describe it. All of the information I needed to create the following grammar was available on this JSpeech Grammar Format page.
#JSGF V1.0;
grammar salesreport;
public <salesreport> = (<intro> | <sales> | <closing>)+;
<intro> = this is your automated automobile sales report;
<sales> = you sold <digit> cars today;
<closing> = thank you for using this system;
<digit> = zero | one | two | three | four | five | six | seven | eight | nine;
NOTE: Sphinx does not support JSGF tags in the grammar. If necessary, a regular expression may be used to extract specific information (the number of sales in my case).
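The regular-expression step mentioned in the note could look roughly like the sketch below. It assumes the recognizer returns the hypothesis as a plain string of words that follows the grammar above; the class and method names (SalesReportParser, extractSales) are hypothetical and not part of Sphinx4.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Sphinx4): extracts the digit from a recognized sentence.
final class SalesReportParser {
    // Digit words in the same order as the <digit> rule, so the list index is the numeric value.
    private static final List<String> DIGITS = Arrays.asList(
            "zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine");

    // Mirrors the <sales> rule of the grammar: "you sold <digit> cars today".
    private static final Pattern SALES = Pattern.compile(
            "you sold (" + String.join("|", DIGITS) + ") cars today");

    private SalesReportParser() {}

    // Returns the number of cars sold, or an empty OptionalInt if no sales sentence is found.
    static OptionalInt extractSales(String hypothesis) {
        Matcher m = SALES.matcher(hypothesis.toLowerCase(Locale.ROOT));
        if (!m.find()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(DIGITS.indexOf(m.group(1)));
    }
}
```

Returning an OptionalInt rather than a sentinel such as -1 makes the "no sales sentence recognized" case explicit to the caller.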
2: It is very important that your audio files are properly formatted. The default sample rate for Sphinx is 16 kHz (that is, 16000 samples are collected every second). I converted my MP3 audio files to WAV format using FFmpeg.
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
Unfortunately, FFmpeg renders this solution OS dependent. I am still looking for a way to convert the files using Java and will update this post if/when I find it.
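In the meantime, the FFmpeg call itself can be driven from Java. The sketch below does not remove the OS dependency: it assumes an ffmpeg executable is installed and on the PATH. The class and method names are hypothetical, and the -y flag (overwrite the output file without prompting) is an addition to the command shown above so that the child process cannot block on a prompt.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the ffmpeg command line; requires ffmpeg on the PATH.
final class FfmpegConverter {
    private FfmpegConverter() {}

    // Builds the same conversion command as above: 16-bit PCM, mono, 16 kHz.
    // "-y" is an added assumption: overwrite the output instead of prompting.
    static List<String> buildCommand(File input, File output) {
        return Arrays.asList(
                "ffmpeg", "-y", "-i", input.getPath(),
                "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000",
                output.getPath());
    }

    // Runs ffmpeg and waits for it to finish; throws if it reports a failure.
    static void convert(File input, File output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, output))
                .inheritIO() // show ffmpeg's own log output in the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with code " + exitCode);
        }
    }
}
```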
Although it was not required to complete this task, I found Audacity helpful for working with audio files. It includes many utilities for working with the audio files (checking sample rate and bandwidth, file format conversion, etc).
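The same format check can also be done in code with the standard javax.sound.sampled API. This is a minimal sketch with hypothetical names (WavFormatCheck, matches, isSphinxReady); it only verifies the container format produced by the FFmpeg command above (signed 16-bit little-endian PCM, mono, 16 kHz) and says nothing about the actual bandwidth of the recording.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical helper: checks that a WAV file has the format the ffmpeg command produces.
final class WavFormatCheck {
    private WavFormatCheck() {}

    // True if the format is signed 16-bit little-endian PCM, mono, sampled at 16 kHz.
    static boolean matches(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleRate() == 16000f
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 1
                && !format.isBigEndian();
    }

    // Reads only the file header and applies the check above.
    static boolean isSphinxReady(File wavFile)
            throws IOException, UnsupportedAudioFileException {
        return matches(AudioSystem.getAudioFileFormat(wavFile).getFormat());
    }
}
```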
3: Since telephone audio is band-limited (it is typically sampled at 8 kHz, so it carries no content above 4 kHz, regardless of the sample rate of the converted file), I used the Sphinx en-us-8khz acoustic model.
4: I generated my dictionary, salesreport.dic, using lmtool.
5: Using the files mentioned in the previous steps and the following code (modified version of Nikolay's example), my speech is recognized with 100% accuracy every time.
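// Imports needed by the method below.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public String parseAudio(File voiceFile) throws FileNotFoundException, IOException
{
    StringBuilder resultSB = new StringBuilder();

    // Point the recognizer at the acoustic model, dictionary and grammar from the steps above.
    Configuration configuration = new Configuration();
    configuration.setAcousticModelPath("file:acoustic_models/en-us-8khz");
    configuration.setDictionaryPath("file:salesreport.dic");
    configuration.setGrammarPath("file:salesreportResources/");
    configuration.setGrammarName("salesreport");
    configuration.setUseGrammar(true);

    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    try (InputStream stream = new FileInputStream(voiceFile))
    {
        recognizer.startRecognition(stream);
        SpeechResult result;
        // Collect every recognized utterance until the stream is exhausted.
        while ((result = recognizer.getResult()) != null)
        {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            resultSB.append(result.getHypothesis()).append(" ");
        }
        recognizer.stopRecognition();
    }
    return resultSB.toString().trim();
}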
The accuracy on such a task should be 100%. Here is a code sample to use with the grammar:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemoGrammar {
    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");

        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("file:en-us-8khz");
        configuration.setDictionaryPath("cmu07a.dic");
        configuration.setGrammarPath("file:./");
        configuration.setGrammarName("digits");
        configuration.setUseGrammar(true);

        StreamSpeechRecognizer recognizer =
            new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("file.wav"));
        recognizer.startRecognition(stream);

        // Print each recognized utterance until the stream is exhausted.
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n",
                result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}
You also need to make sure that both the sample rate and the audio bandwidth match the decoder configuration:
http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy