Java voice recognition for very small dictionary

I have MP3 audio files that contain voicemails that are left by a computer.

The message content is always in the same format and is left by the same computer voice, with only a slight variation in content:

"You sold 4 cars today" (where the 4 can be anything from 0 to 9).

I have been trying to set up Sphinx, but the out-of-the-box models did not work very well.

I then tried to write my own acoustic model and haven't had much better success yet (30% unrecognized is my best).

I am wondering if voice recognition might be overkill for this task since I have exactly ONE voice, an expected audio pattern and a very limited dictionary that would need to be recognized.

I have access to each of the ten sounds (spoken numbers) that I would need to search for in the message.

Is there a non-VR approach to finding sounds in an audio file? (I can convert MP3 to another format if necessary.)

Update: My solution to this task follows

After working with Nikolay directly, I learned that the answer to my original question is irrelevant since the desired results may be achieved (with 100% accuracy) using Sphinx4 and a JSGF grammar.

1: Since the speech in my audio files is very limited, I created a JSGF grammar (salesreport.gram) to describe it. All of the information I needed to create the following grammar was available on this JSpeech Grammar Format page.

#JSGF V1.0;

grammar salesreport;

public <salesreport> = (<intro> | <sales> | <closing>)+;

<intro> = this is your automated automobile sales report;

<sales> = you sold <digit> cars today;

<closing> = thank you for using this system;

<digit> = zero | one | two | three | four | five | six | seven | eight | nine;

NOTE: Sphinx does not support JSGF tags in the grammar. If necessary, a regular expression may be used to extract specific information (the number of sales in my case).
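A minimal sketch of that regular-expression step, assuming the recognizer returns lowercase hypothesis text that follows the <sales> rule above. The class and method names (SalesExtractor, extractSales) are hypothetical and not part of Sphinx.

```java
import java.util.Locale;
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the sales count out of a recognized hypothesis.
final class SalesExtractor {
    // Mirrors the <sales> rule of the grammar: "you sold <digit> cars today".
    private static final Pattern SALES = Pattern.compile(
        "you sold (zero|one|two|three|four|five|six|seven|eight|nine) cars today");

    // Spoken digit -> numeric value.
    private static final Map<String, Integer> DIGITS = Map.of(
        "zero", 0, "one", 1, "two", 2, "three", 3, "four", 4,
        "five", 5, "six", 6, "seven", 7, "eight", 8, "nine", 9);

    // Returns the number of cars sold, or empty if the phrase is not present.
    static OptionalInt extractSales(String hypothesis) {
        Matcher m = SALES.matcher(hypothesis.toLowerCase(Locale.ROOT));
        return m.find()
            ? OptionalInt.of(DIGITS.get(m.group(1)))
            : OptionalInt.empty();
    }
}
```

Returning an OptionalInt keeps the "phrase not found" case distinct from a legitimate result of zero sales.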

2: It is very important that your audio files are properly formatted. The default sample rate for Sphinx is 16 kHz (16 kHz means that 16,000 samples are collected every second). I converted my MP3 audio files to WAV format using FFmpeg:

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav

Unfortunately, FFmpeg renders this solution OS dependent. I am still looking for a way to convert the files using Java and will update this post if/when I find it.
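One way to at least drive the conversion from Java is to launch the same FFmpeg command with ProcessBuilder. This is only a sketch: it still depends on an ffmpeg executable being on the PATH, so it does not remove the OS dependency. The class name FfmpegConverter is hypothetical, and the -y flag (overwrite the output file without prompting) is an addition that was not in the command above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper that shells out to an installed ffmpeg binary.
final class FfmpegConverter {
    // Builds the command line: 16-bit signed PCM, mono, 16 kHz WAV output.
    static List<String> buildCommand(File mp3, File wav) {
        return List.of("ffmpeg", "-y", "-i", mp3.getPath(),
            "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", wav.getPath());
    }

    // Runs ffmpeg and fails loudly if it exits with a non-zero status.
    static void convert(File mp3, File wav) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(mp3, wav))
            .inheritIO() // show ffmpeg's own output on this process's console
            .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with status " + exitCode);
        }
    }
}
```

Usage would be along the lines of FfmpegConverter.convert(new File("input.mp3"), new File("output.wav")) before passing the WAV file to the recognizer.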

Although it was not required to complete this task, I found Audacity helpful for working with audio files. It includes many utilities for working with the audio files (checking sample rate and bandwidth, file format conversion, etc).
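The sample rate and channel layout can also be checked from Java with the standard javax.sound.sampled API. The sketch below assumes the target format is the one produced by the FFmpeg command in step 2 (16-bit signed little-endian PCM, mono, 16 kHz); the class and method names are hypothetical. It cannot tell you the audio bandwidth, only the container format.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical helper: verifies a WAV file is 16 kHz, 16-bit, mono PCM.
final class WavFormatCheck {
    // True if the format matches the output of the ffmpeg command in step 2.
    static boolean matches(AudioFormat format) {
        return format.getEncoding().equals(AudioFormat.Encoding.PCM_SIGNED)
            && format.getSampleRate() == 16000f
            && format.getSampleSizeInBits() == 16
            && format.getChannels() == 1
            && !format.isBigEndian();
    }

    // Reads only the file header; the audio data itself is not decoded.
    static boolean isExpectedFormat(File wav)
            throws IOException, UnsupportedAudioFileException {
        return matches(AudioSystem.getAudioFileFormat(wav).getFormat());
    }
}
```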

3: Since telephone audio is typically sampled at 8 kHz, which limits its bandwidth (the range of frequencies included in the audio) to at most 4 kHz, I used the Sphinx en-us-8khz acoustic model.

4: I generated my dictionary, salesreport.dic, using lmtool.

5: Using the files mentioned in the previous steps and the following code (modified version of Nikolay's example), my speech is recognized with 100% accuracy every time.

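import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public String parseAudio(File voiceFile) throws FileNotFoundException, IOException
{
    StringBuilder resultSB = new StringBuilder();

    Configuration configuration = new Configuration();

    // Acoustic model, dictionary and JSGF grammar from the previous steps.
    configuration.setAcousticModelPath("file:acoustic_models/en-us-8khz");
    configuration.setDictionaryPath("file:salesreport.dic");
    configuration.setGrammarPath("file:salesreportResources/");
    configuration.setGrammarName("salesreport");
    configuration.setUseGrammar(true);

    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    try (InputStream stream = new FileInputStream(voiceFile))
    {
        recognizer.startRecognition(stream);

        SpeechResult result;

        // getResult() returns null once the end of the stream is reached.
        while ((result = recognizer.getResult()) != null)
        {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            resultSB.append(result.getHypothesis()).append(" ");
        }

        recognizer.stopRecognition();
    }

    // Join all recognized utterances into one space-separated string.
    return resultSB.toString().trim();
}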
bigleftie asked Aug 26 '14 13:08



1 Answer

The accuracy on such a task should be 100%. Here is a code sample to use with the grammar:

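import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemoGrammar {

    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");

        Configuration configuration = new Configuration();

        // Use a JSGF grammar (digits.gram in the current directory)
        // instead of a statistical language model.
        configuration.setAcousticModelPath("file:en-us-8khz");
        configuration.setDictionaryPath("cmu07a.dic");
        configuration.setGrammarPath("file:./");
        configuration.setGrammarName("digits");
        configuration.setUseGrammar(true);

        StreamSpeechRecognizer recognizer =
            new StreamSpeechRecognizer(configuration);

        // Close the audio stream automatically when recognition is done.
        try (InputStream stream = new FileInputStream(new File("file.wav"))) {
            recognizer.startRecognition(stream);

            SpeechResult result;

            // getResult() returns null at the end of the stream.
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s\n",
                                  result.getHypothesis());
            }

            recognizer.stopRecognition();
        }
    }
}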

You also need to make sure that both the sample rate and the audio bandwidth match the decoder configuration:

http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy

Nikolay Shmyrev answered Oct 11 '22 16:10
