
Does TensorFlow Audio/Speech Recognition work with multi-word trigger keywords?

Related link: https://www.tensorflow.org/tutorials/sequences/audio_recognition

How should I modify my TensorFlow "Simple Audio Recognition" training environment (number of input samples, choice of trigger keywords, training parameters, etc.) to get robust recognition of a unique trigger keyword (multi-word or single-word) in a normal conversation?

The original TensorFlow "Simple Audio Recognition" comes with 10 single trigger keywords, each 1 second in duration. To avoid single trigger keywords being detected in normal conversation and causing false positives, I recorded the following two multi-word trigger keywords 400 times each (100 times by each of 4 different people), each 1.5 seconds in duration: PLAY MUSIC, STOP MUSIC. After following the exact same training steps and compensating for the new 1.5-second duration in the code, I am getting 100% recognition of these two multi-word trigger keywords when they are pronounced correctly. However, further testing shows that I am also getting false positives during normal speech whenever any single word of these trigger keywords is pronounced, e.g. STOP BLA BLA BLA, STOP VIDEO, PLAY BLA BLA BLA, PLAY VIDEO, etc.
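For reference, "compensating for the new 1.5-second duration" mainly means the model's input size changes with the clip length. A minimal sketch of that arithmetic, mirroring the kind of calculation the tutorial does when preparing model settings (the function name and defaults here are illustrative, not the tutorial's exact API):

```python
# Sketch: how the spectrogram input size scales when moving from
# 1.0 s to 1.5 s clips, assuming 16 kHz audio with a 30 ms analysis
# window and a 10 ms stride (typical values for this tutorial).

def spectrogram_shape(clip_duration_ms,
                      window_size_ms=30.0,
                      window_stride_ms=10.0,
                      sample_rate=16000):
    desired_samples = int(sample_rate * clip_duration_ms / 1000)
    window_size_samples = int(sample_rate * window_size_ms / 1000)
    window_stride_samples = int(sample_rate * window_stride_ms / 1000)
    length_minus_window = desired_samples - window_size_samples
    if length_minus_window < 0:
        spectrogram_length = 0
    else:
        spectrogram_length = 1 + length_minus_window // window_stride_samples
    return desired_samples, spectrogram_length

# 1.0 s clips -> 16000 samples, 98 spectrogram frames
# 1.5 s clips -> 24000 samples, 148 spectrogram frames
print(spectrogram_shape(1000))
print(spectrogram_shape(1500))
```

Every place in the training pipeline that assumes the 1-second frame count has to pick up the new value, otherwise the network input and the feature extractor disagree.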

Thank you for your kind response, PM

asked Feb 28 '26 by P. Montazemi


1 Answer

You should add garbage speech (unrelated words and background conversation, as an "unknown" class) to the training dataset; it is not clear whether you did that.
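A minimal sketch of what that means when building the training file list: anything that is not one of the wanted phrases gets routed into a catch-all unknown class, so the softmax has somewhere to put ordinary speech instead of forcing it onto a keyword. The label names and helper functions below are illustrative, not part of the tutorial's API:

```python
import random

WANTED = ["play_music", "stop_music"]

def label_for(clip_word):
    # Map every recording that is not a wanted phrase to a single
    # "_unknown_" class, so the model learns to reject ordinary speech.
    return clip_word if clip_word in WANTED else "_unknown_"

def build_training_list(clips, unknown_fraction=0.3, seed=0):
    """clips: list of (wav_path, spoken_word) pairs.
    Keeps all wanted-phrase clips and mixes in a fraction of
    unknown clips relative to the wanted count."""
    wanted = [(c, w) for c, w in clips if w in WANTED]
    unknown = [(c, "_unknown_") for c, w in clips if w not in WANTED]
    random.Random(seed).shuffle(unknown)
    n_unknown = int(len(wanted) * unknown_fraction)
    return wanted + unknown[:n_unknown]
```

Without such negative examples the network has never seen "stop video", so the closest class it knows is "stop music", which is exactly the false-positive pattern you describe.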

For longer phrases, it is more reliable to detect the smaller chunks and check that they are all present - i.e. to have a separate detector for "play" and another for "music".

For example, Google separately detects "ok" and "google" in their "ok google" hotword, as described in "Small-Footprint Keyword Spotting Using Deep Neural Networks".

answered Mar 02 '26 by Nikolay Shmyrev


