I've been working with Python speech recognition for the better part of a month now, making a JARVIS-like assistant. I've used the SpeechRecognition module with both the Google Speech API and Pocketsphinx, and I've used Pocketsphinx directly without another module. While the recognition is accurate, I've had a hard time working with the large amount of time these packages take to process speech. The way they seem to work is by recording from one point of silence to another, and then passing the recording to the STT engine. While the recording is being processed, no other sound can be recorded for recognition, which can be a problem if I'm trying to issue multiple complex commands in series.
When looking at the Google Assistant voice recognition, Alexa's voice recognition, or macOS High Sierra's offline recognition, I see words being recognized as I say them without any pause in the recording. I've seen this called realtime recognition, streaming recognition, and word-by-word recognition. Is there any way to do this in Python, preferably offline without using a client?
I tried (unsuccessfully) to accomplish this by changing pause threshold, speaking threshold, and non-speaking threshold for the SpeechRecognition recognizer, but that just caused the audio to segment strangely and still needed a second after each recognition before it could record again.
Most examples use recognize_google() to recognize speech; however, recognize_google() doesn't work without an internet connection.
Pocketsphinx can process streams; see Python pocketsphinx recognition from the microphone.
Kaldi can process streams too, and it is more accurate than Pocketsphinx:
https://github.com/alphacep/kaldi-websocket-python/blob/master/test_local.py
The Google Speech API can also process streams; see Google Streaming Speech Recognition on an Audio Stream Python.
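For the Pocketsphinx route, the linked answer relies on its LiveSpeech helper, which keeps listening while yielding recognized phrases. A minimal sketch, assuming the pocketsphinx-python package and its bundled default English model:

```python
SAMPLE_RATE = 16000  # the default acoustic models expect 16 kHz mono audio

def listen():
    # Imported here so the constant above is usable even without
    # pocketsphinx installed.
    from pocketsphinx import LiveSpeech

    # LiveSpeech reads from the default microphone and yields one
    # recognized phrase per detected utterance while it keeps listening.
    for phrase in LiveSpeech(sampling_rate=SAMPLE_RATE):
        print(phrase)

# Call listen() to start continuous recognition from the microphone.
```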
First of all, there is a Python library called VOSK. To install it on your computer, type this command:
pip3 install vosk
For more details, please visit:
https://alphacephei.com/vosk/install
Now we have to download a model. Go to this website, choose your preferred model, and download it:
https://alphacephei.com/vosk/models
Here I use "vosk-model-small-en-us-0.15" as my model.
After downloading, you'll see it is a compressed file. Unzip it in your project root folder, like this:
speech-recognition/
├─ vosk-model-small-en-us-0.15/ (unzipped model folder)
├─ offline-speech-recognition.py (Python file)
Here is the full code:
import json

import pyaudio
from vosk import Model, KaldiRecognizer

# Point this at your unzipped model folder
model = Model(r"C:\Users\User\Desktop\python practice\ai\vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)

# Open a mono, 16 kHz, 16-bit microphone stream
mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
                  input=True, frames_per_buffer=8192)
stream.start_stream()

while True:
    data = stream.read(4096)
    if recognizer.AcceptWaveform(data):
        # Result() returns a JSON string such as {"text": "hello world"}
        result = json.loads(recognizer.Result())
        print(f"' {result['text']} '")
For more detail, you can read this article I've written:
https://buddhi-ashen-dev.vercel.app/posts/offline-speech-recognition