I've been working with Python speech recognition for the better part of a month now, making a JARVIS-like assistant. I've used the SpeechRecognition module with both the Google Speech API and Pocketsphinx, and I've used Pocketsphinx directly without another module. While the recognition is accurate, I've had a hard time working with the large amount of time these packages take to process speech. The way they seem to work is by recording from one point of silence to another, and then passing the recording to the STT engine. While the recording is being processed, no other sound can be recorded for recognition, which can be a problem if I'm trying to issue multiple complex commands in series.
When looking at the Google Assistant voice recognition, Alexa's voice recognition, or macOS High Sierra's offline recognition, I see words being recognized as I say them without any pause in the recording. I've seen this called realtime recognition, streaming recognition, and word-by-word recognition. Is there any way to do this in Python, preferably offline without using a client?
I tried (unsuccessfully) to accomplish this by changing pause threshold, speaking threshold, and non-speaking threshold for the SpeechRecognition recognizer, but that just caused the audio to segment strangely and still needed a second after each recognition before it could record again.
Most examples use recognize_google() to recognize speech; however, recognize_google() doesn't work without an internet connection.
Pocketsphinx can process streams; see Python pocketsphinx recognition from the microphone.
Kaldi can process streams too, and it is more accurate than Pocketsphinx:
https://github.com/alphacep/kaldi-websocket-python/blob/master/test_local.py
The Google Speech API can also process streams; see Google Streaming Speech Recognition on an Audio Stream Python.
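For the Pocketsphinx route, the linked answer relies on its LiveSpeech helper, which keeps listening while yielding recognized phrases. A minimal sketch, assuming the pocketsphinx-python package and its bundled default English model:

```python
SAMPLE_RATE = 16000  # the default acoustic models expect 16 kHz mono audio

def listen():
    # Imported here so the constant above is usable even without
    # pocketsphinx installed.
    from pocketsphinx import LiveSpeech

    # LiveSpeech reads from the default microphone and yields one
    # recognized phrase per detected utterance while it keeps listening.
    for phrase in LiveSpeech(sampling_rate=SAMPLE_RATE):
        print(phrase)

# Call listen() to start continuous recognition from the microphone.
```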
First of all, there is a Python library called VOSK. To install it on your computer, type this command:
pip3 install vosk
For more details, please visit:
https://alphacephei.com/vosk/install
Now we have to download a model. Go to this website, choose your preferred model, and download it:
https://alphacephei.com/vosk/models
Here I use "vosk-model-small-en-us-0.15" as my model.
After downloading, you'll see it is a compressed file. Unzip it in your project root folder, like this:
speech-recognition/
├─ vosk-model-small-en-us-0.15/ (unzipped model folder)
├─ offline-speech-recognition.py (Python file)
Here is the full code:
import json

import pyaudio
from vosk import Model, KaldiRecognizer

# Point this at your unzipped model folder
model = Model(r"C:\Users\User\Desktop\python practice\ai\vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)

# Open a mono, 16 kHz, 16-bit microphone stream
mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
                  input=True, frames_per_buffer=8192)
stream.start_stream()

while True:
    data = stream.read(4096)
    if recognizer.AcceptWaveform(data):
        # Result() returns a JSON string such as {"text": "hello world"}
        result = json.loads(recognizer.Result())
        print(f"' {result['text']} '")
For more detail, you can read this article I've written:
https://buddhi-ashen-dev.vercel.app/posts/offline-speech-recognition