I am writing a program that recognizes speech. What it does is it records audio from the microphone and converts it to text using Sphinx. My problem is I want to start recording audio only when something is spoken by the user.
I experimented by reading the audio levels from the microphone and recording only when the level is above a particular value. But it ain't that effective. The program starts recording whenever it detects anything loud. This is the code I used
import audioop
import pyaudio as pa
import wav
class speech():
def __init__(self):
# soundtrack properties
self.format = pa.paInt16
self.rate = 16000
self.channel = 1
self.chunk = 1024
self.threshold = 150
self.file = 'audio.wav'
# intialise microphone stream
self.audio = pa.PyAudio()
self.stream = self.audio.open(format=self.format,
channels=self.channel,
rate=self.rate,
input=True,
frames_per_buffer=self.chunk)
def record(self)
while True:
data = self.stream.read(self.chunk)
rms = audioop.rms(data,2) #get input volume
if rms>self.threshold: #if input volume greater than threshold
break
# array to store frames
frames = []
# record upto silence only
while rms>threshold:
data = self.stream.read(self.chunk)
rms = audioop.rms(data,2)
frames.append(data)
print 'finished recording.... writing file....'
write_frames = wav.open(self.file, 'wb')
write_frames.setnchannels(self.channel)
write_frames.setsampwidth(self.audio.get_sample_size(self.format))
write_frames.setframerate(self.rate)
write_frames.writeframes(''.join(frames))
write_frames.close()
Is there a way I can differentiate between human voice and other noise in Python ? Hope somebody can find me a solution.
I think that your issue is that at the moment you are trying to record without recognition of the speech so it is not discriminating - recognisable speech is anything that gives meaningful results after recognition - so catch 22. You could simplify matters by looking for an opening keyword. You can also filter on voice frequency range as the human ear and the telephone companies both do and you can look at the mark space ratio - I believe that there were some publications a while back on that but look out - it varies from language to language. A quick Google can be very informative. You may also find this article interesting.
I think waht you are looking for is VAD (voice activity detection). VAD can be used for preprocessing speech for ASR. Here is some open-source project for implements of VAD link. May it help you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With