I'm trying to make beat detection using PC microphone and then with timestamp of beat calculate distance between multiple successive beats. I have chosen python because there is plenty of material available and it's quick to develop. By searching the internet I have come up with this simple code (no advanced peak detection or anything yet, this comes later if need be):
import pyaudio
import struct
import math
import time
SHORT_NORMALIZE = (1.0/32768.0)
def get_rms(block):
# RMS amplitude is defined as the square root of the
# mean over time of the square of the amplitude.
# so we need to convert this string of bytes into
# a string of 16-bit samples...
# we will get one short out for each
# two chars in the string.
count = len(block)/2
format = "%dh" % (count)
shorts = struct.unpack(format, block)
# iterate over the block.
sum_squares = 0.0
for sample in shorts:
# sample is a signed short in +/- 32768.
# normalize it to 1.0
n = sample * SHORT_NORMALIZE
sum_squares += n*n
return math.sqrt(sum_squares / count)
CHUNK = 32
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
channels=CHANNELS,
rate=RATE,
input=True,
frames_per_buffer=CHUNK)
elapsed_time = 0
prev_detect_time = 0
while True:
data = stream.read(CHUNK)
amplitude = get_rms(data)
if amplitude > 0.05: # value set by observing graphed data captured from mic
elapsed_time = time.perf_counter() - prev_detect_time
if elapsed_time > 0.1: # guard against multiple spikes at beat point
print(elapsed_time)
prev_detect_time = time.perf_counter()
def close_stream():
stream.stop_stream()
stream.close()
p.terminate()
The code works pretty good in silence, and I have been pretty satisfied the first two moments I ran it, but then I tried how accurate it was and I was a little bit less satisfied. To test this I used two methods: phone with metronome set to 60bpm (emits tic toc sounds into microphone) and an Arduino hooked to a beeper, which is triggered at 1Hz rate by accurate Chronodot RTC. The beeper beeps into microphone, triggering a detection. With both methods results look similar (numbers represent distance between two beat detections in seconds):
0.9956681643835616
1.0056331689497717
0.9956100091324198
1.0058207853881278
0.9953449497716891
1.0052103013698623
1.0049350136986295
0.9859074337899543
1.004996383561644
0.9954095342465745
1.0061518904109583
0.9953025753424658
1.0051235068493156
1.0057199634703196
0.984839305936072
1.00610396347032
0.9951862648401821
1.0053146301369864
0.9960100821917806
1.0053391780821919
0.9947373881278523
1.0058608219178105
1.0056580091324214
0.9852110319634697
1.0054473059360731
0.9950465753424638
1.0058237077625556
0.995704694063928
1.0054566575342463
0.9851026118721435
1.0059882374429243
1.0052523835616398
0.9956161461187207
1.0050863926940607
0.9955758173515932
1.0058052968036577
0.9953960913242028
1.0048014611872205
1.006336876712325
0.9847434520547935
1.0059712876712297
Now I'm pretty confident that at least Arduino is accurate to 1 msec (which is targeted accuracy). The results tend to be off by +- 5msec, but now and then even 15ms, which is unacceptable. Is there a way to achieve greater accuracy or is this limitation of python / soundcard / something else? Thank you!
EDIT: After incorporating tom10 and barny's suggestions into the code, the code looks like this:
import pyaudio
import struct
import math
import psutil
import os
def set_high_priority():
p = psutil.Process(os.getpid())
p.nice(psutil.HIGH_PRIORITY_CLASS)
SHORT_NORMALIZE = (1.0/32768.0)
def get_rms(block):
# RMS amplitude is defined as the square root of the
# mean over time of the square of the amplitude.
# so we need to convert this string of bytes into
# a string of 16-bit samples...
# we will get one short out for each
# two chars in the string.
count = len(block)/2
format = "%dh" % (count)
shorts = struct.unpack(format, block)
# iterate over the block.
sum_squares = 0.0
for sample in shorts:
# sample is a signed short in +/- 32768.
# normalize it to 1.0
n = sample * SHORT_NORMALIZE
sum_squares += n*n
return math.sqrt(sum_squares / count)
CHUNK = 4096
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RUNTIME_SECONDS = 10
set_high_priority()
p = pyaudio.PyAudio()
stream = p.open(format=FORMAT,
channels=CHANNELS,
rate=RATE,
input=True,
frames_per_buffer=CHUNK)
elapsed_time = 0
prev_detect_time = 0
TIME_PER_CHUNK = 1000 / RATE * CHUNK
SAMPLE_GROUP_SIZE = 32 # 1 sample = 2 bytes, group is closest to 1 msec elapsing
TIME_PER_GROUP = 1000 / RATE * SAMPLE_GROUP_SIZE
for i in range(0, int(RATE / CHUNK * RUNTIME_SECONDS)):
data = stream.read(CHUNK)
time_in_chunk = 0
group_index = 0
for j in range(0, len(data), (SAMPLE_GROUP_SIZE * 2)):
group = data[j:(j + (SAMPLE_GROUP_SIZE * 2))]
amplitude = get_rms(group)
amplitudes.append(amplitude)
if amplitude > 0.02:
current_time = (elapsed_time + time_in_chunk)
time_since_last_beat = current_time - prev_detect_time
if time_since_last_beat > 500:
print(time_since_last_beat)
prev_detect_time = current_time
time_in_chunk = (group_index+1) * TIME_PER_GROUP
group_index += 1
elapsed_time = (i+1) * TIME_PER_CHUNK
stream.stop_stream()
stream.close()
p.terminate()
With this code I achieved the following results (units are this time milliseconds instead of seconds):
999.909297052154
999.9092970521542
999.9092970521542
999.9092970521542
999.9092970521542
1000.6349206349205
999.9092970521551
999.9092970521524
999.9092970521542
999.909297052156
999.9092970521542
999.9092970521542
999.9092970521524
999.9092970521542
Which, if I didn't make any mistake, looks a lot better than before and has achieved sub-millisecond accuracy. I thank tom10 and barny for their help.
The reason you're not getting the right timing for the beats is that you're missing chunks of the audio data. That is, the chunks are being read by the soundcard, but you're not collecting the data before it's overwritten with the next chunk.
First, though, for this problem you need to distinguish between the ideas of timing accuracy and real-time response.
The timing accuracy of a sound card should be very good, much better than a ms, and you should be able to capture all of this accuracy in the data you read from the soundcard. The real-time responsiveness of your computer's OS should be very bad, much worse than a ms. That is, you should easily be able to identify audio events (such as beats) to within a ms, but not identify them at the time they happen (instead, 30-200ms later depending on your system). This arrangement usually works for computers because general human perception of the timing of events is much greater than a ms (except for rare specialized percepetual systems, like comparing auditory events between the two ears, etc).
The specific problem with your code is that CHUNKS is much too small for the OS to query the sound card at each sample. You have it at 32, so at 44100Hz, the OS needs to get to the sound card every 0.7ms, which is too short of a time for a computer that's tasked with doing many other things. If you OS doesn't get the chunk before the next one comes in, the original chunk is overwritten and lost.
To get this working so it's consistent with the constraints above, make CHUNKS much larger than 32, and more like 1024 (as in the PyAudio examples). Depending on your computer and what it's doing, even that my not be long enough.
If this type of approach won't work for you, you will probably need a dedicated real-time system like an Arduino. (Generally, though, this isn't necessary, so think twice before you decide that you need to use the Arduino. Usually, when I've seen people need true real-time it's when trying to do something very quantitave interactive with the human, like flash a light, have the person tap a button, flash another light, have the person tap another button, etc, to measure response times.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With