 

Find sound effect inside an audio file

I have a load of 3-hour MP3 files, and every ~15 minutes a distinct 1-second sound effect is played, which signals the beginning of a new chapter.

Is it possible to identify each time this sound effect is played, so I can note the time offsets?

The sound effect is similar every time, but because it's been encoded in a lossy file format, there will be a small amount of variation.

The time offsets will be stored in the ID3 chapter frame metadata.
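For reference, once the offsets are found, writing them out can be done with the mutagen library; a minimal sketch (the chapter titles and times below are placeholders):

import mutagen
from mutagen.id3 import ID3, CHAP, CTOC, CTOCFlags, TIT2

tags = ID3("source.mp3")

# one CHAP frame per detected chapter; times are in milliseconds
tags.add(CHAP(element_id="chp0", start_time=0, end_time=900000,
              sub_frames=[TIT2(text=["Chapter 1"])]))
tags.add(CHAP(element_id="chp1", start_time=900000, end_time=1800000,
              sub_frames=[TIT2(text=["Chapter 2"])]))

# a top-level CTOC frame listing the chapters in order
tags.add(CTOC(element_id="toc",
              flags=CTOCFlags.TOP_LEVEL | CTOCFlags.ORDERED,
              child_element_ids=["chp0", "chp1"],
              sub_frames=[TIT2(text=["Chapters"])]))

tags.save()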


Example source, where the sound effect plays twice:

ffmpeg -ss 0.9 -i source.mp3 -t 0.95 -acodec copy -y sample1.mp3

  • Sample 1 (Spectrogram)

ffmpeg -ss 4.5 -i source.mp3 -t 0.95 -acodec copy -y sample2.mp3

  • Sample 2 (Spectrogram)

I'm very new to audio processing, but my initial thought was to extract a sample of the 1-second sound effect, then use librosa in Python to extract a floating-point time series for both files, round the floating-point numbers, and try to get a match.

import numpy
import librosa

print("Load files")

source_series, source_rate = librosa.load('source.mp3')  # 3 hour file
sample_series, sample_rate = librosa.load('sample.mp3')  # 1 second file

print("Round series")

# round to reduce small floating-point differences between the files
source_series = numpy.around(source_series, decimals=5)
sample_series = numpy.around(sample_series, decimals=5)

print("Process series")

source_start = 0
sample_matching = 0
sample_length = len(sample_series)

# walk the source one sample at a time, advancing through the sample
# for as long as consecutive values match exactly
for source_id, source_sample in enumerate(source_series):
    if source_sample == sample_series[sample_matching]:
        sample_matching += 1
        if sample_matching >= sample_length:
            # full match found: print its start time in seconds
            print(float(source_start) / source_rate)
            sample_matching = 0
        elif sample_matching == 1:
            source_start = source_id
    else:
        sample_matching = 0

This does not work with the MP3 files above, but it did with an MP4 version: there it found the exact sample I had extracted, but only that one occurrence (not all 12).

I should also note this script takes just over 1 minute to process the 3-hour file (which contains 237,426,624 samples), so I can imagine that doing some kind of averaging on every loop iteration would make it considerably slower.

Asked Mar 05 '23 by Craig Francis

2 Answers

Trying to directly match waveform samples in the time domain is not a good idea. MP3 encoding will preserve the perceptual properties, but it is quite likely the phases of the frequency components will be shifted, so the sample values will not match.

You could instead try matching the volume envelopes of your effect and your sample. This is less likely to be affected by the MP3 encoding.

First, normalise your sample so the embedded effects are at the same level as your reference effect. Then construct new waveforms from the effect and the sample by taking the average of the peak values over time frames just short enough to capture the relevant features (better still, use overlapping frames). Finally, cross-correlate the two in the time domain.
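A rough sketch of that envelope idea (the file names, frame sizes, and threshold are assumptions to tune, not a definitive implementation):

import numpy as np
import librosa

# 'effect.wav' and 'source.mp3' are placeholder names
effect, sr = librosa.load('effect.wav', sr=None)
source, _ = librosa.load('source.mp3', sr=sr)

frame_length = 1024              # short enough to capture the effect's shape
hop_length = frame_length // 2   # overlapping frames

def peak_envelope(signal):
    frames = librosa.util.frame(signal, frame_length=frame_length, hop_length=hop_length)
    return np.abs(frames).max(axis=0)  # peak value per frame

effect_env = peak_envelope(effect)
source_env = peak_envelope(source)

# normalise both envelopes, then cross-correlate in the time domain
effect_env = effect_env / (effect_env.max() + 1e-10)
source_env = source_env / (source_env.max() + 1e-10)
corr = np.correlate(source_env, effect_env, mode='valid') / len(effect_env)

threshold = 0.5  # assumed; pick by inspecting corr on a test file
for i in np.where(corr > threshold)[0]:
    print(f"candidate at {i * hop_length / sr:.2f}s (score {corr[i]:.2f})")

Note that adjacent frames above the threshold will report the same event more than once; collapsing runs of hits into a single detection is left out for brevity.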

If this does not work, you could analyze each frame using an FFT, which gives you a feature vector per frame, and then try to find matches of the effect's feature sequence within the sample's. This is similar to jonnor's suggestion (https://stackoverflow.com/users/1967571/jonnor) below. MFCC is used in speech recognition, but since you are not detecting speech, a plain FFT is probably OK.
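A sketch of that frame-by-frame spectral matching, again with placeholder file names and an assumed similarity threshold:

import numpy as np
import librosa

effect, sr = librosa.load('effect.wav', sr=None)  # placeholder names
source, _ = librosa.load('source.mp3', sr=sr)

frame_length, hop_length = 2048, 1024

def fft_features(signal):
    frames = librosa.util.frame(signal, frame_length=frame_length, hop_length=hop_length)
    return np.abs(np.fft.rfft(frames, axis=0))  # magnitude spectrum per frame

E = fft_features(effect)  # shape: (bins, effect frames)
S = fft_features(source)  # shape: (bins, source frames)

# slide the effect's feature sequence across the source's,
# scoring each position by cosine similarity
n = E.shape[1]
e = E.flatten()
e = e / (np.linalg.norm(e) + 1e-10)
for i in range(S.shape[1] - n + 1):
    window = S[:, i:i + n].flatten()
    score = np.dot(window / (np.linalg.norm(window) + 1e-10), e)
    if score > 0.9:  # assumed threshold
        print(f"match near {i * hop_length / sr:.2f}s (score {score:.2f})")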

I am assuming the effect plays by itself (no background noise) and that it was added to the recording electronically (as opposed to being recorded via a microphone). If this is not the case, the problem becomes more difficult.

Answered Mar 07 '23 by Paul John Leonard


This is an Audio Event Detection problem. If the sound is always the same and there are no other sounds at the same time, it can probably be solved with a Template Matching approach, at least as long as there are no similar-sounding sounds with other meanings.

The simplest kind of template matching is to compute the cross-correlation between your input signal and the template.

  1. Cut out an example of the sound to detect (using Audacity). Take as much as possible, but avoid the start and end. Store this as a .wav file.
  2. Load the .wav template using librosa.load().
  3. Chop the input file into a series of overlapping frames. The length should be the same as your template. This can be done with librosa.util.frame.
  4. Iterate over the frames, and compute the cross-correlation between each frame and the template using numpy.correlate.
  5. High values of cross-correlation indicate a good match. A threshold can be applied to decide what is an event and what is not, and the frame number can be used to calculate the time of the event (see the sketch after this list).
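A minimal sketch of steps 2-5, assuming the template has been saved as 'template.wav' and with a made-up detection threshold:

import numpy as np
import librosa

template, sr = librosa.load('template.wav', sr=None)
source, _ = librosa.load('source.mp3', sr=sr)  # placeholder input name

frame_length = len(template)
hop_length = frame_length // 2  # overlapping frames

frames = librosa.util.frame(source, frame_length=frame_length, hop_length=hop_length)

# normalise the template once; normalising each frame below makes the
# score comparable across loud and quiet sections
t = (template - template.mean()) / (template.std() + 1e-10)

for i in range(frames.shape[-1]):
    f = frames[:, i]
    f = (f - f.mean()) / (f.std() + 1e-10)
    # equal-length 'valid' correlation is a single dot product
    score = np.correlate(f, t, mode='valid')[0] / frame_length
    if score > 0.5:  # assumed threshold; tune on the test files
        print(f"event near {i * hop_length / sr:.2f}s (score {score:.2f})")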

You should probably prepare some shorter test files which contain both examples of the sound to detect and other typical sounds.

If the volume of the recordings is inconsistent you'll want to normalize that before running detection.

If cross-correlation in the time domain does not work, you can compute a melspectrogram or MFCC features and cross-correlate those. If this does not yield OK results either, a machine learning model can be trained using supervised learning, but that requires labeling a bunch of data as event/not-event.
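For the feature-domain variant, a sketch using MFCCs (file names and the score cutoff are assumptions):

import numpy as np
import librosa

template, sr = librosa.load('template.wav', sr=None)  # placeholder names
source, _ = librosa.load('source.mp3', sr=sr)

hop = 512
T = librosa.feature.mfcc(y=template, sr=sr, hop_length=hop)
S = librosa.feature.mfcc(y=source, sr=sr, hop_length=hop)

# slide the template's MFCC sequence over the source's,
# scoring each position by cosine similarity
n = T.shape[1]
t = T.flatten()
t = t / (np.linalg.norm(t) + 1e-10)
scores = np.empty(S.shape[1] - n + 1)
for i in range(len(scores)):
    window = S[:, i:i + n].flatten()
    scores[i] = np.dot(window / (np.linalg.norm(window) + 1e-10), t)

for i in np.where(scores > 0.8)[0]:  # assumed cutoff
    print(f"candidate at {i * hop / sr:.2f}s (score {scores[i]:.2f})")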

Answered Mar 08 '23 by Jon Nordby