Use convolution to find a reference audio sample in a continuous stream of sound

Question

in my previous question on finding a reference audio sample in a bigger audio sample, it was proposed, that I should use convolution.
Using DSPUtil, I was able to do this. I played a little with it and tried different combinations of audio samples, to see what the result was. To visualize the data, I just dumped the raw audio as numbers to Excel and created a chart using this numbers. A peak is visible, but I don't really know how this helps me. I have these problems:

I don't know, how to infer the starting position of the match in the original audio sample from the location of the peak.
I don't know, how I should apply this with a continuous stream of audio, so I can react, as soon as the reference audio sample occurs.
I don't understand, why picture 2 and picture 4 (see below) differ so much, although, both represent an audio sample convolved with itself...

Any help is highly appreciated.

The following pictures are the result of the analysis using Excel:

A longer audio sample with the reference audio (a beep) near the end:
The beep convolved with itself:
A longer audio sample without the beep convolved with the beep:
The longer audio sample of point 3 convolved with itself:

UPDATE and solution:
Thanks to the extensive help of Han, I was able to achieve my goal.
After I rolled my own slow implementation without FFT, I found alglib which provides a fast implementation. There is one basic assumption to my problem: One of the audio samples is contained completely within the other.
So, the following code returns the offset in samples in the larger of the two audio samples and the normalized cross-correlation value at that offset. 1 means complete correlation, 0 means no correlation at all and -1 means complete negative correlation:

private void CalcCrossCorrelation(IEnumerable<double> data1, 
                                  IEnumerable<double> data2, 
                                  out int offset, 
                                  out double maximumNormalizedCrossCorrelation)
{
    var data1Array = data1.ToArray();
    var data2Array = data2.ToArray();
    double[] result;
    alglib.corrr1d(data1Array, data1Array.Length, 
                   data2Array, data2Array.Length, out result);

    var max = double.MinValue;
    var index = 0;
    var i = 0;
    // Find the maximum cross correlation value and its index
    foreach (var d in result)
    {
        if (d > max)
        {
            index = i;
            max = d;
        }
        ++i;
    }
    // if the index is bigger than the length of the first array, it has to be
    // interpreted as a negative index
    if (index >= data1Array.Length)
    {
        index *= -1;
    }

    var matchingData1 = data1;
    var matchingData2 = data2;
    var biggerSequenceCount = Math.Max(data1Array.Length, data2Array.Length);
    var smallerSequenceCount = Math.Min(data1Array.Length, data2Array.Length);
    offset = index;
    if (index > 0)
        matchingData1 = data1.Skip(offset).Take(smallerSequenceCount).ToList();
    else if (index < 0)
    {
        offset = biggerSequenceCount + smallerSequenceCount + index;
        matchingData2 = data2.Skip(offset).Take(smallerSequenceCount).ToList();
        matchingData1 = data1.Take(smallerSequenceCount).ToList();
    }
    var mx = matchingData1.Average();
    var my = matchingData2.Average();
    var denom1 = Math.Sqrt(matchingData1.Sum(x => (x - mx) * (x - mx)));
    var denom2 = Math.Sqrt(matchingData2.Sum(y => (y - my) * (y - my)));
    maximumNormalizedCrossCorrelation = max / (denom1 * denom2);
}

BOUNTY:
No new answers required! I started the bounty to award it to Han for his continued effort with this question!

Han · Accepted Answer

Here we go for the bounty :)

To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.

Cross-correlation is a process by which 2 signals are compared. This is done by multiplying both signals and summing the results for all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure for for how much both signals are alike and the delay between them. Another article that may provide more insight can be found here.

This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-time average is usualy 0. This means that the averages used in Paul Bourkes formula (mx and my) can be left out. There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).

The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm however the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within another signal, the maximum value will reach 1. In the more general case the maximum will be lower and a threshold value will have to be determined to decide whether the signals are sufficiently alike.

Use convolution to find a reference audio sample in a continuous stream of sound

Tags:

c#

.net

signal-processing

convolution

cross-correlation

Daniel Hilgarth

1 Answers

Han

Recent Activity

Donate For Us

Use convolution to find a reference audio sample in a continuous stream of sound

Tags:

c#

.net

signal-processing

convolution

cross-correlation

Daniel Hilgarth

1 Answers

Han

Related questions

Recent Activity

Donate For Us