In my previous question on finding a reference audio sample within a larger audio sample, it was suggested that I use convolution.
Using DSPUtil, I was able to do this. I experimented with it, trying different combinations of audio samples to see what the result looked like. To visualize the data, I dumped the raw audio as numbers into Excel and created a chart from these numbers. A peak is visible, but I don't really know how this helps me. I have these problems:
Any help is highly appreciated.
The following pictures are the result of the analysis using Excel:
UPDATE and solution:
Thanks to the extensive help of Han, I was able to achieve my goal.
After I rolled my own slow implementation without FFT, I found alglib which provides a fast implementation.
There is one basic assumption to my problem: One of the audio samples is contained completely within the other.
So, the following code returns the offset in samples in the larger of the two audio samples and the normalized cross-correlation value at that offset. 1 means complete correlation, 0 means no correlation at all and -1 means complete negative correlation:
// Needs: using System; using System.Collections.Generic; using System.Linq;
// plus a reference to ALGLIB.
private void CalcCrossCorrelation(IEnumerable<double> data1,
                                  IEnumerable<double> data2,
                                  out int offset,
                                  out double maximumNormalizedCrossCorrelation)
{
    var data1Array = data1.ToArray();
    var data2Array = data2.ToArray();
    double[] result;
    alglib.corrr1d(data1Array, data1Array.Length,
                   data2Array, data2Array.Length, out result);

    var max = double.MinValue;
    var index = 0;
    var i = 0;
    // Find the maximum cross correlation value and its index
    foreach (var d in result)
    {
        if (d > max)
        {
            index = i;
            max = d;
        }
        ++i;
    }

    // corrr1d stores the lags 0..N-1 in result[0..N-1] and the negative lags
    // in result[N..N+M-2] (N = data1 length, M = data2 length), so an index
    // of N or more has to be interpreted as the negative lag index - (N + M - 1)
    if (index >= data1Array.Length)
    {
        index -= data1Array.Length + data2Array.Length - 1;
    }

    IEnumerable<double> matchingData1;
    IEnumerable<double> matchingData2;
    var smallerSequenceCount = Math.Min(data1Array.Length, data2Array.Length);

    if (index >= 0)
    {
        // data2 starts 'index' samples into data1
        offset = index;
        matchingData1 = data1Array.Skip(offset).Take(smallerSequenceCount).ToList();
        matchingData2 = data2Array.Take(smallerSequenceCount).ToList();
    }
    else
    {
        // data1 starts '-index' samples into data2
        offset = -index;
        matchingData1 = data1Array.Take(smallerSequenceCount).ToList();
        matchingData2 = data2Array.Skip(offset).Take(smallerSequenceCount).ToList();
    }

    // Normalize by the energy of the two overlapping sections. 'max' is a raw
    // sum of products, so this assumes the means are close to 0 (typical for audio).
    var mx = matchingData1.Average();
    var my = matchingData2.Average();
    var denom1 = Math.Sqrt(matchingData1.Sum(x => (x - mx) * (x - mx)));
    var denom2 = Math.Sqrt(matchingData2.Sum(y => (y - my) * (y - my)));
    maximumNormalizedCrossCorrelation = max / (denom1 * denom2);
}
BOUNTY:
No new answers required! I started the bounty to award it to Han for his continued effort with this question!
Here we go for the bounty :)
To find a particular reference signal in a larger audio fragment, you need to use a cross-correlation algorithm. The basic formulae can be found in this Wikipedia article.
Cross-correlation is a process by which two signals are compared. This is done by multiplying the two signals sample by sample and summing the products over all samples. Then one of the signals is shifted (usually by 1 sample), and the calculation is repeated. If you try to visualize this for very simple signals such as a single impulse (e.g. 1 sample has a certain value while the remaining samples are zero), or a pure sine wave, you will see that the result of the cross-correlation is indeed a measure of how much both signals are alike and of the delay between them. Another article that may provide more insight can be found here.
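To make the multiply, sum and shift steps concrete, here is a minimal time-domain sketch. The class and method names (CrossCorrelationSketch, Correlate, IndexOfMax) are made up for illustration, and the sketch assumes the pattern is no longer than the signal and only tries non-negative shifts. It is slow (signal length times pattern length multiplications) and is meant to show the idea, not to replace an FFT-based routine:

```csharp
using System;

public static class CrossCorrelationSketch
{
    // Raw (unnormalized) cross-correlation for non-negative lags only:
    // result[lag] = sum over j of pattern[j] * signal[lag + j].
    // Assumption: the pattern fits inside the signal.
    public static double[] Correlate(double[] signal, double[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > signal.Length)
        {
            throw new ArgumentException("pattern must be non-empty and fit inside signal");
        }

        int lagCount = signal.Length - pattern.Length + 1;
        var result = new double[lagCount];
        for (int lag = 0; lag < lagCount; lag++)
        {
            // Multiply the aligned samples and sum the products
            double sum = 0.0;
            for (int j = 0; j < pattern.Length; j++)
            {
                sum += pattern[j] * signal[lag + j];
            }
            result[lag] = sum;
        }
        return result;
    }

    // Index of the largest value, i.e. the lag with the strongest match.
    public static int IndexOfMax(double[] values)
    {
        int best = 0;
        for (int i = 1; i < values.Length; i++)
        {
            if (values[i] > values[best])
            {
                best = i;
            }
        }
        return best;
    }
}
```

For the signal { 0, 0, 1, 2, 1, 0 } and the pattern { 1, 2, 1 }, Correlate gives one value per lag 0 to 3, and IndexOfMax picks lag 2, which is where the pattern sits in the signal.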
This article by Paul Bourke also contains source code for a straightforward time-domain implementation. Note that the article is written for a general signal. Audio has the special property that the long-term average is usually 0. This means that the averages used in Paul Bourke's formula (mx and my) can be left out. There are also fast implementations of the cross-correlation based on the FFT (see ALGLIB).
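For reference, this is the normalized cross-correlation at delay d as I understand it from that article, with x and y the two signals and m_x, m_y their means (the mx and my mentioned above). The second line is what remains when both means are taken as 0. The index convention y(i - d) is an assumption on my part, so check it against the article before relying on it:

```latex
r(d) = \frac{\sum_i \bigl(x(i) - m_x\bigr)\,\bigl(y(i-d) - m_y\bigr)}
            {\sqrt{\sum_i \bigl(x(i) - m_x\bigr)^2}\;\sqrt{\sum_i \bigl(y(i-d) - m_y\bigr)^2}}

r(d) \approx \frac{\sum_i x(i)\,y(i-d)}
                  {\sqrt{\sum_i x(i)^2}\;\sqrt{\sum_i y(i-d)^2}}
\qquad \text{when } m_x \approx m_y \approx 0
```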
The (maximum) value of the correlation depends on the sample values in the audio signals. In Paul Bourke's algorithm, however, the maximum is scaled to 1.0. In cases where one of the signals is contained entirely within the other, the maximum value will reach 1. In the more general case the maximum will be lower, and a threshold value will have to be determined to decide whether the signals are sufficiently alike.
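As a rough sketch of the scaling and the threshold decision: the code below computes a correlation coefficient between the pattern and the window of the signal starting at a given offset, then compares it with a threshold. The names (NormalizedCorrelationSketch, At, IsMatch) are hypothetical, and normalizing over just the overlapping window is one possible choice, not necessarily what Paul Bourke's code does. The threshold is something you would have to tune on your own recordings:

```csharp
using System;

public static class NormalizedCorrelationSketch
{
    // Correlation coefficient between 'pattern' and the window of 'signal'
    // that starts at 'offset'. The result lies between -1 and 1.
    // Assumption: offset + pattern.Length <= signal.Length.
    public static double At(double[] signal, double[] pattern, int offset)
    {
        int n = pattern.Length;

        // Means of the window and of the pattern
        double meanSignal = 0.0, meanPattern = 0.0;
        for (int j = 0; j < n; j++)
        {
            meanSignal += signal[offset + j];
            meanPattern += pattern[j];
        }
        meanSignal /= n;
        meanPattern /= n;

        // Sum of products and the two energies, all with the means removed
        double numerator = 0.0, energySignal = 0.0, energyPattern = 0.0;
        for (int j = 0; j < n; j++)
        {
            double s = signal[offset + j] - meanSignal;
            double p = pattern[j] - meanPattern;
            numerator += s * p;
            energySignal += s * s;
            energyPattern += p * p;
        }

        double denominator = Math.Sqrt(energySignal) * Math.Sqrt(energyPattern);
        // A constant (e.g. silent) window has no energy; report "no correlation"
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    // True when the pattern and the window are alike enough for the caller's threshold.
    public static bool IsMatch(double[] signal, double[] pattern, int offset, double threshold)
    {
        return At(signal, pattern, offset) >= threshold;
    }
}
```

With the signal { 0, 0, 1, 2, 1, 0 } and the pattern { 1, 2, 1 }, the value at offset 2 is 1 (up to rounding) because the window is an exact copy, while offset 0 gives a negative value and would be rejected by any positive threshold.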