How to find what time a part of audio starts and ends in another audio?

Question

I have two audio files in which the same sentence is read (like singing a song) by two different people, so the files have different lengths. They contain vocals only, with no instruments.

A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."

[Figure: structure of sample audio files]

I know the time at which every word starts and ends in A1, and I need to find automatically when every word starts and ends in A2. (Any language is fine, preferably Python or C#.)

The times are saved in XML, so I can split the A1 file by word. How can I find the sound of a word in another audio file where the word has a different duration and is spoken by a different voice?

Haris Nadeem · Accepted Answer

From what I read, it seems you want to use Dynamic Time Warping (DTW). I'll leave the full explanation to Wikipedia, but DTW aligns two sequences that vary in speed, and it is commonly used to match speech patterns despite differences in timing and pronunciation.

I am better versed in C, Java and Python than in C#, so I will suggest Python libraries:

fastdtw
pydtw
mlpy
rpy2

With rpy2 you can call R's libraries, including R's DTW implementation, from your Python code. Sadly, I couldn't find any good tutorials for this, but there are good examples if you choose to use R directly.
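To make the idea concrete, here is a minimal pure-Python sketch of classic DTW that also shows how known word boundaries in A1 could be mapped onto A2 through the warping path. It is an illustration under assumptions, not the API of any library listed above: the names `dtw_path` and `map_interval` are hypothetical, and the toy inputs are single numbers per frame. For real audio each frame would more likely be a feature vector (for example MFCCs) compared with a vector distance.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) DTW. Returns (total_cost, path),
    where path is a list of (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def map_interval(path, start_a, end_a):
    """Map a frame range [start_a, end_a] in A1 to the frame range
    in A2 that the warping path aligns it with."""
    js = [j for i, j in path if start_a <= i <= end_a]
    return min(js), max(js)


# Toy example: a2 is a1 spoken at half speed.
a1 = [0, 1, 2, 0]
a2 = [0, 0, 1, 1, 2, 2, 0, 0]
total, path = dtw_path(a1, a2)
word_in_a2 = map_interval(path, 1, 2)  # a "word" covering frames 1..2 of a1
```

Multiplying the returned frame indices by the frame length gives start and end times in A2. The quadratic cost is fine for a single sentence; for longer recordings an approximate implementation such as fastdtw may be more practical.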

Please let me know if that doesn't help, Cheers!

Martin Meli · Answer

My approach would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store these values in a list or array. I found a way of doing this in Java here: Decibel values at specific points in wav file. It is possible in other languages too. While doing so, keep track of the maximum volume:

max = 0
for each sample time x            // e.g. every 100 ms
{
  currentVolume = f(x)            // volume measured at time x
  if currentVolume > max
  {
    max = currentVolume
  }
}

Then divide the maximum volume by an adjustable divisor; in my example I used 7. If the maximum volume is 21 dB, then 21 / 7 = 3 dB. Let's call this value X.

Next, choose a second threshold, such as 1, and multiply it by X. Whenever the volume rises above this new value (1 * X), we consider that to be the start of a word. When it falls back below that value, we consider it to be the end of a word.
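The steps above can be sketched in Python roughly as follows. This is a sketch under assumptions: `find_segments` and its parameter names are hypothetical, and the volume readings are assumed to be non-negative loudness values like the ones in the example (dBFS readings are negative and would need to be shifted first).

```python
def find_segments(volumes, interval_ms=100, divisor=7, factor=1.0):
    """Return (start_ms, end_ms) pairs for stretches where the volume
    stays above factor * (max(volumes) / divisor)."""
    if not volumes:
        return []
    x = max(volumes) / divisor   # "X" in the description above
    threshold = factor * x
    segments = []
    start = None                 # start time of the word in progress
    for idx, volume in enumerate(volumes):
        t = idx * interval_ms
        if volume > threshold and start is None:
            start = t                      # rose above threshold: word starts
        elif volume <= threshold and start is not None:
            segments.append((start, t))    # fell back: word ends
            start = None
    if start is not None:
        # The recording ended while a word was still in progress.
        segments.append((start, len(volumes) * interval_ms))
    return segments


# One reading every 100 ms; max is 21, so X = 3 and the threshold is 3.
readings = [0, 1, 10, 21, 14, 2, 0, 0, 8, 9, 1]
segments = find_segments(readings)
```

Note that this detects stretches of sound separated by quiet gaps, so words spoken without a pause between them would be reported as one segment. Tuning `divisor` and `factor` trades off missed quiet words against background noise being detected as speech.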

[Figure: visual explanation]

Tags: pattern-matching, audio, audio-fingerprinting

Kadir Şahbaz
