
How to find what time a part of audio starts and ends in another audio?

I have two audio files in which a sentence is read (like singing a song) by two different people. So they have different lengths. They are just vocal, no instrument in it.

A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."

[Image: structure of sample audio files]

I know the time at which every word starts and ends in A1, and I need to find automatically when every word starts and ends in A2. (Any language, preferably Python or C#.)

The times are saved in XML, so I can split the A1 file by word. How can I find the sound of a word in another audio file where the word has a different duration and is spoken by a different voice?

Kadir Şahbaz asked Mar 21 '18 14:03


2 Answers

From what I read, it seems you want Dynamic Time Warping (DTW). I'll leave the full explanation to Wikipedia, but in short it aligns two sequences that differ in speed, which is why it is commonly used to match speech patterns despite differences in pace and pronunciation.

Sadly, I am better versed in C, Java and Python than in C#, so I will suggest Python libraries:

  1. fastdtw
  2. pydtw
  3. mlpy
  4. rpy2

With rpy2 you can call R from your Python code and use R's implementation of DTW. Sadly, I couldn't find any good tutorials for this, but there are good examples if you choose to use R directly.
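To make the idea concrete, here is a minimal pure-Python sketch of DTW, not tied to any of the libraries above. It assumes each audio file has already been reduced to one number per frame (for example, a loudness value); in practice you would more likely compare feature vectors such as MFCCs. The names `dtw_path` and `map_span` are made up for this illustration.

```python
def dtw_path(a, b):
    """Return the lowest-cost warping path between a and b as (i, j) index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def map_span(path, start_a, end_a):
    """Map an inclusive frame range in A1 to the aligned frame range in A2."""
    matched = [j for i, j in path if start_a <= i <= end_a]
    return matched[0], matched[-1]


# Toy per-frame values: A2 "reads" the same two words more slowly than A1.
a1 = [0, 5, 5, 0, 3, 3, 0]
a2 = [0, 0, 5, 5, 5, 5, 0, 0, 3, 3, 3, 0]
path = dtw_path(a1, a2)
word1 = map_span(path, 1, 2)  # first word occupies frames 1-2 in A1
word2 = map_span(path, 4, 5)  # second word occupies frames 4-5 in A1
```

Multiplying the resulting frame indices by the frame duration gives the word times in A2. The known word times from the XML for A1 would first need converting to frame indices in the same way.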

Please let me know if that doesn't help. Cheers!

Haris Nadeem answered Sep 29 '22 16:09


My approach would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store each reading in a list or array. I found a way of doing this in Java here: "Decibel values at specific points in wav file". It is possible in other languages too. While doing so, keep track of the maximum volume:

max_volume = 0
for current_volume in volumes:  # one dB reading per interval
    if current_volume > max_volume:
        max_volume = current_volume
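For the sampling step itself, a rough Python sketch might look like the following. It assumes the audio has already been decoded into a list of integer PCM amplitudes, and it measures dB relative to an amplitude of 1 so that silence comes out as 0 dB; both are assumptions of this example, and `window_volumes_db` is a made-up name.

```python
import math


def window_volumes_db(samples, sample_rate, window_ms=100):
    """Return one dB reading (RMS level) per window of window_ms milliseconds."""
    window_size = max(1, sample_rate * window_ms // 1000)
    volumes = []
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        # Clamp to 1 so that pure silence gives 0 dB instead of log10(0).
        volumes.append(20 * math.log10(max(rms, 1.0)))
    return volumes


# Synthetic signal at 8 kHz: 100 ms of silence, then 100 ms at amplitude 1000.
samples = [0] * 800 + [1000] * 800
volumes = window_volumes_db(samples, sample_rate=8000)
max_volume = max(volumes)
```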

Then divide the maximum volume by an adjustable threshold; in my example I went for 7. Say the maximum volume is 21 dB: 21 / 7 = 3 dB. Let's call this measure X.

We then pick a second threshold, such as 1, and multiply it by X. Whenever the volume rises above this new value (1 * X), we consider that to be the start of a word. When it drops back below that value, we consider it to be the end of the word.
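Put together, the thresholding could be sketched in Python like this. It assumes `volumes` holds one dB reading per 100 ms interval; `find_word_spans`, `divisor` and `multiplier` are names invented for this example.

```python
def find_word_spans(volumes, interval_ms=100, divisor=7, multiplier=1):
    """Return (start_ms, end_ms) pairs where the volume stays above the cutoff."""
    if not volumes:
        return []
    x = max(volumes) / divisor  # the measure X from the text
    cutoff = multiplier * x
    spans = []
    start = None
    for index, volume in enumerate(volumes):
        if volume > cutoff and start is None:
            start = index * interval_ms  # volume rose above the cutoff
        elif volume <= cutoff and start is not None:
            spans.append((start, index * interval_ms))  # volume fell back
            start = None
    if start is not None:
        # The recording ended while a word was still in progress.
        spans.append((start, len(volumes) * interval_ms))
    return spans


# Maximum is 21, so X = 3 and the cutoff is 1 * 3 = 3.
volumes = [0, 1, 10, 21, 15, 2, 0, 0, 12, 18, 4, 1]
spans = find_word_spans(volumes)
```

Both the divisor and the multiplier are worth tuning per recording, since words spoken without a pause between them will not drop below the cutoff.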

[Image: visual explanation]

Martin Meli answered Sep 29 '22 17:09