 

Algorithm for concatenating speech audio to sound continuous?

I'm building a simple program that speaks phone numbers in a human voice.

For that I pre-recorded each digit (with different intonations), and when I get a number I join the audio files and play them together with some silence added between the numbers.
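For reference, the naive approach described above amounts to roughly the following minimal sketch. It assumes each recording has already been decoded to a list of float samples at one common sample rate (file I/O is omitted), and the names `concat_with_silence` and `gap_ms` are hypothetical. Every join here is a hard cut into and out of silence, with no regard for pitch or level at the boundary.

```python
from typing import List, Sequence


def concat_with_silence(clips: Sequence[Sequence[float]], sample_rate: int, gap_ms: float) -> List[float]:
    """Join digit clips end to end, inserting gap_ms of zero-valued samples between consecutive clips."""
    gap = [0.0] * int(round(sample_rate * gap_ms / 1000.0))
    out: List[float] = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.extend(gap)  # silence only *between* clips, not before the first one
        out.extend(clip)
    return out


# Three tiny "digits" at 1 kHz with 3 ms gaps (3 samples of silence each)
print(concat_with_silence([[0.1, 0.2], [0.3], [0.4, 0.5]], sample_rate=1000, gap_ms=3))
# prints [0.1, 0.2, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.4, 0.5]
```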

However, this doesn't sound smooth or natural.

I tried to do gain and tempo normalization on the files but it feels like I need to join them in some "smart" way so that the transition will sound natural.
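The gain normalization mentioned above can be as simple as scaling each clip to a common RMS level. This is a minimal sketch assuming float samples in [-1, 1]; `target_rms = 0.1` is an arbitrary illustrative value, the function names are hypothetical, and tempo normalization is not covered. On its own it only equalizes loudness and does nothing about pitch or intonation at the joins.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples: Sequence[float], target_rms: float = 0.1) -> List[float]:
    """Scale a clip so its RMS level equals target_rms.

    Silent clips are returned unchanged. No peak limiting is applied, so a
    large gain can push samples outside [-1, 1].
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    return [s * gain for s in samples]
```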

I looked for some algorithms to do that but didn't find anything.

Is there a known method for that?

Thanks.

asked Oct 25 '17 04:10 by Ran


1 Answer

The algorithm is called PSOLA (Pitch-Synchronous Overlap-Add). There are variations such as TD-PSOLA (Time-Domain PSOLA).
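For a fully voiced stretch of audio, the core of TD-PSOLA can be sketched in a few dozen lines. This is a simplified illustration, not a production implementation: it assumes the analysis pitch marks (one sample index per glottal pulse) are already known, which in practice needs a separate pitch-mark or epoch detector; it only scales pitch by a factor `beta` while keeping duration; and it ignores unvoiced sounds. All function names are hypothetical. Each two-period, Hann-windowed grain is cut around an analysis mark and overlap-added at new marks spaced `period / beta` apart.

```python
import bisect
import math
from typing import List, Sequence


def periodic_hann(n: int) -> List[float]:
    """Periodic Hann window of length n; copies hopped by n/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def local_period(marks: Sequence[int], i: int) -> int:
    """Pitch period (in samples) around analysis mark i, estimated from its neighbours."""
    if i == 0:
        p = marks[1] - marks[0]
    elif i == len(marks) - 1:
        p = marks[-1] - marks[-2]
    else:
        p = (marks[i + 1] - marks[i - 1]) // 2
    return max(1, p)


def nearest_mark_index(marks: Sequence[int], t: float) -> int:
    """Index of the analysis mark closest to time t (marks must be sorted)."""
    j = bisect.bisect_left(marks, t)
    if j == 0:
        return 0
    if j == len(marks):
        return len(marks) - 1
    return j if marks[j] - t < t - marks[j - 1] else j - 1


def synthesis_marks(marks: Sequence[int], beta: float) -> List[float]:
    """New pitch marks over the same time span, spaced by local period / beta."""
    if len(marks) < 2 or beta <= 0:
        raise ValueError("need at least two pitch marks and beta > 0")
    out: List[float] = []
    t = float(marks[0])
    while t <= marks[-1]:
        out.append(t)
        t += local_period(marks, nearest_mark_index(marks, t)) / beta
    return out


def td_psola(x: Sequence[float], marks: Sequence[int], beta: float) -> List[float]:
    """Scale pitch by beta (2.0 = one octave up) while keeping the duration of x."""
    y = [0.0] * len(x)
    for ts in synthesis_marks(marks, beta):
        i = nearest_mark_index(marks, ts)   # grain source: closest analysis mark in time
        p = local_period(marks, i)
        w = periodic_hann(2 * p)            # two-period window centred on the mark
        ta, centre = marks[i], int(round(ts))
        for k in range(-p, p):
            src, dst = ta + k, centre + k
            if 0 <= src < len(x) and 0 <= dst < len(y):
                y[dst] += w[k + p] * x[src]
    return y
```

With `beta = 1.0` and evenly spaced marks, the grains overlap-add back to the original signal between the first and last mark, because periodic Hann windows hopped by half their length sum to one; `beta > 1` packs the grains closer together, which raises the pitch.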

Overall, there is a lot involved here: deciding which units to join based on their acoustic properties, the intonation of the source recordings, and the required target intonation. All of this is fairly complex to implement, so it is better to use an existing open-source TTS system and synthesizer that already covers it. You can check Festvox or OpenMARY.
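Since the question mentions several takes per digit with different intonations, choosing which takes to string together can be framed as a shortest-path search: a target cost for how well a take fits its slot plus a join cost at each boundary, minimised with dynamic programming (Viterbi). This is a sketch under loud assumptions: the `Unit` fields, the cost definitions, and the take names are all hypothetical, and the join cost here is only the pitch jump at the boundary, whereas real unit-selection synthesizers combine several spectral and prosodic features.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Unit:
    """One recorded take of a digit (hypothetical feature set)."""
    digit: str
    name: str                 # e.g. the take's file name
    start_f0: float           # pitch at the first voiced frame, Hz
    end_f0: float             # pitch at the last voiced frame, Hz
    target_cost: float = 0.0  # mismatch with its slot (e.g. a rising take at the very end)


def join_cost(a: Unit, b: Unit) -> float:
    """Pitch jump in Hz across the boundary from take a to take b."""
    return abs(a.end_f0 - b.start_f0)


def select_units(number: str, inventory: Dict[str, List[Unit]]) -> Tuple[List[Unit], float]:
    """Viterbi search for the sequence of takes minimising total target + join cost."""
    if not number:
        raise ValueError("empty number")
    layers = [inventory[d] for d in number]
    costs = [u.target_cost for u in layers[0]]  # best cumulative cost ending at each take
    back: List[List[int]] = []                  # back[t][j]: best predecessor in layer t
    for prev, cur in zip(layers, layers[1:]):
        new_costs, ptrs = [], []
        for u in cur:
            best_j = min(range(len(prev)), key=lambda j: costs[j] + join_cost(prev[j], u))
            new_costs.append(costs[best_j] + join_cost(prev[best_j], u) + u.target_cost)
            ptrs.append(best_j)
        costs, back = new_costs, back + [ptrs]
    # Trace back from the cheapest take for the last digit
    j = min(range(len(costs)), key=costs.__getitem__)
    total, path = costs[j], [layers[-1][j]]
    for t in range(len(back) - 1, -1, -1):
        j = back[t][j]
        path.append(layers[t][j])
    path.reverse()
    return path, total


inventory = {
    "1": [Unit("1", "1_high", 220.0, 210.0), Unit("1", "1_low", 150.0, 140.0)],
    "2": [Unit("2", "2_high", 205.0, 200.0, target_cost=50.0), Unit("2", "2_low", 150.0, 120.0)],
}
path, total = select_units("12", inventory)
print([u.name for u in path], total)  # prints ['1_low', '2_low'] 10.0
```

The penalty on `2_high` stands in for "the final digit should fall"; without it, `1_high` followed by `2_high` (a 5 Hz jump) would win instead.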

answered Oct 23 '22 02:10 by Nikolay Shmyrev