Algorithm for concatenating speech audio to sound continuous?

Question

I'm building a simple program that speaks phone numbers in a human voice.

For that I pre-recorded each digit (with different intonations), and when I get a number I join the audio files and play them together with some silence added between the numbers.

However, this doesn't sound smooth or natural.

I tried to do gain and tempo normalization on the files but it feels like I need to join them in some "smart" way so that the transition will sound natural.

I looked for some algorithms to do that but didn't find anything.

Is there a known method for that?

Thanks.

Nikolay Shmyrev · Accepted Answer

The algorithm is called PSOLA (Pitch-Synchronous Overlap and Add). There are variations such as TD-PSOLA (time-domain PSOLA).

There is a lot more involved than the joining itself: you have to decide which recordings to join based on their acoustic properties, the source intonation, and the required target intonation. All of this is pretty complex to implement, so it is better to use an existing open source TTS system or synthesizer that already covers these things. You can check Festvox or OpenMary (MaryTTS).
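To give a feel for the "overlap and add" part, here is a minimal sketch of joining clips with a short linear crossfade instead of a hard cut. This is not PSOLA: PSOLA additionally aligns the overlapping windows to pitch marks so that the pitch periods line up, and it can modify pitch and duration. The function name `crossfade_concat`, the 20 ms default overlap, and the representation of each clip as a list of mono float samples at a shared sample rate are all assumptions made for illustration.

```python
def crossfade_concat(clips, sample_rate=16000, overlap_ms=20.0):
    """Join mono clips (sequences of float samples), blending each
    boundary with a linear crossfade instead of a hard cut.

    Assumption: all clips share the same sample rate. This is plain
    overlap-add, not PSOLA (no pitch-mark alignment is done).
    """
    overlap = int(sample_rate * overlap_ms / 1000)
    out = []
    for clip in clips:
        clip = list(clip)
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(clip))
        if n == 0:
            out = out + clip
            continue
        head, tail = out[:-n], out[-n:]
        blended = []
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming clip, rises toward 1
            blended.append(tail[i] * (1.0 - w) + clip[i] * w)
        out = head + blended + clip[n:]
    return out
```

With hypothetical per-digit recordings already decoded into sample lists (for example `digits["4"]`, `digits["2"]`), the call would be `crossfade_concat([digits["4"], digits["2"]], sample_rate=16000)`. Each join shortens the result by the overlap length, so any silence you want between digits has to be part of the clips themselves. A crossfade only removes clicks at the boundaries; it does not fix mismatched intonation, which is the harder part described above.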

Tags:

audio

speech

text-to-speech

Ran
