
festival 2.4: why do some voices not work with singing mode?

voice_kal_diphone and voice_ral_diphone work correctly in singing mode (there's vocal output and the pitches are correct for the specified notes).

voice_cmu_us_ahw_cg and the other CMU voices do not work correctly: there's vocal output, but the pitch does not change according to the specified notes.

Is it possible to get correct output with the higher quality CMU voices?

The command line for working (pitch-affected) output is:

text2wave -mode singing -eval "(voice_kal_diphone)" -o song.wav song.xml

The command line for non-working (pitch-unaffected) output is:

text2wave -mode singing -eval "(voice_cmu_us_ahw_cg)" -o song.wav song.xml

Here's song.xml:

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
<SINGING BPM="60">
 <PITCH NOTE="A4,C4,C4"><DURATION BEATS="0.3,0.3,0.3">nationwide</DURATION></PITCH>
 <PITCH NOTE="C4"><DURATION BEATS="0.3">is</DURATION></PITCH>
 <PITCH NOTE="D4"><DURATION BEATS="0.3">on</DURATION></PITCH>
 <PITCH NOTE="F4"><DURATION BEATS="0.3">your</DURATION></PITCH>
 <PITCH NOTE="F4"><DURATION BEATS="0.3">side</DURATION></PITCH>
</SINGING>

You may also need this patch to singing-mode.scm:

@@ -339,7 +339,9 @@
 (defvar singing-max-short-vowel-length 0.11)

 (define (singing_do_initial utt token)
-  (if (equal? (item.name token) "")
+  (if (and
+        (not (equal? nil token))
+        (equal? (item.name token) ""))
       (let ((restlen (car (item.feat token 'rest))))
         (if singing-debug
             (format t "restlen %l\n" restlen))

To set up my environment I used the festvox fest_build script. You can also download voice_cmu_us_ahw_cg separately.

Beau asked Nov 10 '22 00:11


1 Answer

The problem seems to lie in how the phones are generated.

voice_kal_diphone uses the UniSyn synthesis model, while voice_cmu_us_ahw_cg uses the ClusterGen model. The latter has its own state-based intonation and duration model instead of phone-level intonation and duration; you may have noticed that the durations in the generated 'song' did not change either.

singing-mode.scm tries to extract each syllable and modify its frequency. With the ClusterGen model, the waveform generator simply ignores the syllable frequencies and durations set in Target, because it models them differently.

As a result, we get better voice quality (from the statistical model) but cannot change the frequency directly.
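To judge whether a given voice followed the score, it can help to know which targets the score implies. The following is a minimal Python sketch, not part of Festival: it assumes equal temperament with A4 = 440 Hz, assumes the PITCH-around-DURATION nesting used in song.xml above, and uses the hypothetical helper names note_to_hz and expected_targets. It only computes the frequencies and durations one would expect to hear; it does not reproduce what singing-mode.scm does internally.

```python
import xml.etree.ElementTree as ET

# Semitone offsets of the natural notes from C within one octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note: str) -> float:
    """Convert a note name such as 'A4' or 'C#4' to Hz (assumes A4 = 440 Hz)."""
    name, octave = note[:-1], int(note[-1])
    semitone = SEMITONES[name[0]]
    if name[1:] == "#":
        semitone += 1
    elif name[1:] == "b":
        semitone -= 1
    midi = 12 * (octave + 1) + semitone  # MIDI note number; A4 -> 69
    return 440.0 * 2 ** ((midi - 69) / 12)


def expected_targets(xml_text: str):
    """Return (word, note, Hz, seconds) for each note in a SINGING document.

    Assumes every PITCH element directly contains one DURATION element,
    as in the song.xml from the question.
    """
    root = ET.fromstring(xml_text)
    seconds_per_beat = 60.0 / float(root.get("BPM"))
    targets = []
    for pitch in root.iter("PITCH"):
        notes = pitch.get("NOTE").split(",")
        duration = pitch.find("DURATION")
        beats = [float(b) for b in duration.get("BEATS").split(",")]
        for note, beat in zip(notes, beats):
            targets.append(
                (duration.text, note,
                 round(note_to_hz(note), 2),
                 round(beat * seconds_per_beat, 3))
            )
    return targets
```

For the score in the question, expected_targets(open("song.xml").read()) gives 440.0 Hz for A4 and about 261.63 Hz for C4, each lasting 0.3 s at 60 BPM. A diphone voice should land near these values, while a ClusterGen voice would, per the explanation above, keep its own pitch and timing.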

A very good description of the generation pipeline can be found here.

avtomaton answered Dec 28 '22 14:12