Text to speech with avatar lip sync, no plug-ins

Is there a JavaScript library or product that provides text-to-speech for animated, speaking avatars without relying on Flash or any other plug-in? The idea is that I type in text and the avatar's mouth moves as the audio is played.

The aim is a cross-browser, cross device, no-plugins, web-based talking chat avatar.

I looked at CrazyTalk, which seemed perfect, but sadly it turns out that it relies on the Unity engine.

I then started to think about rolling my own by combining an existing text-to-speech service with something that pulls phonemes out of the audio wave, and building my own dictionary that maps phonemes to canvas shapes. That doesn't really seem to exist either (and even if it did, I'm not sure how I would time the mouth movement to the audio).

It's 2015. I feel like something like this should already exist and I shouldn't be trying to invent it.

Edit: Now I'm looking into Microsoft.Speech. I really need something that spits out something like IPA in syllables, and I'm not sure if MS.Speech does that. TTS wave creation is the easy part. I could send text to the server and match phonetic syllables to mouth point coordinates... if I could just get those syllables broken out. What breaks text into phonetic syllables?

user2245759 asked Mar 05 '15 17:03


2 Answers

You want to look at the Speech Synthesis API. The most basic use is:

var msg = new SpeechSynthesisUtterance('Hello World');
window.speechSynthesis.speak(msg);

http://updates.html5rocks.com/2014/01/Web-apps-that-talk---Introduction-to-the-Speech-Synthesis-API

https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#tts-section

Here is browser support: http://caniuse.com/web-speech. At the moment only Chrome & Safari support it.
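
On the timing question: the same API also fires boundary events on an utterance while it is being spoken, which might serve as a coarse cue for mouth movement. The following is only a sketch. `wordAt`, `speakWithBoundaries`, `onWord` and `onDone` are hypothetical names made up for illustration, and not every browser or voice fires boundary events, so treat this as something to test rather than rely on. It gives word-level timing at best, not phonemes.

```javascript
// Return the word that starts at charIndex, the character offset
// reported by a boundary event. Returns '' if no word starts there.
function wordAt(text, charIndex) {
  const match = /^\S+/.exec(text.slice(charIndex));
  return match ? match[0] : '';
}

// Browser only. Speaks the text and calls onWord (hypothetical callback)
// each time the engine reports that it has reached a new word.
function speakWithBoundaries(text, onWord, onDone) {
  const msg = new SpeechSynthesisUtterance(text);
  msg.onboundary = (event) => {
    // event.name is 'word' or 'sentence'; only words are of interest here.
    if (event.name === 'word') {
      onWord(wordAt(text, event.charIndex));
    }
  };
  msg.onend = onDone; // e.g. close the avatar's mouth
  window.speechSynthesis.speak(msg);
}
```

For example, `speakWithBoundaries('Hello World', w => console.log(w), () => console.log('done'))` would log each word as it is reached in browsers that support boundary events.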

rickyduck answered Oct 01 '22 00:10


I think I have an approach. In short, no, there does not appear to be an existing utility... Yet ;-)

I've decided to go with the Microsoft Speech Platform. It does better than returning phonemes: it provides the accompanying viseme IDs together with the audio position at which each one occurs. So I can generate a wav file and a viseme metadata list server-side and retrieve both. Now to figure out how to synchronize them.
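
One possible way to do the synchronization on the client is to let the audio element be the clock and look up the current viseme on every animation frame. This is a minimal sketch, not a finished solution. The JSON shape of the viseme list (`id`, `positionMs`), the sample values, and the names `visemeAt`, `syncMouth` and `drawMouth` are all assumptions made for illustration; the real list would be whatever the server serializes.

```javascript
// Hypothetical viseme list as the server might send it, sorted by time:
// each entry is a viseme ID and the audio position (ms) where it starts.
const visemes = [
  { id: 0, positionMs: 0 },
  { id: 12, positionMs: 120 },
  { id: 4, positionMs: 260 },
  { id: 0, positionMs: 480 },
];

// Binary search for the viseme active at timeMs, i.e. the last entry
// whose positionMs is <= timeMs. Returns null before the first entry.
function visemeAt(timeline, timeMs) {
  let lo = 0;
  let hi = timeline.length - 1;
  let found = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (timeline[mid].positionMs <= timeMs) {
      found = timeline[mid];
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Browser only. audio is an <audio> element playing the generated wav;
// drawMouth is a hypothetical function that renders one mouth shape.
function syncMouth(audio, timeline, drawMouth) {
  let lastId = null;
  function tick() {
    // audio.currentTime is in seconds; the list is in milliseconds.
    const v = visemeAt(timeline, audio.currentTime * 1000);
    if (v && v.id !== lastId) {
      lastId = v.id;
      drawMouth(v.id); // redraw only when the shape changes
    }
    if (!audio.paused && !audio.ended) {
      requestAnimationFrame(tick);
    }
  }
  audio.addEventListener('play', () => requestAnimationFrame(tick));
}
```

Reading `audio.currentTime` each frame, rather than scheduling timers from the list, keeps the mouth aligned even if playback is paused, seeked, or stalls while buffering.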

user2245759 answered Oct 01 '22 00:10