The Web Speech API specification says:
text attribute
This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. For speech synthesis engines that do not support SSML, or only support certain tags, the user agent or speech engine must strip away the tags they do not support and speak the text.
It does not provide an example of using text with an SSML document.
I tried the following in Chrome 33:
var msg = new SpeechSynthesisUtterance();
msg.text = '<?xml version="1.0"?>\r\n<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">ABCD</speak>';
speechSynthesis.speak(msg);
It did not work -- the voice attempted to narrate the XML tags. Is this code valid?
Do I have to provide an XMLDocument object instead?
I am trying to understand whether Chrome violates the specification (which should be reported as a bug), or whether my code is invalid.
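(Aside: since the spec says engines that don't support SSML should strip the tags, one workaround I have considered is stripping them myself before speaking. The sketch below is only an illustration of that idea; the ssmlToPlainText helper is my own name, not part of any API, and it simply discards the markup rather than honouring it.)
// Hypothetical workaround (not part of the Web Speech API): strip the SSML
// markup with DOMParser so engines that would otherwise narrate the tags
// only receive plain text. This loses the SSML semantics entirely.
function ssmlToPlainText(ssml) {
  var doc = new DOMParser().parseFromString(ssml, 'application/xml');
  // textContent concatenates the character data and drops the element tags.
  return doc.documentElement ? doc.documentElement.textContent : ssml;
}

var msg = new SpeechSynthesisUtterance();
msg.text = ssmlToPlainText('<?xml version="1.0"?>\r\n<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">ABCD</speak>');
speechSynthesis.speak(msg); // speaks "ABCD" instead of narrating the tags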
In Chrome 46, the XML is interpreted properly as an XML document, on Windows, when the language is set to en; however, I see no evidence that the tags are actually doing anything. I heard no difference between the <emphasis> and non-<emphasis> versions of this SSML:
var msg = new SpeechSynthesisUtterance();
msg.text = '<?xml version="1.0"?>\r\n<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US"><emphasis>Welcome</emphasis> to the Bird Seed Emporium. Welcome to the Bird Seed Emporium.</speak>';
msg.lang = 'en';
speechSynthesis.speak(msg);
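(Since the tags are ignored, the only knobs Chrome actually exposes are the utterance-level rate, pitch and volume properties. As a rough, purely illustrative approximation of <emphasis>, you can split the text into separate utterances and queue them with different settings; the splitting below is hand-rolled, not anything the API does for you.)
// Rough approximation only: emulate <emphasis>Welcome</emphasis> by queuing
// the emphasized word as its own utterance with a slower rate and higher pitch.
var parts = [
  { text: 'Welcome', rate: 0.8, pitch: 1.4 },                       // "emphasized"
  { text: 'to the Bird Seed Emporium.', rate: 1.0, pitch: 1.0 },
  { text: 'Welcome to the Bird Seed Emporium.', rate: 1.0, pitch: 1.0 }
];
parts.forEach(function (part) {
  var u = new SpeechSynthesisUtterance(part.text);
  u.lang = 'en';
  u.rate = part.rate;   // allowed range 0.1–10, default 1
  u.pitch = part.pitch; // allowed range 0–2, default 1
  speechSynthesis.speak(u); // utterances are queued and spoken in order
});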
The <phoneme> tag was also completely ignored, which made my attempt to speak IPA fail.
var msg = new SpeechSynthesisUtterance();
msg.text='<?xml version="1.0" encoding="ISO-8859-1"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> Pavlova is a meringue-based dessert named after the Russian ballerina Anna Pavlova. It is a meringue cake with a crisp crust and soft, light inside, usually topped with fruit and, optionally, whipped cream. The name is pronounced <phoneme alphabet="ipa" ph="pævˈloʊvə">...</phoneme> or <phoneme alphabet="ipa" ph="pɑːvˈloʊvə">...</phoneme>, unlike the name of the dancer, which was <phoneme alphabet="ipa" ph="ˈpɑːvləvə">...</phoneme> </speak>';
msg.lang = 'en';
speechSynthesis.speak(msg);
This is despite the fact that the Microsoft speech API does handle SSML correctly. Here is a C# snippet, suitable for use in LINQPad:
// LINQPad: reference the System.Speech assembly and import the
// System.Speech.Synthesis and System.Text.RegularExpressions namespaces
// (F4 → query properties). .Dump() is LINQPad's output helper.
var str = "Pavlova is a meringue-based dessert named after the Russian ballerina Anna Pavlova. It is a meringue cake with a crisp crust and soft, light inside, usually topped with fruit and, optionally, whipped cream. The name is pronounced /pævˈloʊvə/ or /pɑːvˈloʊvə/, unlike the name of the dancer, which was /ˈpɑːvləvə/.";
var regex = new Regex("/([^/]+)/");
if (regex.IsMatch(str))
{
    // Wrap each /.../ IPA transcription in an SSML <phoneme> tag.
    str = regex.Replace(str, "<phoneme alphabet=\"ipa\" ph=\"$1\">word</phoneme>");
    str.Dump();
}
SpeechSynthesizer synth = new SpeechSynthesizer();
PromptBuilder pb = new PromptBuilder();
pb.AppendSsmlMarkup(str);
synth.Speak(pb);
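If building the SSML string by hand feels fragile, PromptBuilder also exposes an AppendTextWithPronunciation method that takes the display text and an IPA string directly, which should achieve the same <phoneme> effect without the regex substitution; the approach above just keeps the original prose intact.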