
How to get SSML <mark> timestamps from Google Cloud text-to-speech API

I want to use SSML markers through the Google Cloud Text-to-Speech API to request the timing of these markers in the audio stream. These timestamps are needed to provide cues for effects, word/section highlighting, and feedback to the user.

I found this question, which is relevant, although it refers to timestamps for each word and not for the SSML <mark> tag.

The following API request returns OK but does not include the requested marker data. This is using the Cloud Text-to-Speech API v1.

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "mp3"
 }
} 

Response:

{
 "audioContent":"//NExAAAAANIAAAAABcFAThYGJqMWA..."
}

This only provides the synthesized audio, without any timing information for the marks.

Is there an API request that I am overlooking which can expose information about these markers, as is possible with IBM Watson and Amazon Polly?

asked Aug 06 '19 by James



2 Answers

At the time of writing, the timepoint data is available in the v1beta1 release of Google Cloud Text-to-Speech.

I didn't need to sign up for any extra developer program to access the beta; the default access was enough.

Importing in Python (for example) went from:

from google.cloud import texttospeech as tts

to:

from google.cloud import texttospeech_v1beta1 as tts

Nice and simple.

I needed to modify the default way I was sending the synthesis request to include the enable_time_pointing flag.

I worked that out with a mix of poking around the machine-readable API description and reading the Python library code, which I had already downloaded.

Thankfully, the source in the generally available version also includes the v1beta1 version - thank you Google!

I've put a runnable sample below. Running it needs the same auth and setup as any general text-to-speech sample, which you can get by following the official documentation.
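
If you haven't done that setup before, it usually amounts to something like the following (the service-account key path is a placeholder):

$ pip install google-cloud-texttospeech
$ export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"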

Here's what it does for me (with slight formatting for readability):

$ python tools/try-marks.py
Marks content written to file: .../demo.json
Audio content written to file: .../demo.mp3

$ cat demo.json
[
  {"sec": 0.4300000071525574, "name": "here"},
  {"sec": 0.9234582781791687, "name": "there"}
]

Here's the sample:

import json
from pathlib import Path
from google.cloud import texttospeech_v1beta1 as tts


def go_ssml(basename: Path, ssml):
    client = tts.TextToSpeechClient()
    voice = tts.VoiceSelectionParams(
        language_code="en-AU",
        name="en-AU-Wavenet-B",
        ssml_gender=tts.SsmlVoiceGender.MALE,
    )

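    # enable_time_pointing is the key v1beta1 addition: it asks the API to
    # return a timepoint for each SSML <mark> alongside the audio.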
    response = client.synthesize_speech(
        request=tts.SynthesizeSpeechRequest(
            input=tts.SynthesisInput(ssml=ssml),
            voice=voice,
            audio_config=tts.AudioConfig(audio_encoding=tts.AudioEncoding.MP3),
            enable_time_pointing=[
                tts.SynthesizeSpeechRequest.TimepointType.SSML_MARK]
        )
    )

    # cheesy conversion of array of Timepoint proto.Message objects into plain-old data
    marks = [dict(sec=t.time_seconds, name=t.mark_name)
             for t in response.timepoints]

    name = basename.with_suffix('.json')
    with name.open('w') as out:
        json.dump(marks, out)
        print(f'Marks content written to file: {name}')

    name = basename.with_suffix('.mp3')
    with name.open('wb') as out:
        out.write(response.audio_content)
        print(f'Audio content written to file: {name}')


go_ssml(Path.cwd() / 'demo', """
    <speak>
    Go from <mark name="here"/> here, to <mark name="there"/> there!
    </speak>
    """)
answered Oct 11 '22 by Andrew E


Looks like this is supported in Cloud Text-to-Speech API v1beta1: https://cloud.google.com/text-to-speech/docs/reference/rest/v1beta1/text/synthesize#TimepointType

You can use https://texttospeech.googleapis.com/v1beta1/text:synthesize. Set enableTimePointing to SSML_MARK in the request body; if this field is not set, timepoints are not returned.
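
A minimal request body for that endpoint could look like this (the enableTimePointing field is the only addition relative to the v1 request in the question):

{
 "voice": {
  "languageCode": "en-US"
 },
 "input": {
  "ssml": "<speak>First, <mark name=\"a\"/> second, <mark name=\"b\"/> third.</speak>"
 },
 "audioConfig": {
  "audioEncoding": "MP3"
 },
 "enableTimePointing": ["SSML_MARK"]
}

The response then includes a timepoints array alongside audioContent, with a markName and timeSeconds for each <mark>.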

answered Oct 11 '22 by i_am_momo