To generate subtitles for my videos, I converted them to audio files and used Cloud Speech-to-Text. It works, but it only generates transcriptions, whereas what I need is a *.srt/*.vtt/similar file.
What I need is what YouTube does: generate transcriptions and sync them with the video in a subtitle format, i.e. transcriptions with the times at which captions should appear.
Although I could upload the videos to YouTube and then download the auto-generated captions, that doesn't seem like the right approach.
Is there a way to generate an SRT file (or similar) using Google Cloud Speech?
There is no way to do this directly from the Speech-to-Text API. What you could try is some post-processing on the speech recognition result.
For example, here's a request to the REST API using a model meant to transcribe video, with a public Google-provided sample file:
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize \
    --data "{
  'config': {
    'encoding': 'LINEAR16',
    'sampleRateHertz': 16000,
    'languageCode': 'en-US',
    'enableWordTimeOffsets': true,
    'enableAutomaticPunctuation': true,
    'model': 'video'
  },
  'audio': {
    'uri':'gs://cloud-samples-tests/speech/Google_Gnome.wav'
  }
}"
The above uses asynchronous recognition (speech:longrunningrecognize), which is better suited to larger files. Enabling punctuation ('enableAutomaticPunctuation': true) in combination with word time offsets ('enableWordTimeOffsets': true) gives you the start and end times of the words near the start and end of each sentence. Once you convert those offsets from nanos to timestamps, you could write them out as a text file in the srt format. You would probably also have to include some rules about the maximum length of a sentence appearing on the screen at any given time.
The above should not be too difficult to implement. However, there's a strong possibility that you would still encounter timing/synchronization issues.
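The post-processing described in this answer could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the word entries are dicts with 'word', 'startTime' and 'endTime' keys, where the times are strings such as '1.300s' (the shape the REST response uses when word time offsets are enabled); the function names and the 42-character limit are my own choices for illustration, not part of the API. It starts a new caption at sentence-ending punctuation, or earlier if the caption would get too long.

```python
def offset_to_ms(offset):
    """Convert an offset string such as '1.300s' or '5s' to milliseconds."""
    secs, _, frac = offset.rstrip("s").partition(".")
    # Pad or truncate the fractional part to exactly three digits.
    return int(secs) * 1000 + int((frac + "000")[:3])


def srt_time(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def words_to_srt(words, max_chars=42):
    """Group timed words into captions and return them as SRT text.

    max_chars is an assumed limit on caption length, not an API setting.
    """
    cues, current = [], []
    for w in words:
        candidate = " ".join(x["word"] for x in current + [w])
        # Close the current caption if adding this word would make it too long.
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append(w)
        # Automatic punctuation lets us also break at sentence boundaries.
        if w["word"].endswith((".", "?", "!")):
            cues.append(current)
            current = []
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        start = srt_time(offset_to_ms(cue[0]["startTime"]))
        end = srt_time(offset_to_ms(cue[-1]["endTime"]))
        text = " ".join(x["word"] for x in cue)
        blocks.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "startTime": "0s", "endTime": "0.500s"},
        {"word": "world.", "startTime": "0.500s", "endTime": "1.200s"},
        {"word": "Bye.", "startTime": "2s", "endTime": "2.400s"},
    ]
    print(words_to_srt(sample))
```

Breaking on punctuation keeps captions aligned with sentences, while the length limit stops a long sentence from filling the screen. It does nothing about the timing/synchronization issues mentioned above.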
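Here is the code I used. It reads the saved JSON response (the annotationResults / speechTranscriptions layout it expects looks like Video Intelligence API speech transcription output, so adjust the keys if your response is shaped differently) and writes two SRT files: one caption per transcription, and one with the words regrouped into chunks of roughly 3 seconds.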
import json


def parse_offset(offset):
    """Convert an offset such as '12.300s' or '5s' to whole milliseconds."""
    secs, _, frac = offset.rstrip('s').partition('.')
    # Pad or truncate the fraction to three digits so '1.5s' becomes 1500 ms.
    return int(secs) * 1000 + int((frac + '000')[:3])


def to_srt_time(ms):
    """Format milliseconds as an SRT timestamp: HH:MM:SS,mmm."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return '{:02}:{:02}:{:02},{:03}'.format(h, m, s, ms)


def srt_block(index, start_ms, end_ms, text):
    """Build one SRT entry: index, time range, text, then a blank line."""
    return '{}\n{} --> {}\n{}\n\n'.format(
        index, to_srt_time(start_ms), to_srt_time(end_ms), text)


def srt_generation(filepath, filename, standard_duration=3):
    """Read <filepath><filename>.json and write two .srt files next to it.

    For example, srt_generation('./', 'DL_BIRTHDAY') reads ./DL_BIRTHDAY.json.
    """
    with open('{}{}.json'.format(filepath, filename), 'r') as file:
        data = json.load(file)
    results = data['response']['annotationResults'][0]['speechTranscriptions']

    # Pass 1: one caption per transcription, timed from its first word
    # to its last word. Collect all words for the second pass.
    lines = []
    wordlist = []
    for transcription in results:
        alternative = transcription['alternatives'][0]
        words = alternative.get('words')
        if 'transcript' in alternative and words:
            lines.append(srt_block(
                len(lines) + 1,
                parse_offset(words[0]['startTime']),
                parse_offset(words[-1]['endTime']),
                alternative['transcript'].strip()))
            wordlist.extend(words)
    with open('{}{}.srt'.format(filepath, filename), 'w', encoding='utf-8') as srtfile:
        srtfile.writelines(lines)

    # Pass 2: regroup the words into chunks of about standard_duration seconds.
    lines = []
    chunk = []
    chunk_start = chunk_end = 0
    for word in wordlist:
        start = parse_offset(word['startTime'])
        end = parse_offset(word['endTime'])
        # Flush the current chunk if this word would push it past the limit.
        if chunk and end - chunk_start > standard_duration * 1000:
            lines.append(srt_block(len(lines) + 1, chunk_start, chunk_end, ' '.join(chunk)))
            chunk = []
        if not chunk:
            chunk_start = start
        chunk.append(word['word'])
        chunk_end = end
    # Write out whatever is left after the last word.
    if chunk:
        lines.append(srt_block(len(lines) + 1, chunk_start, chunk_end, ' '.join(chunk)))
    with open('{}{}_3_Sec_Custom.srt'.format(filepath, filename), 'w', encoding='utf-8') as srtfile:
        srtfile.writelines(lines)
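Since the question also mentions *.vtt: an SRT file can usually be turned into WebVTT by adding the WEBVTT header and switching the decimal separator in the timing lines from a comma to a dot. The function below is a small sketch of that idea, not a full converter; the name srt_to_vtt is my own, and it assumes plain SRT without styling tags.

```python
import re

# Matches SRT timestamps such as 00:01:02,345 (comma before the milliseconds).
SRT_TIMESTAMP = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert plain SRT text to WebVTT text."""
    out = ["WEBVTT", ""]  # header line followed by a blank line
    for line in srt_text.splitlines():
        # Only touch timing lines, so commas in the caption text are kept.
        if "-->" in line:
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(srt_to_vtt("1\n00:00:00,000 --> 00:00:01,200\nHello, world.\n"))
```

The numeric indexes from the SRT file are left in place, since WebVTT allows an identifier line before each cue.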