Here is a sample of WebVTT:
WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
}
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
}
##
00:00:00.060 --> 00:00:03.080 align:start position:0%
<c.colorE5E5E5>okay<00:00:00.690><c> so</c><00:00:00.750><c> this</c><00:00:01.319><c> is</c><00:00:01.469><c> a</c></c><c.colorCCCCCC><00:00:01.500><c> newsflash</c><00:00:02.040><c> page</c><00:00:02.460><c> for</c></c>
00:00:03.080 --> 00:00:03.090 align:start position:0%
<c.colorE5E5E5>okay so this is a</c><c.colorCCCCCC> newsflash page for
</c>
00:00:03.090 --> 00:00:08.360 align:start position:0%
<c.colorE5E5E5>okay so this is a</c><c.colorCCCCCC> newsflash page for</c>
<c.colorE5E5E5>Meraki<00:00:03.659><c> printing</c><00:00:05.120><c> so</c><00:00:06.529><c> all</c><00:00:07.529><c> we</c><00:00:08.040><c> need</c><00:00:08.130><c> to</c><00:00:08.189><c> do</c></c>
00:00:08.360 --> 00:00:08.370 align:start position:0%
<c.colorE5E5E5>Meraki printing so all we need to do
</c>
00:00:08.370 --> 00:00:11.749 align:start position:0%
<c.colorE5E5E5>Meraki printing so all we need to do
here<00:00:08.700><c> is</c><00:00:08.820><c> to</c><00:00:09.000><c> swap</c><00:00:09.330><c> out</c><00:00:09.480><c> the</c><00:00:09.660><c> logo</c><00:00:09.929><c> here</c><00:00:10.650><c> and</c><00:00:10.830><c> I</c></c>
00:00:11.749 --> 00:00:11.759 align:start position:0%
here is to swap out the logo here<c.colorE5E5E5> and I
</c>
00:00:11.759 --> 00:00:16.400 align:start position:0%
here is to swap out the logo here<c.colorE5E5E5> and I
should<00:00:11.969><c> also</c><00:00:12.120><c> work</c><00:00:12.420><c> on</c><00:00:12.630><c> move</c><00:00:12.840><c> out</c><00:00:13.049><c> as</c><00:00:13.230><c> well</c><00:00:15.410><c> and</c></c>
00:00:16.400 --> 00:00:16.410 align:start position:0%
<c.colorE5E5E5>should also work on move out as well and
</c>
I used youtube-dl to grab it from YouTube.
I want to convert this to plain text. I can't just strip out the times and colour tags, as the text repeats itself.
So I'm wondering if something exists to convert this to plain text or if there is some pseudo code someone could offer so I could code that up?
I have also posted an issue about this with youtube-dl.
Using the command line in a bash shell works best for me, since it is fast, small, simple, and effective:
cat myfile.vtt | grep : -v | awk '!seen[$0]++'
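The grep removes every line that contains a : (colon); -v inverts the match, so only lines without a colon are kept. That drops the header fields, the timing lines, and the cue lines that carry inline timestamps.
The awk removes duplicate lines, keeping only the first occurrence of each.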
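For anyone who would rather stay in Python, the same two steps can be sketched roughly as below. The function name drop_colon_lines_and_dedupe is made up for illustration. Like the shell pipeline, this is only a rough filter: it leaves colour tags and stray lines such as WEBVTT or </c> in place, it drops any caption line whose text happens to contain a colon, and it removes a line that legitimately repeats later in the video.

```python
def drop_colon_lines_and_dedupe(lines):
    """Rough Python counterpart of: grep -v : | awk '!seen[$0]++'

    Hypothetical helper, not part of any library.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if ":" in line:
            # Same effect as `grep -v :`, skip every line containing a colon
            continue
        if line in seen:
            # Same effect as awk '!seen[$0]++', keep the first occurrence only
            continue
        seen.add(line)
        yield line


# Small in-memory demo; a real file object can be passed in the same way.
demo = [
    "00:00:03.080 --> 00:00:03.090 align:start position:0%",
    "okay so this is a newsflash page for",
    "00:00:03.090 --> 00:00:08.360 align:start position:0%",
    "okay so this is a newsflash page for",
    "Meraki<00:00:03.659><c> printing</c>",
    "Meraki printing so all we need to do",
]
for text in drop_colon_lines_and_dedupe(demo):
    print(text)
```

Because the function is a generator, it can be fed an open file directly, for example drop_colon_lines_and_dedupe(open("myfile.vtt", encoding="utf-8")).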
I've used WebVTT-py to extract the plain text transcription:
import webvtt
vtt = webvtt.read('subtitles.vtt')
transcript = ""
lines = []
for line in vtt:
    # Strip the newlines from the end of the text.
    # Split the string if it has a newline in the middle
    # Add the lines to an array
    lines.extend(line.text.strip().splitlines())
# Remove repeated lines
previous = None
for line in lines:
    if line == previous:
        continue
    transcript += " " + line
    previous = line
print(transcript)
This is the same concept as in Terence Eden's answer, but generalized into separate functions. Generators improve readability for this task and save a lot of memory: there's often no need to hold data from files in lists or big strings for processing. Here, webvtt is the only part that keeps the whole source file in memory.
I found whitespace HTML entities (&nbsp;) in my files too, so a simple replace is added. I also made it keep the line breaks by default.
This is my version, using pathlib, typing, and generators:
from pathlib import Path
from typing import Generator
import webvtt
def vtt_lines(src) -> Generator[str, None, None]:
    """
    Extracts all text lines from a vtt file which may contain duplicates
    :param src: File path or file like object
    :return: Generator for lines as strings
    """
    vtt = webvtt.read(src)
    for caption in vtt:  # type: webvtt.structures.Caption
        # A caption which may contain multiple lines
        for line in caption.text.strip().splitlines():  # type: str
            # Process each one of them
            yield line
def deduplicated_lines(lines) -> Generator[str, None, None]:
    """
    Filters all duplicated lines from list or generator
    :param lines: iterable or generator of strings
    :return: Generator for lines as strings without duplicates
    """
    last_line = ""
    for line in lines:
        if line == last_line:
            continue
        last_line = line
        yield line
def vtt_to_linear_text(src, savefile: Path, line_end="\n"):
    """
    Converts a vtt caption file to linear text.
    :param src: Path or path like object to an existing vtt file
    :param savefile: Path object to save content in
    :param line_end: Default to line break. May be set to a space for a single line output.
    """
    with savefile.open("w") as writer:
        for line in deduplicated_lines(vtt_lines(src)):
            # Replace non-breaking space entities before trimming the line
            writer.write(line.replace("&nbsp;", " ").strip() + line_end)
# Demo call
vtt_to_linear_text("file.vtt", Path("file.txt"))
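If installing webvtt-py is not an option, the same idea (take the text of each cue, strip the tags, skip a line that repeats the previous one) can be sketched with only the standard library. This is a rough sketch, not a full WebVTT parser: the names plain_lines, TIMING_RE and TAG_RE are made up for illustration, and it assumes files shaped like the sample in the question, with no cue identifiers or NOTE blocks after the first timing line.

```python
import re

# A cue timing line such as "00:00:00.060 --> 00:00:03.080 align:start"
# (the hours part is optional in WebVTT).
TIMING_RE = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s")
# Any tag in angle brackets: <c>, </c>, <c.colorE5E5E5>, <00:00:00.690>, ...
TAG_RE = re.compile(r"<[^>]*>")


def plain_lines(vtt_text):
    """Yield caption lines without tags, skipping consecutive repeats.

    Hypothetical helper; assumes everything after the first timing line
    is either another timing line or caption text.
    """
    in_cues = False
    previous = None
    for raw in vtt_text.splitlines():
        if TIMING_RE.match(raw):
            in_cues = True
            continue
        if not in_cues:
            # Still in the header (WEBVTT, Kind, Language, Style block)
            continue
        line = TAG_RE.sub("", raw).strip()
        if not line or line == previous:
            # Empty after tag removal, or the same text as the last kept line
            continue
        previous = line
        yield line


# In-memory demo built from the first cues of the sample in the question.
demo = """WEBVTT
Kind: captions

00:00:00.060 --> 00:00:03.080 align:start position:0%
<c.colorE5E5E5>okay<00:00:00.690><c> so</c><00:00:00.750><c> this</c></c>

00:00:03.080 --> 00:00:03.090 align:start position:0%
<c.colorE5E5E5>okay so this
</c>
"""
print(" ".join(plain_lines(demo)))
```

Comparing each line only with the last kept line, rather than with everything seen so far, means a sentence that genuinely occurs again later in the video is preserved.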