Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing srt subtitles

Tags:

python

regex

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

Every item into structure. With this regexs:

A:

RE_ITEM = re.compile(r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    r'(?P<text>.*?)', re.DOTALL)

B:

RE_ITEM = re.compile(r'(?P<index>\d+).'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
    r'(?P<text>.*)', re.DOTALL)

And this code:

    for i in Subtitles.RE_ITEM.finditer(text):
    result.append((i.group('index'), i.group('start'), 
             i.group('end'), i.group('text')))

With code B I have only one item in array (because of greedy .*) and with code A I have empty 'text' because of no-greedy .*?

How to cure this?

Thanks

like image 267
Vojta Rylko Avatar asked Apr 11 '10 10:04

Vojta Rylko


People also ask

Are SRT files closed captions?

An SRT file or SubRip (. srt) file is one of the most common types of raw closed caption or subtitle file formats.

How do SRT files work?

An SRT file, (aka a SubRip Subtitle file), is a plain text file of your subtitles that attaches to your video. In addition the the script, or subtitles, of your video, an SRT file also contains timestamps, so the subtitles match the audio exactly.


2 Answers

Why not use pysrt?

like image 130
John La Rooy Avatar answered Oct 01 '22 22:10

John La Rooy


I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input:

>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... těma pracovníma světlama?.
... 
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
... 
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
... 
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
 Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
 Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]
like image 23
Chris Down Avatar answered Oct 01 '22 22:10

Chris Down