Parsing srt subtitles

Tags:

python

regex

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

I want to put every item into a structure, using one of these regexes:

A:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*?)', re.DOTALL)

B:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*)', re.DOTALL)

And this code:

    for i in Subtitles.RE_ITEM.finditer(text):
        result.append((i.group('index'), i.group('start'),
                       i.group('end'), i.group('text')))

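With regex B I get only one item in the result list, because the greedy .* swallows the rest of the input. With regex A the 'text' group is always empty, because the non-greedy .*? matches as little as possible and nothing after it in the pattern forces it to grow.

How can I fix this?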

Thanks

267

asked Apr 11 '10 10:04

Vojta Rylko

2 Answers

Why not use pysrt?

130

answered Oct 01 '22 22:10

John La Rooy

I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input:

    >>> import srt, pprint
    >>> gen = srt.parse('''\
    ... 1
    ... 00:00:12,815 --> 00:00:14,509
    ... Chlapi, jak to jde s
    ... těma pracovníma světlama?.
    ...
    ... 2
    ... 00:00:14,815 --> 00:00:16,498
    ... Trochu je zesilujeme.
    ...
    ... 3
    ... 00:00:16,934 --> 00:00:17,814
    ... Jo, sleduj.
    ...
    ... ''')
    >>> pprint.pprint(list(gen))
    [Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
     Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
     Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]
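If adding a dependency is not an option, the regex from the question can probably be made to work with the standard library alone. The sketch below is not how the srt library does it; it is only one possible fix for regex A. It keeps the non-greedy text group but follows it with a lookahead, so the group has to grow until it reaches a blank line or the end of the input. The helper name parse_srt and the sample string are made up for illustration, and the sketch assumes that subtitle text never contains a blank line itself.

```python
import re

# Possible fix for regex A from the question (an assumption, not the srt
# library's implementation): the lazy text group is followed by a lookahead,
# so it must expand up to a blank line or the end of the input.
RE_ITEM = re.compile(
    r'(?P<index>\d+)\r?\n'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3})\r?\n'
    r'(?P<text>.*?)'
    r'(?=\r?\n\r?\n|\s*\Z)',
    re.DOTALL,
)


def parse_srt(text):
    """Hypothetical helper: return a list of (index, start, end, text) tuples."""
    return [
        (m.group('index'), m.group('start'), m.group('end'), m.group('text'))
        for m in RE_ITEM.finditer(text)
    ]


# Sample input taken from the question.
sample = (
    "1\n"
    "00:00:12,815 --> 00:00:14,509\n"
    "Chlapi, jak to jde s\n"
    "těma pracovníma světlama?.\n"
    "\n"
    "2\n"
    "00:00:14,815 --> 00:00:16,498\n"
    "Trochu je zesilujeme.\n"
    "\n"
    "3\n"
    "00:00:16,934 --> 00:00:17,814\n"
    "Jo, sleduj.\n"
)

items = parse_srt(sample)
for item in items:
    print(item)
```

The lookahead does not consume the blank line, so the next match starts cleanly at the following index. Real-world files with stray blank lines inside a subtitle or malformed timestamps would need more care, which is where a dedicated library tends to pay off.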

answered Oct 01 '22 22:10

Chris Down
