
Conditional matching in regular expression

Tags: python, regex

I am trying to extract some information from the string given below:

>>> st = '''
... <!-- info mp3 here -->
...                             192 kbps<br />2:41<br />3.71 mb  </div>
... <!-- info mp3 here -->
...                             3.49 mb  </div>
... <!-- info mp3 here -->
...                             128 kbps<br />3:31<br />3.3 mb   </div>
... '''
>>>

When I use the regex below, my output is:

>>> import re
>>> p = re.findall(r'<!-- info mp3 here -->\s+(.*?)<br />(.*?)<br />(.*?)\s+</div>',st)
>>> p
[('192 kbps', '2:41', '3.71 mb'), ('128 kbps', '3:31', '3.3 mb')]

but my required output is:

[('192 kbps', '2:41', '3.71 mb'), (None, None, '3.49 mb'), ('128 kbps', '3:31', '3.3 mb')]

So my question is: how do I change the regex above so that it matches all three cases? I believe my current regex depends strictly on the <br /> tags, so how do I make those tags optional?

I know I should not be using regex to parse HTML, but for now this is the most appropriate approach for me.

RanRag asked Feb 20 '23 09:02


2 Answers

The following will work, though I wonder if there's not a more elegant solution. You can certainly combine the list comprehensions into one line, but I think that makes the code less clear overall. At least this way you'll be able to follow what you did three months from now...

import re

st = '''
<!-- info mp3 here -->
                            192 kbps<br />2:41<br />3.71 mb  </div>
<!-- info mp3 here -->
                            3.49 mb  </div>
<!-- info mp3 here -->
                            128 kbps<br />3:31<br />3.3 mb   </div>
'''

p = re.findall(r'<!-- info mp3 here -->\s+(.*?)\s+</div>',st)
p2 = [row.split('<br />') for row in p]
p3 = [[None]*(3 - len(row)) + row for row in p2]

>>> p3
[['192 kbps', '2:41', '3.71 mb'], [None, None, '3.49 mb'], ['128 kbps', '3:31', '3.3 mb']]

Depending on how much your strings vary, you may also want to write a more generic cleaning function (stripping whitespace, normalizing case, and so on) and map it over each item you pull out.
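As a rough illustration of that last suggestion, here is a minimal sketch of such a cleaning step. The helper name `clean` and the sample `rows` are hypothetical, and the particular normalizations (collapsing whitespace, lowercasing) are assumptions about what your data might need, not something the question requires.

```python
import re

def clean(item):
    """Hypothetical helper: normalize one extracted field."""
    if item is None:
        # Padding values added for missing fields pass through untouched.
        return None
    # Collapse runs of whitespace, trim both ends, and lowercase.
    return re.sub(r"\s+", " ", item).strip().lower()

# Sample rows in the same shape as p3 above (made-up values).
rows = [[None, None, " 3.49  MB "], ["192 kbps", "2:41", "3.71 mb"]]

# Map the cleaner over every item of every row.
cleaned = [[clean(item) for item in row] for row in rows]
print(cleaned)
```

Keeping the cleaning in one function means any new quirk in the input only has to be handled in one place.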

Karmel answered Feb 25 '23 16:02


Here's a regex solution that works by being a bit more specific. I'm not sure this is preferable to Karmel's answer, but I figured I'd answer the question as asked. Instead of returning None, the first two optional groups return the empty string '', which I think is probably close enough.

Note the nested group structure. The first two outer groups are optional, but each one requires a <br /> tag in order to match. That way, an optional group with no <br /> tag to match is skipped, and the remaining text is captured by the final group:

rx = r'''<!--\ info\ mp3\ here\ -->\s+   # verbose mode; escape literal spaces
         (?:                             # outer non-capturing group  
            ([^<>]*)                     # inner capturing group without <>
            (?:<br\ />)                  # inner non-capturing group matching br
         )?                              # whole outer group is optional
         (?:                             
            ([^<>]*)                     # all same as above
            (?:<br\ />)                
         )?
         (?:                             # outer non-capturing group
            (.*?)                        # non-greedy wildcard match
            (?:\s+</div>)                # inner non-capturing group matching div
         )'''                            # final group is not optional

Tested:

>>> re.findall(rx, st, re.VERBOSE)
[('192 kbps', '2:41', '3.71 mb'), 
 ('', '', '3.49 mb'), 
 ('128 kbps', '3:31', '3.3 mb')]

Note the re.VERBOSE flag, which is necessary unless you remove all the whitespace and comments above.
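If you need None rather than '' for the missing fields, one option is to iterate over match objects instead of calling re.findall: Match.groups() returns None for groups that did not take part in the match. The sketch below is only one way to do this. It uses the same pattern as above, written compactly without verbose mode, and the names `pattern` and `results` are mine, not from the question.

```python
import re

st = '''
<!-- info mp3 here -->
                            192 kbps<br />2:41<br />3.71 mb  </div>
<!-- info mp3 here -->
                            3.49 mb  </div>
<!-- info mp3 here -->
                            128 kbps<br />3:31<br />3.3 mb   </div>
'''

# Same pattern as above, without verbose mode, so spaces are literal.
pattern = re.compile(
    r'<!-- info mp3 here -->\s+'
    r'(?:([^<>]*)<br />)?'   # optional first field, needs its <br />
    r'(?:([^<>]*)<br />)?'   # optional second field, needs its <br />
    r'(.*?)\s+</div>'        # required last field
)

# groups() yields None for optional groups that were skipped.
results = [m.groups() for m in pattern.finditer(st)]
print(results)
# [('192 kbps', '2:41', '3.71 mb'), (None, None, '3.49 mb'), ('128 kbps', '3:31', '3.3 mb')]
```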

senderle answered Feb 25 '23 15:02