I am trying to extract some information from the string given below:
>>> import re
>>> st = '''
... <!-- info mp3 here -->
... 192 kbps<br />2:41<br />3.71 mb </div>
... <!-- info mp3 here -->
... 3.49 mb </div>
... <!-- info mp3 here -->
... 128 kbps<br />3:31<br />3.3 mb </div>
... '''
>>>
When I use the regex below, my output is:
>>> p = re.findall(r'<!-- info mp3 here -->\s+(.*?)<br />(.*?)<br />(.*?)\s+</div>',st)
>>> p
[('192 kbps', '2:41', '3.71 mb'), ('128 kbps', '3:31', '3.3 mb')]
but my required output is:
[('192 kbps', '2:41', '3.71 mb'), (None, None, '3.49 mb'), ('128 kbps', '3:31', '3.3 mb')]
So my question is: how do I change the regex above to match all of these cases? I believe my current regex depends strictly on the <br /> tags, so how do I make them optional?
I know I should not be using regex to parse HTML, but currently this is the most appropriate way for me.
The following will work, though I wonder if there's not a more elegant solution. You can certainly combine the list comprehensions into one line, but I think that makes the code less clear overall. At least this way you'll be able to follow what you did three months from now...
import re

st = '''
<!-- info mp3 here -->
192 kbps<br />2:41<br />3.71 mb </div>
<!-- info mp3 here -->
3.49 mb </div>
<!-- info mp3 here -->
128 kbps<br />3:31<br />3.3 mb </div>
'''
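# Capture everything between the marker comment and the closing </div>
p = re.findall(r'<!-- info mp3 here -->\s+(.*?)\s+</div>',st)
# Split each captured row on the <br /> tags
p2 = [row.split('<br />') for row in p]
# Left-pad short rows with None so every row has three items
p3 = [[None]*(3 - len(row)) + row for row in p2]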
>>> p3
[['192 kbps', '2:41', '3.71 mb'], [None, None, '3.49 mb'], ['128 kbps', '3:31', '3.3 mb']]
And, depending on the variability in your string, you may want to write a more generic cleaning function that strips, cases, whatever, and map it to each item you pull out.
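As a rough sketch of that idea, the cleaning step could be folded into the same split-and-pad approach. The names `clean` and `extract` are hypothetical, and trimming plus lower-casing is only one assumed choice of normalization:

```python
import re

BR = '<br />'

def clean(item):
    """Hypothetical normalizer: trim whitespace and lower-case; None passes through."""
    if item is None:
        return None
    return item.strip().lower()

def extract(text):
    # Same pattern as above: grab everything between the marker and </div>
    rows = re.findall(r'<!-- info mp3 here -->\s+(.*?)\s+</div>', text)
    result = []
    for row in rows:
        parts = row.split(BR)
        # Left-pad with None so every row has three items
        padded = [None] * (3 - len(parts)) + parts
        result.append([clean(p) for p in padded])
    return result

# Assumed sample input with inconsistent spacing and casing
sample = '''
<!-- info mp3 here -->
192 Kbps<br /> 2:41 <br />3.71 MB </div>
<!-- info mp3 here -->
3.49 mb </div>
'''
print(extract(sample))
```

Keeping the normalization in one small function means it can be swapped out later without touching the regex.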
Here's a regex solution that works by being a bit more specific. I'm not sure this is preferable to Karmel's answer, but I figured I'd answer the question as asked. Instead of returning None, the first two optional groups return the empty string '', which I think is probably close enough.
Note the nested group structure. The first two outer groups are optional, but the <br /> tag is required for them to match. That way, if there are fewer than two <br /> tags, the last item doesn't match until the end:
rx = r'''<!--\ info\ mp3\ here\ -->\s+ # verbose mode; escape literal spaces
(?: # outer non-capturing group
([^<>]*) # inner capturing group without <>
(?:<br\ />) # inner non-capturing group matching br
)? # whole outer group is optional
(?:
([^<>]*) # all same as above
(?:<br\ />)
)?
(?: # outer non-capturing group
(.*?) # non-greedy wildcard match
(?:\s+</div>) # inner non-capturing group matching div
)''' # final group is not optional
Tested:
>>> re.findall(rx, st, re.VERBOSE)
[('192 kbps', '2:41', '3.71 mb'),
('', '', '3.49 mb'),
('128 kbps', '3:31', '3.3 mb')]
Note the re.VERBOSE flag, which is necessary unless you remove all the whitespace and comments above.
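If None is really needed rather than '', one option is to iterate with re.finditer and call groups() on each match, since groups() reports non-participating groups as None by default. The sketch below restates the pattern in condensed form (same groups, comments removed) and repeats the question's string so it runs on its own:

```python
import re

# Condensed restatement of the verbose pattern above (still needs re.VERBOSE)
rx = r'''<!--\ info\ mp3\ here\ -->\s+
    (?:([^<>]*)<br\ />)?
    (?:([^<>]*)<br\ />)?
    (.*?)\s+</div>'''

st = '''
<!-- info mp3 here -->
192 kbps<br />2:41<br />3.71 mb </div>
<!-- info mp3 here -->
3.49 mb </div>
<!-- info mp3 here -->
128 kbps<br />3:31<br />3.3 mb </div>
'''

# Match.groups() yields None for optional groups that did not take part
rows = [m.groups() for m in re.finditer(rx, st, re.VERBOSE)]
print(rows)
```

Unlike replacing '' with None after the fact, this keeps a field that matched but was genuinely empty distinct from one that was absent.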