I'm trying to use regular expressions in Python to extract the rows of an HTML table whose cells contain specific values. My aim in this (contrived) example is to get the rows with "cow".
import re
response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''
r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)
for m in r.finditer(response):
    print(m.group(0), "\n")
My output is
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
While my aim is to get
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
I understand that the non-greedy ? doesn't work in this case because of how backtracking works. I fiddled around with negative lookbehinds and lookahead but can't get it to work.
Does anybody have suggestions?
I'm aware of solutions like Beautiful Soup, etc. but the question is about understanding regular expressions, not the problem per se.
To address people's concerns about using regular expressions for HTML: the general problem I want to solve using regular expressions ONLY is to get from
response = '''0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'''
the output
0randomstuffB3randomstuff1
0randomstuffB4randomstuff1
0randomstuffB5randomstuff1
and randomstuff should be interpreted as random strings (but not containing 0 or 1).
A non-greedy match means that the regex engine matches as few characters as possible while still allowing the pattern to match in the given string.
It means the greedy quantifiers will match their preceding elements as much as possible to return to the biggest match possible. On the other hand, the non-greedy quantifiers will match as little as possible to return the smallest match possible.
You make a quantifier non-greedy by appending "?", as in ".*?". With that construct, at every step where the regex engine would add another character to the ".", it first attempts to match whatever comes after the ".*?".
Greedy matching. The default behavior of regular expressions is to be greedy. That means a quantifier tries to consume as much as possible while still conforming to the pattern, even when a smaller part would have been syntactically sufficient. For example, instead of matching up to the first occurrence of '>', a greedy ".*" runs on to the last one.
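To make the difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the `re` calls come from the standard library.

```python
import re

# Illustrative input, not taken from the question
text = '<td>cow</td>'

# Greedy: .* consumes everything, then backtracks to the LAST '>'
greedy = re.search(r'<.*>', text).group(0)

# Non-greedy: .*? grows one character at a time and stops at the FIRST '>'
lazy = re.search(r'<.*?>', text).group(0)

print(greedy)  # <td>cow</td>
print(lazy)    # <td>
```

Note that both matches start at the same position; the quantifier only changes where the match ends.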
Your problem isn't related to the greediness but to the fact that the regex engine tries to succeed at each position in the string from left to right. That's why you will always obtain the leftmost result and using a non-greedy quantifier will not change the starting position!
If you write something like <tr.*?cow.*?tr> or 0.*?B.*?1 (for your second example), the patterns are first tried at:
<tr class="someClass"><td></td><td>chicken</td></tr>...
# ^-----here
# or
0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3ra...
# ^-----here
And the first .*? will eat characters until "cow" or "B". As a result, the first match is:
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
for your first example, and:
0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff1
for the second.
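You can check this yourself. The sketch below runs the non-greedy pattern on the string from the second example (split into pieces here only for readability; the variable names are mine) and shows that the first match starts at index 0 and swallows the two "A" records.

```python
import re

# Same string as in the question, split into its five records
response = ('0randomstuffA1randomstuff1'
            '0randomstuffA2randomstuff1'
            '0randomstuffB3randomstuff1'
            '0randomstuffB4randomstuff1'
            '0randomstuffB5randomstuff1')

lazy = re.compile(r'0.*?B.*?1')

first = lazy.search(response)
print(first.start())   # 0: the leftmost position where a match is possible
print(first.group(0))  # spans the two "A" records plus the first "B" record

# Later matches begin after the first one ends, so they look "right"
matches = lazy.findall(response)
print(matches[1:])
```

Only the first match is too long, because that is the only one whose starting position sits before an unwanted record.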
To obtain what you want, you need to make the patterns fail at unwanted positions in the string. For that, .*? is useless because it is too permissive.
You can, for instance, forbid a </tr> or a 1 from occurring before "cow" or "B".
# easy to write but not very efficient (with DOTALL)
<tr\b(?:(?!</tr>).)*?cow.*?</tr>
# more efficient
<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>
# easier to write when boundaries are single characters
0[^01B]*B[^01]*1
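As a quick check, here is a sketch that applies the three patterns above to the inputs from the question. The variable names (`tempered`, `unrolled`, `record`, `cow_row`) are just labels I picked for illustration.

```python
import re

html = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''
cow_row = '<tr class="someClass"><td></td><td>cow</td></tr>'

# The lookahead is checked before every character, so the first part
# can never cross a </tr>
tempered = re.compile(r'<tr\b(?:(?!</tr>).)*?cow.*?</tr>', re.DOTALL)

# Same idea, but consuming runs of "safe" characters in one step
unrolled = re.compile(
    r'<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>', re.DOTALL)

response = ('0randomstuffA1randomstuff1' '0randomstuffA2randomstuff1'
            '0randomstuffB3randomstuff1' '0randomstuffB4randomstuff1'
            '0randomstuffB5randomstuff1')

# Character classes are enough when the boundaries are single characters
record = re.compile(r'0[^01B]*B[^01]*1')

for m in tempered.finditer(html):
    print(m.group(0))
print(unrolled.findall(html) == tempered.findall(html))  # True
print(record.findall(response))
```

On these inputs every pattern fails at the "chicken" rows and "A" records, so only the three wanted items come back.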
If the input string contains each tag on a separate line, Moses Koledoye's answer would work.
However, if the tags are spread out over multiple lines, the following would be needed:
import re
response = '''
<tr class="someClass
"><td></td><td>chicken</td></tr><tr class="someClass"><td></td><td>chic
ken</td></tr><tr class="someClass"><td></td><td>cow</td></tr><tr class="someC
lass"><td></td><td>cow</td></tr><tr
class="someClass"><td></td><td>c
ow
</td></tr>
'''
# Remove all the newlines
# Required only if words like 'cow' and '<tr' are split between 2 lines
response = response.replace('\n', '')
r1 = re.compile(r'<tr.*?tr>', re.DOTALL)
r2 = re.compile(r'.*cow.*', re.DOTALL)
# r1 isolates each row; r2 keeps only the rows that contain 'cow'
for m in r1.finditer(response):
    n = r2.match(m.group())
    if n:
        print(n.group(), '\n')
Note that this would work even if the tags were on separate lines as shown in the example string you provided, so this is a more general solution.