Complex non-greedy matching with regular expressions

Tags: python, regex, non-greedy

I'm trying to use regular expressions in Python to extract the rows of an HTML table whose cells contain specific values. My aim in this (contrived) example is to get the rows with "cow".

import re

response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

r = re.compile(r'<tr.*?cow.*?tr>', re.DOTALL)

for m in r.finditer(response):
    print(m.group(0), "\n")

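My output is

<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>

<tr class="someClass"><td></td><td>cow</td></tr>

<tr class="someClass"><td></td><td>cow</td></tr>

While my aim is to get

<tr class="someClass"><td></td><td>cow</td></tr>

<tr class="someClass"><td></td><td>cow</td></tr>

<tr class="someClass"><td></td><td>cow</td></tr>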

I understand that the non-greedy ? doesn't work in this case because of how backtracking works. I fiddled around with negative lookbehinds and lookaheads but couldn't get them to work.

Does anybody have suggestions?

I'm aware of solutions like Beautiful Soup, etc. but the question is about understanding regular expressions, not the problem per se.

To address people's concerns about using regular expressions for HTML: the general problem I want to solve, using regular expressions ONLY, is to get from

response = '''0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff10randomstuffB4randomstuff10randomstuffB5randomstuff1'''

the output

0randomstuffB3randomstuff1
0randomstuffB4randomstuff1
0randomstuffB5randomstuff1

and randomstuff should be interpreted as random strings (but not containing 0 or 1).


asked Jun 08 '16 08:06

user2940666

2 Answers

Your problem isn't caused by greediness but by the fact that the regex engine attempts a match at each position in the string, from left to right, and returns the first attempt that succeeds. That's why you always get the leftmost match: a non-greedy quantifier changes where a match ends, not where it starts!

If you write something like <tr.*?cow.*?tr> or 0.*?B.*?1 (for your second example), the pattern first gets a chance to match here:

  <tr class="someClass"><td></td><td>chicken</td></tr>...
# ^-----here

# or

  0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3ra...
# ^-----here

The first .*? then consumes characters until it reaches "cow" or "B", crossing any row or record boundaries on the way. As a result, the first match is:

<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>

for your first example, and:

0randomstuffA1randomstuff10randomstuffA2randomstuff10randomstuffB3randomstuff1

for the second.
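
A quick way to check this is to compare where the lazy and the greedy variants of the second pattern begin. The sketch below rebuilds the sample string from the question out of its five records; the names records, text, lazy and greedy are only illustrative.

```python
import re

# The question's sample string, rebuilt from its five "0...1" records.
records = [
    "0randomstuffA1randomstuff1",
    "0randomstuffA2randomstuff1",
    "0randomstuffB3randomstuff1",
    "0randomstuffB4randomstuff1",
    "0randomstuffB5randomstuff1",
]
text = "".join(records)

lazy = re.search(r"0.*?B.*?1", text)
greedy = re.search(r"0.*B.*1", text)

# Both variants start at the first "0"; only the end position differs.
print(lazy.start(), greedy.start())
print(lazy.end(), greedy.end())
print(lazy.group(0))
```

The lazy version stops at the first 1 after the first B and the greedy version runs to the last 1, but neither can skip the two leading A records, because the starting position is decided before the quantifiers come into play.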

To get what you want, you need to make the pattern fail at the unwanted starting positions in the string. .*? is useless for that because it is too permissive: it matches anything, including the delimiters you want to stop at.

You can, for instance, forbid a </tr> or a 1 from occurring before "cow" or "B".

# easy to write but not very efficient (with DOTALL)
<tr\b(?:(?!</tr>).)*?cow.*?</tr>

# more efficient
<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>

# easier to write when boundaries are single characters
0[^01B]*B[^01]*1
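
For reference, here is how the three patterns above could be run against the inputs from the question. This is a sketch in Python 3; the variable names (html, tempered, unrolled, negated and so on) are mine, and re.DOTALL is passed to both HTML patterns for consistency even though only the first one is marked as needing it.

```python
import re

html = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

digits = "".join([
    "0randomstuffA1randomstuff1",
    "0randomstuffA2randomstuff1",
    "0randomstuffB3randomstuff1",
    "0randomstuffB4randomstuff1",
    "0randomstuffB5randomstuff1",
])

# Lookahead checked before every character: never step over a </tr>.
tempered = re.compile(r'<tr\b(?:(?!</tr>).)*?cow.*?</tr>', re.DOTALL)

# Same idea, but the lookaheads are only tested at "<" and "c".
unrolled = re.compile(
    r'<tr\b[^<c]*(?:<(?!/tr>)[^<c]*|c(?!ow)[^<c]*)*cow.*?</tr>',
    re.DOTALL,
)

# Single-character boundaries: negated character classes are enough.
negated = re.compile(r'0[^01B]*B[^01]*1')

rows_tempered = tempered.findall(html)
rows_unrolled = unrolled.findall(html)
chunks = negated.findall(digits)

for row in rows_tempered:
    print(row)
for chunk in chunks:
    print(chunk)
```

None of the patterns contains a capturing group, so findall returns the whole matches. An attempt that starts at a "chicken" row now fails when it reaches that row's </tr>, and the engine moves on to the next starting position.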

answered Sep 30 '22 00:09

Casimir et Hippolyte

If the input string contains each tag on a separate line, Moses Koledoye's answer would work.
However, if the tags are spread out over multiple lines, the following would be needed:

import re


response = '''
<tr class="someClass
"><td></td><td>chicken</td></tr><tr class="someClass"><td></td><td>chic
ken</td></tr><tr class="someClass"><td></td><td>cow</td></tr><tr class="someC
lass"><td></td><td>cow</td></tr><tr
class="someClass"><td></td><td>c
ow
</td></tr>
'''


# Remove all the newlines
# Required only if words like 'cow' and '<tr' are split between 2 lines
response = response.replace('\n', '')

r1 = re.compile(r'<tr.*?tr>', re.DOTALL)
r2 = re.compile(r'.*cow.*', re.DOTALL)

for m in r1.finditer(response):
    n = r2.match(m.group())
    if n:
        print(n.group(), '\n')

Note that this would work even if the tags were on separate lines as shown in the example string you provided, so this is a more general solution.
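
To check that claim, the same two-step approach can be run on the one-row-per-line input from the question. This is a sketch in Python 3; collecting the results in a list called cow_rows instead of printing inside the loop is my own variation.

```python
import re

# Input exactly as given in the question: one row per line.
response = '''
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>chicken</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
<tr class="someClass"><td></td><td>cow</td></tr>
'''

# Harmless here, since no token is split across lines.
response = response.replace('\n', '')

r1 = re.compile(r'<tr.*?tr>', re.DOTALL)   # step 1: isolate each row
r2 = re.compile(r'.*cow.*', re.DOTALL)     # step 2: keep rows containing "cow"

cow_rows = [m.group() for m in r1.finditer(response) if r2.match(m.group())]

for row in cow_rows:
    print(row, '\n')
```

Because each row is isolated before the "cow" test is applied, a match can never span more than one row.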

answered Sep 30 '22 01:09

Anmol Singh Jaggi
