
Selecting specific <tr> tags with BeautifulSoup

I am fetching some HTML table rows with BeautifulSoup using this code:

from bs4 import BeautifulSoup
import urllib2
import re

page = urllib2.urlopen('http://www.something.bla')  # urlopen needs a full URL, including the scheme
soup = BeautifulSoup(page)
rows = soup.findAll('tr', attrs={'class': re.compile('class1.*')})

This is what I get as a result:

<tr class="class1 class2 class3">...</tr>
<tr class="class1 class2 class3">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<tr class="class1 class5">...</tr>
<tr class="class1_a class5_a">...</tr>
<!-- etc. -->

However, I'd like to exclude (or, better, not select in the first place) the rows whose class attribute is "class1 class2 class3".

How can I do that?
Thanks for help!

asked Feb 12 '12 at 23:02 by errata



1 Answer

Perhaps it's easier without regex. This works with BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup

page = """
<tr class="class1 class2 class3">1</tr>
<tr class="class1 class2 class3">2</tr>
<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>
<tr>7</tr>"""

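def cond(x):
    # x is the value of the class attribute (a plain string in BeautifulSoup 3);
    # the truthiness check guards against tags that have no class at all
    if x:
        return x.startswith("class1") and "class2 class3" not in x
    else:
        return False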

soup = BeautifulSoup(page)
rows = soup.findAll('tr', {'class': cond})

for row in rows:
    print row

=>

<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>

With BeautifulSoup 4, I was able to make it work as follows:

import re
from bs4 import BeautifulSoup

page = """
<tr class="class1 class2 class3">1</tr>
<tr class="class1 class2 class3">2</tr>
<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>
<tr>7</tr>"""

soup = BeautifulSoup(page, "html.parser")  # name the parser so results don't depend on what is installed
rows = soup.find_all('tr', {'class': re.compile('class1.*')})

for row in rows:
    cls = row.attrs.get("class")  # a list of class names in BS4, e.g. ["class1", "class5"]
    if not ("class2" in cls or "class3" in cls):
        print(row)

=>

<tr class="class1 class5">3</tr>
<tr class="class1_a class5_a">4</tr>
<tr class="class1 class5">5</tr>
<tr class="class1_a class5_a">6</tr>

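In BS4, multi-valued attributes like class have lists of strings as their values, not strings, which is why the loop tests membership with "in" against individual class names instead of searching one long string. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id12.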
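
Since class comes back as a list in BS4, another option is to skip the regex and pass find_all a function that receives each tag. The following is only a sketch under some assumptions: it targets Python 3, the helper names keep_classes and wanted and the EXCLUDED set are made up for illustration, and the sample rows are wrapped in <table>/<td> so the markup is valid. Unlike the loop above, it drops a row only when class2 and class3 are both present, which is how I read the question.

```python
from bs4 import BeautifulSoup

HTML = """
<table>
<tr class="class1 class2 class3"><td>1</td></tr>
<tr class="class1 class2 class3"><td>2</td></tr>
<tr class="class1 class5"><td>3</td></tr>
<tr class="class1_a class5_a"><td>4</td></tr>
<tr class="class1 class5"><td>5</td></tr>
<tr class="class1_a class5_a"><td>6</td></tr>
<tr><td>7</td></tr>
</table>
"""

# Hypothetical name: the class combination that should be left out.
EXCLUDED = {"class2", "class3"}


def keep_classes(classes):
    """True if some class starts with 'class1' and EXCLUDED is not fully present."""
    has_prefix = any(c.startswith("class1") for c in classes)
    return has_prefix and not EXCLUDED.issubset(classes)


def wanted(tag):
    """Tag-level filter: bs4 calls this once per tag in the document."""
    # tag.get("class", []) yields the list of class names, or [] if there is none
    return tag.name == "tr" and keep_classes(tag.get("class", []))


soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find_all(wanted)
texts = [row.get_text(strip=True) for row in rows]
print(texts)
```

The upside is that the unwanted rows are never selected in the first place, so no second pass over the results is needed. To exclude rows that carry either class2 or class3, replace the issubset test with EXCLUDED.isdisjoint(classes) and drop the "not".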

answered Nov 15 '22 at 02:11 by mzjn
