
How can I emulate ":contains" using BeautifulSoup?

I'm working on a project where I need to do a bit of scraping. The project is on Google App Engine, and we're currently using Python 2.5. Ideally, we would use PyQuery, but because we are running on App Engine and Python 2.5, this is not an option.

I've seen questions like this one on finding an HTML tag with certain text, but they don't quite hit the mark.

I have some HTML that looks like this:

<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<!-- More posts of similar format -->

In PyQuery, I could do something like this (as far as I know):

from pyquery import PyQuery as pq

s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text

Naively, I had thought that I could do something like this in BeautifulSoup:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

soup = BeautifulSoup(html)
soup.findAll(True, "post", text="This post is about Google.com")
# []

However, that yielded no results. I changed my query to use a regular expression, and got a bit further, but still no luck:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []

It works if I omit Google.com, but then I need to do all the filtering manually. Is there any way to emulate :contains using BeautifulSoup?

Alternatively, is there some PyQuery-like library that works on App Engine (on Python 2.5)?

NT3RP asked Jun 06 '12 17:06


1 Answer

From the BeautifulSoup docs (emphasis mine):

"text is an argument that lets you search for NavigableString objects instead of Tags"

That is to say, your code:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))

is not the same as:

import re

regex = re.compile('.*This post is about.*Google.com.*')
# Find the "post" tags first, then test each tag's combined text.
[post for post in soup.findAll(True, 'post') if regex.match(post.text)]

The reason you have to remove the Google.com is that there's a NavigableString object in the BeautifulSoup tree for "This post is about", and another one for "Google.com", but they're under different elements.
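To see the separate text nodes for yourself, you can list them. This is only a sketch, and it assumes the newer bs4 package (BeautifulSoup 4) with its html.parser backend rather than the BeautifulSoup 3 from the question. The HTML is the snippet from the question, so the link text is Wikipedia.org:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 used in the question

html = """
<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
post = soup.find("div", class_="post")

# Every NavigableString under the post is its own node in the tree.
# Whitespace-only nodes (indentation, newlines) are skipped here.
text_nodes = [s.strip() for s in post.find_all(string=True) if s.strip()]
print(text_nodes)  # ['This post is about', 'Wikipedia.org']
```

No single node holds the whole sentence, which is why a text= search for the full phrase comes back empty.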

Incidentally, post.text exists but is not documented, so I wouldn't rely on it either; I wrote that code by accident! Use some other means of smushing together all the text under post.
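One possible way to do the smushing is sketched below. It again assumes bs4 (BeautifulSoup 4), where get_text() is documented. The helper name tags_containing and the second post in the sample HTML are made up for illustration:

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not the BS3 used in the question


def tags_containing(soup, css_class, needle):
    """Hypothetical helper: tags with css_class whose combined text contains needle."""
    # Collapse whitespace runs so markup indentation does not affect matching.
    wanted = " ".join(needle.split())
    matches = []
    for tag in soup.find_all(class_=css_class):
        # get_text() joins every text node under the tag, across child elements.
        combined = " ".join(tag.get_text().split())
        if wanted in combined:
            matches.append(tag)
    return matches


html = """
<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<div class="post">
    <div class="description">
        This post is about <a href="http://www.example.org">Example.org</a>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
found = tags_containing(soup, "post", "This post is about Wikipedia.org")
print(len(found))  # 1
```

This is a plain substring test, not a full reimplementation of :contains. In BeautifulSoup 3, joining the result of post.findAll(text=True) should play roughly the same role as get_text() here.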

Steve Jessop answered Oct 18 '22 14:10