
Python regular expression for Beautiful Soup

I am using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.

The page has some tags in the form of

<div class="comment form new"...> 

which I want to ignore, and also some tags in the form of

<div class="comment comment-xxxx..."> 

where the x's represent an integer of arbitrary length, and the ellipsis represents an arbitrary number of other values separated by whitespace (which I'm not concerned about). I can't figure out the correct regular expression, especially since I've never used Python's re module.

Using

soup.find_all(class_="comment") 

finds every tag that has the class comment, including the ones I want to ignore. I have tried using

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and lots of other variations, but I think I'm missing something obvious about how regular expressions or match() work. Can anyone help me out?

user1890572 asked Dec 10 '12 03:12





1 Answer

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Notice that, unlike the equivalent in BS3, it's not this:

['comment form new', 'comment comment-xxxx...']

And that's why your regexps won't match.

But you can match, e.g., this:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]
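For anyone who wants to try this end to end, here is a minimal sketch. The sample markup, the numeric suffixes and the extra classes (odd, even) are made up for illustration, and I'm using the built-in html.parser so lxml isn't required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: the numeric suffixes and extra classes are invented.
html = """
<div class="comment form new">reply form</div>
<div class="comment comment-12345 odd">first comment</div>
<div class="comment comment-678 even">second comment</div>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed

# BS4 exposes the class attribute as a list of separate values.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists[0])  # ['comment', 'form', 'new']

# Keep only the divs that carry a class value containing "comment-".
matches = soup.find_all("div", class_=re.compile(r"comment-"))
texts = [div.get_text() for div in matches]
print(texts)  # ['first comment', 'second comment']
```

The "comment form new" div is skipped because none of its class values contains "comment-".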

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you'd want, e.g., r'comment-\d+'.
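To see the search-versus-match difference without Beautiful Soup in the way, here is a rough sketch using only the re module on a list of class values. The value 'new-comment-7' is hypothetical, added only to show where search and match disagree.

```python
import re

# Class values as separate strings; 'new-comment-7' is a made-up edge case.
classes = ["comment", "comment-12345", "comment-of-another-kind", "new-comment-7"]

loose = re.compile(r"comment-")
strict = re.compile(r"comment-\d+")
anchored = re.compile(r"^comment-\d+$")

# search() looks for the pattern anywhere in the string.
loose_hits = [c for c in classes if loose.search(c)]
print(loose_hits)   # ['comment-12345', 'comment-of-another-kind', 'new-comment-7']

strict_hits = [c for c in classes if strict.search(c)]
print(strict_hits)  # ['comment-12345', 'new-comment-7']

# match() only succeeds at the start of the string.
match_hits = [c for c in classes if strict.match(c)]
print(match_hits)   # ['comment-12345']

# Anchoring the pattern gives the same strictness even under search().
anchored_hits = [c for c in classes if anchored.search(c)]
print(anchored_hits)  # ['comment-12345']
```

So if stray values like 'new-comment-7' are a concern, passing the anchored pattern to class_ should rule them out, since the anchors do the work that match() would otherwise do.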

abarnert answered Oct 21 '22 17:10
