I am using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.
The page has some tags in the form of
<div class="comment form new"...>
which I want to ignore, and also some tags in the form of
<div class="comment comment-xxxx...">
where the x's represent an integer of arbitrary length, and the ellipsis represents an arbitrary number of other values separated by whitespace (that I'm not concerned about). I can't figure out the correct regular expression, especially since I've never used Python's re module.
Using
soup.find_all(class_="comment")
finds all tags whose class includes the word comment. I have tried using
soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))
and lots of other variations, but I think I'm missing something obvious here about how regular expressions or match() work. Can anyone help me out?
To use Beautiful Soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. If you don't name one, it picks the best one installed, preferring lxml and falling back to Python's built-in html.parser . You may already have lxml, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
Right after the installation you can start using BeautifulSoup. At the beginning of your Python script, import the library. Then you have to pass something to BeautifulSoup to create a soup object. That could be a string of markup or an open file handle, but not a URL: BeautifulSoup does not fetch the web page for you, you have to do that yourself.
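As a rough sketch of those steps, assuming the markup is already in a string (the html value below is made up for illustration, and html.parser is chosen only because it needs no extra install):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a page you fetched yourself
html = """
<div class="comment form new"></div>
<div class="comment comment-12345 odd"></div>
"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Collect the class attribute of every div in the document
classes = [div["class"] for div in soup.find_all("div")]
print(classes)
```

Each class attribute comes back as a list of separate values rather than one space-separated string.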
I think I've got it:
>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]
Notice that, unlike the equivalent in BS3, it's not this:
['comment form new', 'comment comment-xxxx...']
And that's why your regexps won't match.
But you can match, e.g., this:
>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]
Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course if you want to match 'comment-12345' but not 'comment-of-another-kind', you'd want, e.g., 'comment-\d+'.
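Putting that together, here is a small self-contained sketch. The markup, the class values and the text inside each div are invented for the demo, since the real page isn't shown:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one form div to ignore, one numbered comment to keep,
# and one non-numeric "comment-..." class that should also be ignored
html = """
<div class="comment form new">form</div>
<div class="comment comment-12345 odd">first</div>
<div class="comment comment-of-another-kind">other</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is searched for in the class values, so no trailing .* is needed
pattern = re.compile(r"comment-\d+")
matches = soup.find_all("div", class_=pattern)

texts = [div.get_text() for div in matches]
print(texts)
```

Only the div with the numeric comment-12345 class should come back. The plain comment class and comment-of-another-kind don't satisfy \d+.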