Python regular expression for Beautiful Soup

I am using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.

The page has some tags in the form of

<div class="comment form new"...>

which I want to ignore, and also some tags in the form of

<div class="comment comment-xxxx...">

where the x's represent an integer of arbitrary length, and the ellipsis represents an arbitrary number of other values separated by whitespace (which I'm not concerned about). I can't figure out the correct regular expression, especially since I've never used Python's re module.

Using

soup.find_all(class_="comment")

finds all tags starting with the word comment. I have tried using

soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))

and lots of other variations, but I think I'm missing something obvious about how regular expressions or match() work. Can anyone help me out?

asked Dec 10 '12 03:12

user1890572

1 Answer

I think I've got it:

>>> [div['class'] for div in soup.find_all('div')]
[['comment', 'form', 'new'], ['comment', 'comment-xxxx...']]

Notice that, unlike the equivalent in BS3, it's not this:

['comment form new', 'comment comment-xxxx...']

And that's why your regexps won't match.
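To make that concrete, here is a small sketch using only the re module. It models the behaviour described above rather than calling Beautiful Soup itself: the class strings are hypothetical stand-ins for the question's markup (12345 replaces the x's), and splitting on whitespace is my assumption about how the list of class tokens is produced. Newer Beautiful Soup 4 releases are documented to also accept the exact full class string, so results may vary by version.

```python
import re

# Hypothetical class attribute values, modelled on the question's markup.
class_attrs = ["comment form new", "comment comment-12345 odd"]

# Model of a multi-valued attribute: one token per whitespace-separated class.
tokens_per_div = [value.split() for value in class_attrs]

# One of the patterns from the question, tested against each token on its own.
pattern = re.compile(r"comment comment.*")
per_token_hits = [
    [bool(pattern.search(token)) for token in tokens]
    for tokens in tokens_per_div
]

print(tokens_per_div)
print(per_token_hits)
```

No single token contains a space, so a pattern that requires one can never succeed when it is tested token by token.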

But you can match, e.g., this:

>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]

Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you'd want, e.g., r'comment-\d+'.
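Here is a minimal, self-contained sketch of both patterns side by side. The markup is hypothetical (12345 stands in for the x's, and the third div is invented to show the difference), and it assumes bs4 is installed and that the built-in "html.parser" backend is good enough for your page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question; not the asker's real page.
html = """
<div class="comment form new"></div>
<div class="comment comment-12345 odd"></div>
<div class="comment comment-of-another-kind"></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Loose pattern: any class token containing "comment-".
loose = soup.find_all("div", class_=re.compile(r"comment-"))

# Stricter pattern: "comment-" must be followed by at least one digit.
strict = soup.find_all("div", class_=re.compile(r"comment-\d+"))

print([div["class"] for div in loose])
print([div["class"] for div in strict])
```

The loose pattern picks up the last two divs, while the stricter one keeps only the numbered comment. If trailing junk such as 'comment-12345abc' should also be rejected, anchoring the pattern as r'^comment-\d+$' is one option.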

answered Oct 21 '22 17:10

abarnert

Tags: python, regex, beautifulsoup
