I am trying to get all <tr class="colour blue attr1 attr2"> rows from a page. The attrs are different each time, and some of the other sibling <tr>s have colour red, colour pink, etc. classes. So I'm looking for any other characters after colour blue in class to be included in the result. I've tried using *, but it didn't work:
soup.find_all('tr', {'class': 'colour blue*'})
Thank you
Beautiful Soup's find_all() method returns a list of all the tags or strings that match the given criteria.
find returns only the first matching element, or None if nothing matches. find_all scans the entire document and returns every match.
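As a rough illustration of the difference between the two, here is a minimal sketch. The two-row table and the variable names are made up for the example, and html.parser is just one possible parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, only for illustration.
html = '<table><tr class="colour blue a"></tr><tr class="colour red b"></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('tr')         # first matching tag only
all_rows = soup.find_all('tr')  # every matching tag, in document order
missing = soup.find('td')       # no <td> in the markup, so this is None

print(first['class'])
print(len(all_rows))
print(missing)
```

Because find gives back None when nothing matches, it is worth checking the result before indexing into it.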
Beautiful Soup is a Python library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser and provides Pythonic idioms for iterating, searching, and modifying the parse tree.
You can use common CSS selectors with Beautiful Soup:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <tr class="colour blue attr1 attr2"></tr>
... <tr class="colour red attr1 attr2"></tr>
... <tr class="unwanted attr1 attr2"></tr>
... <tr class="colour blue attr3"></tr>
... <tr class="another attr1 attr2"></tr>
... ''', 'html.parser')
>>> soup.select('tr.colour.blue')
[<tr class="colour blue attr1 attr2"></tr>, <tr class="colour blue attr3"></tr>]
The tr.colour.blue selector matches any tr whose class attribute contains both colour and blue, whatever other classes it also has.
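One consequence worth knowing: a class selector looks at individual class names, not at the text of the attribute, so the order of the classes does not matter. A small sketch of that, assuming html.parser as the parser; the second row with the classes reordered is a hypothetical addition and not part of the original page.

```python
from bs4 import BeautifulSoup

html = '''
<table>
<tr class="colour blue attr1 attr2"></tr>
<tr class="blue attr3 colour"></tr>
<tr class="colour red attr1 attr2"></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Both classes must be present, in any order and alongside any others.
rows = soup.select('tr.colour.blue')

# Beautiful Soup exposes the class attribute as a list of names.
classes = [row['class'] for row in rows]
print(classes)
```

If the page always writes colour blue in that exact order this makes no difference, but it does mean the selector is a bit more permissive than matching on the literal text.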
Use a regex filter:
import re
soup.find_all('tr', class_=re.compile(r'colour blue.+'))
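Beautiful Soup tests the pattern with its search() method, the same as re.search(), so the pattern only has to match somewhere in the class string. In the pattern, . matches any character except a newline, and + means the preceding . must occur one or more times, so at least one character has to follow colour blue.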
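To see what the pattern itself accepts, here is a sketch that uses only the standard re module on plain class strings. The strings are taken from the sample markup, except the bare colour blue entry, which is a hypothetical case added to show the effect of +.

```python
import re

# Pattern from the answer: "colour blue" followed by at least one more character.
pattern = re.compile(r'colour blue.+')

class_values = [
    'colour blue attr1 attr2',
    'colour red attr1 attr2',
    'unwanted attr1 attr2',
    'colour blue attr3',
    'colour blue',  # hypothetical: nothing after "blue"
]

# search() scans the whole string for a place where the pattern fits.
matched = [value for value in class_values if pattern.search(value)]
print(matched)
```

The bare colour blue string is left out because .+ needs at least one character after it. If rows with exactly those two classes should also be returned, r'colour blue' without the .+ would be the thing to try.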