I am trying to use BeautifulSoup to parse a table from a website. (I am unable to share the website source code as it is restricted use.)
I am trying to extract the data only if it has the following two tags with these specific attributes:
td, width=40%
tr, valign=top
I want to extract only the data that matches both of these tag and attribute pairs.
I found some discussion on using multiple tags here, but it covers only tags, not attributes. I did try to extend that code with the same logic of passing a list, but I think what I get is not what I want:
my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])
To summarize, my question is how to use multiple tags in find_all, each with a specific attribute, so that the result 'ands' both conditions.
To find multiple tags, you can use the comma (,) CSS selector, which lets you specify multiple tags separated by commas.
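As a rough illustration of the selector approach, here is a minimal sketch using select() with attribute selectors. The HTML snippet and the variable names are made up for the example, since the real page is not available. A comma gives "either" semantics, while a descendant selector is closer to the "and" the question asks for.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the restricted page
html = """
<table>
<tr valign="top"><td width="40%">keep</td><td width="60%">skip</td></tr>
<tr valign="bottom"><td width="40%">skip too</td></tr>
</table>
"""
doc = BeautifulSoup(html, 'html.parser')

# Comma: any <tr valign="top"> OR any <td width="40%">
either = doc.select('tr[valign="top"], td[width="40%"]')

# Descendant selector: <td width="40%"> located inside a <tr valign="top">
both = doc.select('tr[valign="top"] td[width="40%"]')
both_texts = [td.get_text() for td in both]
```

Here both_texts holds only the text of cells that satisfy both conditions.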
Step-by-step approach
Step 1: Import the BeautifulSoup module for scraping and the requests module for fetching the page.
Step 2: Request the URL by calling the get method.
find() method: The find method finds the first tag with the specified name or id and returns an object of type bs4. Example: For instance, consider this simple HTML webpage having different paragraph tags.
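A small sketch of that behavior, assuming a made-up page with two paragraph tags (the ids and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical page with two paragraph tags
html = "<p id='first'>one</p><p id='second'>two</p>"
doc = BeautifulSoup(html, 'html.parser')

first_p = doc.find('p')          # first <p> in document order
by_id = doc.find(id='second')    # first tag whose id is "second"
missing = doc.find('table')      # no match, so find() returns None
```

Unlike find_all, which returns a list of every match, find stops at the first match and gives back None when nothing matches.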
You can use an re.compile object with soup.find_all:
import re
from bs4 import BeautifulSoup as soup
html = """
<table>
<tr style='width:40%'>
<td style='align:top'></td>
</tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top')})
Output:
[<tr style="width:40%">
<td style="align:top"></td>
</tr>, <td style="align:top"></td>]
By providing the re.compile object to specify the desired tags and style values, find_all will return any instances of a tr or td tag containing an inline style attribute of either width:40% or align:top.
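One caveat worth noting: Beautiful Soup applies a compiled pattern with its search() method, so an unanchored pattern such as td|tr also matches any tag name that merely contains those letters, for example strong. A minimal sketch of anchoring the pattern, using a made-up HTML snippet and illustrative variable names:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the <strong> tag contains the letters "tr"
html = ('<table><tr style="width:40%"><td style="align:top">'
        '<strong style="align:top">x</strong></td></tr></table>')
doc = BeautifulSoup(html, 'html.parser')
style_pattern = re.compile(r'width:40%|align:top')

# Unanchored: search() finds "tr" inside "strong" as well
loose = doc.find_all(re.compile('td|tr'), {'style': style_pattern})

# Anchored: only tags named exactly td or tr
strict = doc.find_all(re.compile(r'^(td|tr)$'), {'style': style_pattern})

loose_names = [t.name for t in loose]
strict_names = [t.name for t in strict]
```

Passing a plain list such as ['td', 'tr'] as the name argument is another way to get exact tag-name matching.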
This method can be extended to filter on several attributes at once by providing multiple attribute values:
html = """
<table>
<tr style='width:40%'>
<td style='align:top' class='get_this'></td>
<td style='align:top' class='ignore_this'></td>
</tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top'), 'class':'get_this'})
Output:
[<td class="get_this" style="align:top"></td>]
Edit 2: Simple recursive solution:
import bs4
from bs4 import BeautifulSoup as soup
def get_tags(d, params):
    # Yield d if any attribute requested for its tag name matches
    if any((lambda x:b in x if a == 'class' else b == x)(d.attrs.get(a, [])) for a, b in params.get(d.name, {}).items()):
        yield d
    # Recurse into child tags, skipping text nodes
    for i in filter(lambda x:x != '\n' and not isinstance(x, bs4.element.NavigableString) , d.contents):
        yield from get_tags(i, params)
html = """
<table>
<tr style='align:top'>
<td style='width:40%'></td>
<td style='align:top' class='ignore_this'></td>
</tr>
</table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))
Output:
[<tr style="align:top">
<td style="width:40%"></td>
<td class="ignore_this" style="align:top"></td>
</tr>, <td style="width:40%"></td>]
The recursive function enables you to provide your own dictionary with desired target attributes for certain tags: this solution attempts to match any of the specified attributes to the bs4 object passed to the function, and if a match is discovered, the element is yielded.
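Let's say bsObj is your BeautifulSoup object. Try:
rows = bsObj.find_all('tr', {'valign': 'top'})
# find_all returns a list-like ResultSet, so search each matching row for cells
cells = [td for tr in rows for td in tr.find_all('td', {'width': '40%'})]
Hope this helps.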
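Another way to express the "and" condition in a single find_all call is to pass a function as the filter. This is only a sketch under assumptions: the HTML below is invented to stand in for the restricted page, and is_target_cell is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the restricted page
html = """
<table>
<tr valign="top"><td width="40%">wanted</td><td width="60%">other</td></tr>
<tr valign="bottom"><td width="40%">wrong row</td></tr>
</table>
"""

def is_target_cell(tag):
    # True only for <td width="40%"> with an enclosing <tr valign="top">
    return (
        tag.name == 'td'
        and tag.get('width') == '40%'
        and tag.find_parent('tr', {'valign': 'top'}) is not None
    )

doc = BeautifulSoup(html, 'html.parser')
cells = doc.find_all(is_target_cell)
cell_texts = [c.get_text() for c in cells]
```

The function is called once per tag, so both conditions are checked together and only cells that satisfy both end up in cells.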