BeautifulSoup with multiple tags, each tag with a specific class

I am trying to use BeautifulSoup to parse a table from a website. (I cannot share the page source because its use is restricted.)

I want to extract data only where both of the following tags appear, each with its specific attribute (what I am calling a class):

td, width=40%
tr, valign=top

In other words, I only want data that matches both of these tag and class pairs.

I found some discussion on using multiple tags here, but it covers only tags, not classes. I tried extending that code with the same list-based logic, but I don't think the result is what I want:

 my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])

To summarize: how can I pass multiple tags to find_all, each with its own class, so that the result 'ands' the two conditions?

asked Nov 07 '16 08:11

PagMax

2 Answers

You can pass a compiled regular expression (re.compile) to find_all:

import re
from bs4 import BeautifulSoup as soup
html = """
  <table>
    <tr style='width:40%'>
      <td style='align:top'></td>
    </tr>
  </table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top')})

Output:

[<tr style="width:40%">
   <td style="align:top"></td>
 </tr>, <td style="align:top"></td>]

Given compiled patterns for the tag name and the style value, find_all returns every tr or td tag whose inline style attribute contains either width:40% or align:top.
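One caveat with the pattern above: Beautiful Soup applies a compiled pattern with search(), so 'td|tr' matches any tag whose name merely contains "td" or "tr". The following is a minimal sketch, assuming hypothetical markup with a strong tag inside the cell, that compares the unanchored pattern with an anchored one.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: <strong> is not a table tag, but its name contains "tr".
html = """
<table>
  <tr style="width:40%">
    <td style="align:top"><strong style="align:top">x</strong></td>
  </tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")
styles = {"style": re.compile(r"width:40%|align:top")}

# Unanchored: the pattern is applied with search(), so "strong" matches "tr".
loose = [t.name for t in doc.find_all(re.compile(r"td|tr"), styles)]

# Anchored: only tags named exactly "td" or "tr" match.
strict = [t.name for t in doc.find_all(re.compile(r"^(td|tr)$"), styles)]

print(loose)   # ['tr', 'td', 'strong']
print(strict)  # ['tr', 'td']
```

Anchoring the pattern is one option; passing a plain list of names such as ['td', 'tr'] should behave the same way as the anchored version.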

This approach extends to filtering on several attributes at once:

html = """
 <table>
   <tr style='width:40%'>
    <td style='align:top' class='get_this'></td>
    <td style='align:top' class='ignore_this'></td>
  </tr>
</table>
"""
results = soup(html, 'html.parser').find_all(re.compile('td|tr'), {'style':re.compile('width:40%|align:top'), 'class':'get_this'})

Output:

[<td class="get_this" style="align:top"></td>]
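The same filter can probably be written without a regular expression for the tag names. This sketch is not part of the original answer: it assumes the same markup and uses a list of names plus keyword arguments (class_ is Beautiful Soup's keyword for the class attribute). All keyword filters have to hold at once, which is why the tr and the second td are excluded.

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
  <tr style="width:40%">
    <td style="align:top" class="get_this"></td>
    <td style="align:top" class="ignore_this"></td>
  </tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

# A list of names matches exactly "td" or "tr"; every keyword filter must hold.
cells = doc.find_all(
    ["td", "tr"],
    style=re.compile(r"width:40%|align:top"),
    class_="get_this",
)
print([c.name for c in cells])  # ['td']
```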

Edit 2: Simple recursive solution:

import bs4
from bs4 import BeautifulSoup as soup
def get_tags(d, params):
    # Yield d if any attribute listed for its tag name matches
    for a, b in params.get(d.name, {}).items():
        value = d.attrs.get(a, [])
        # class is stored as a list of names; other attributes are plain strings
        if (b in value) if a == 'class' else (b == value):
            yield d
            break
    # Recurse into child tags only, skipping text nodes
    for i in d.contents:
        if isinstance(i, bs4.element.Tag):
            yield from get_tags(i, params)

html = """
 <table>
  <tr style='align:top'>
    <td style='width:40%'></td>
    <td style='align:top' class='ignore_this'></td>
 </tr>
 </table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))

Output:

[<tr style="align:top">
  <td style="width:40%"></td>
  <td class="ignore_this" style="align:top"></td>
 </tr>, <td style="width:40%"></td>]

The recursive function lets you supply your own dictionary that maps tag names to target attributes. For each element it visits, it compares the attributes specified for that tag name and yields the element if any one of them matches.
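Matching any one attribute returns rows and cells independently. If the goal is the 'and' from the question (a td with width=40% that sits inside a tr with valign=top), one possible approach is a function filter passed to find_all. This is a sketch under assumptions: the markup and the helper name is_wanted_cell are hypothetical, and it uses the question's plain attributes rather than inline styles.

```python
from bs4 import BeautifulSoup

# Hypothetical table using the attributes named in the question.
html = """
<table>
  <tr valign="top">
    <td width="40%">keep</td>
    <td width="60%">skip: wrong width</td>
  </tr>
  <tr valign="bottom">
    <td width="40%">skip: wrong row</td>
  </tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

def is_wanted_cell(tag):
    # True only for <td width="40%"> whose enclosing row is <tr valign="top">.
    if tag.name != "td" or tag.get("width") != "40%":
        return False
    row = tag.find_parent("tr")
    return row is not None and row.get("valign") == "top"

wanted = doc.find_all(is_wanted_cell)
print([td.get_text() for td in wanted])  # ['keep']
```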

answered Sep 28 '22 17:09

Ajax1234

Let's say bsObj is your BeautifulSoup object. Try:

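# find_all returns a list-like ResultSet, so search each matching row in turn
tr = bsObj.find_all('tr', {'valign': 'top'})
td = [cell for row in tr for cell in row.find_all('td', {'width': '40%'})]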
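A row-then-cell lookup like this can likely also be expressed as a single CSS selector through select(), which recent Beautiful Soup releases provide via the Soup Sieve package. The markup below is hypothetical and the sketch assumes that package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical table; only the first cell satisfies both conditions.
html = """
<table>
  <tr valign="top">
    <td width="40%">match</td>
    <td width="20%">narrow cell</td>
  </tr>
  <tr valign="middle">
    <td width="40%">other row</td>
  </tr>
</table>
"""
bsObj = BeautifulSoup(html, "html.parser")

# "tr[...] > td[...]" selects a td with width=40% that is a direct child
# of a tr with valign=top, so both conditions must hold together.
cells = bsObj.select('tr[valign="top"] > td[width="40%"]')
print([cell.get_text() for cell in cells])  # ['match']
```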

Hope this helps.

answered Sep 28 '22 18:09

Tarun Gupta

Tags: python, html, beautifulsoup, tags, findall
