
BeautifulSoup with multiple tags, each tag with a specific class

I am trying to use BeautifulSoup to parse a table from a website. (I am unable to share the website's source code because its use is restricted.)

I want to extract data only where both of the following tags are present, each with its specific attribute value:

td, width=40%
tr, valign=top

That is, I only want data that satisfies both of these tag and attribute pairs.

I found an existing discussion on passing multiple tags, but it covers only tag names, not their attributes. I tried extending the same list-based approach, but I don't think the result is what I want:

 my_soup=soup.find_all(['td',{"width":"40%"},'tr',{'valign':'top'}])

In summary: how can I pass multiple tags to find_all, each with its own specific attribute, so that the result 'ands' the two conditions?

PagMax asked Nov 07 '16 08:11




2 Answers

You can pass compiled regular expressions (re.compile objects) to find_all:

import re
from bs4 import BeautifulSoup as soup
html = """
  <table>
    <tr style='width:40%'>
      <td style='align:top'></td>
    </tr>
  </table>
"""
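# Anchor the tag-name pattern: bs4 applies it with search(), so a bare 'td|tr' would also match names such as 'strong'
results = soup(html, 'html.parser').find_all(re.compile(r'^(td|tr)$'), {'style': re.compile('width:40%|align:top')})
print(results)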

Output:

[<tr style="width:40%">
   <td style="align:top"></td>
 </tr>, <td style="align:top"></td>]

Because both the tag names and the style values are given as regular expressions, find_all returns every tr or td tag whose inline style attribute contains either width:40% or align:top. Note that this is an "or" match: any of the listed tags may carry any of the listed style values.
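If each tag name should be tied to its own attribute, as in the question (td with width=40%, tr with valign=top), one option is to pass a function to find_all instead of regular expressions. The following is only a sketch under assumptions: the sample markup, the `wanted` mapping, and the `matches` helper are all hypothetical names made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup using the attributes from the question.
html = """
<table>
  <tr valign="top"><td width="40%">a</td><td width="60%">b</td></tr>
  <tr valign="bottom"><td width="40%">c</td></tr>
</table>
"""

# Hypothetical mapping: each tag name is paired with the one
# attribute/value it must carry.
wanted = {"td": ("width", "40%"), "tr": ("valign", "top")}

def matches(tag):
    """Return True when the tag carries the attribute value listed for its own name."""
    if tag.name not in wanted:
        return False
    attr, value = wanted[tag.name]
    return tag.get(attr) == value

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(matches)
print([(t.name, t.get_text()) for t in results])
```

Unlike the regular-expression version, a tr with width=40% would not be returned here, because each tag name is checked only against its own attribute.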

The same approach can be extended to filter on several attributes at once:

html = """
 <table>
   <tr style='width:40%'>
    <td style='align:top' class='get_this'></td>
    <td style='align:top' class='ignore_this'></td>
  </tr>
</table>
"""
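# Both conditions in the dict must hold: a matching style and class 'get_this'
results = soup(html, 'html.parser').find_all(re.compile(r'^(td|tr)$'), {'style': re.compile('width:40%|align:top'), 'class': 'get_this'})
print(results)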

Output:

[<td class="get_this" style="align:top"></td>]

Edit 2: Simple recursive solution:

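import bs4
from bs4 import BeautifulSoup as soup

def get_tags(d, params):
    # Yield d if any attribute requested for its tag name matches
    for attr, wanted in params.get(d.name, {}).items():
        value = d.attrs.get(attr, [])
        # class is multi-valued (a list), so test membership; compare other attributes directly
        if (wanted in value) if attr == 'class' else (wanted == value):
            yield d
            break
    # Recurse into child tags only, skipping text nodes
    for child in d.contents:
        if isinstance(child, bs4.element.Tag):
            yield from get_tags(child, params)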

html = """
 <table>
  <tr style='align:top'>
    <td style='width:40%'></td>
    <td style='align:top' class='ignore_this'></td>
 </tr>
 </table>
"""
print(list(get_tags(soup(html, 'html.parser'), {'td':{'style':'width:40%'}, 'tr':{'style':'align:top'}})))

Output:

[<tr style="align:top">
  <td style="width:40%"></td>
  <td class="ignore_this" style="align:top"></td>
 </tr>, <td style="width:40%"></td>]

The recursive function lets you supply your own dictionary that maps each tag name to the attributes you want for that tag. For every element it visits, it checks the attributes listed for that element's tag name and yields the element as soon as any one of them matches.
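The special case for 'class' in the function above exists because BeautifulSoup stores class as a list of names, while most other attributes are plain strings. A small sketch, using a hypothetical one-cell fragment, shows the difference:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to inspect the attribute types.
cell = BeautifulSoup(
    "<td class='get_this wide' style='width:40%'></td>", "html.parser"
).td

print(cell.attrs["class"])  # a list of class names
print(cell.attrs["style"])  # a single string

# Membership suits the list; equality suits the string.
print("get_this" in cell.attrs["class"])
print(cell.attrs["style"] == "width:40%")
```

This is why a plain equality test against 'get_this' would fail for a cell that has more than one class.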

Ajax1234 answered Sep 28 '22 17:09


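Let's say bsObj is your BeautifulSoup object. Try:

# find_all returns a list-like ResultSet, so search each matching row in turn
rows = bsObj.find_all('tr', {'valign': 'top'})
td = [cell for row in rows for cell in row.find_all('td', {'width': '40%'})]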
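
The same nested lookup can also be written as a single CSS selector with select(). This is a sketch rather than part of the original answer; the sample markup is hypothetical, and the descendant combinator (a space) is chosen to mirror searching inside each matching row.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup using the attributes from the question.
html = """
<table>
  <tr valign="top"><td width="40%">keep</td><td width="60%">skip</td></tr>
  <tr valign="bottom"><td width="40%">skip too</td></tr>
</table>
"""

bsObj = BeautifulSoup(html, "html.parser")

# Cells with width=40% that sit inside a row with valign=top.
cells = bsObj.select('tr[valign="top"] td[width="40%"]')
print([c.get_text() for c in cells])
```

Only cells that satisfy both conditions are returned, which is the "and" behaviour the question asks for.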

Hope this helps.

Tarun Gupta answered Sep 28 '22 18:09