Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find tags with only certain attributes - BeautifulSoup

How would I, using BeautifulSoup, search for tags containing ONLY the attributes I search for?

For example, I want to find all <td valign="top"> tags.

The following code: raw_card_data = soup.fetch('td', {'valign':re.compile('top')})

gets all of the data I want, but also grabs any <td> tag that has the attribute valign:top

I also tried: raw_card_data = soup.findAll(re.compile('<td valign="top">')) and this returns nothing (probably because of bad regex)

I was wondering if there was a way in BeautifulSoup to say "Find <td> tags whose only attribute is valign:top"

UPDATE FOr example, if an HTML document contained the following <td> tags:

<td valign="top">.....</td><br />
<td width="580" valign="top">.......</td><br />
<td>.....</td><br />

I would want only the first <td> tag (<td width="580" valign="top">) to return

like image 239
Snaxib Avatar asked Jan 19 '12 21:01

Snaxib


People also ask

How do I exclude tags in BeautifulSoup?

Answer #1: You can use extract() to remove unwanted tag before you get text. But it keeps all 'n' and spaces so you will need some work to remove them. You can skip every Tag object inside external span and keep only NavigableString objects (it is plain text in HTML).

What is Find () method in BeautifulSoup?

find() method The find method is used for finding out the first tag with the specified name or id and returning an object of type bs4. Example: For instance, consider this simple HTML webpage having different paragraph tags.

How do you scrape a tag with BeautifulSoup?

Step-by-step Approach. Step 1: The first step will be for scraping we need to import beautifulsoup module and get the request of the website we need to import the requests module. Step 2: The second step will be to request the URL call get method.


3 Answers

As explained on the BeautifulSoup documentation

You may use this :

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

EDIT :

To return tags that have only the valign="top" attribute, you can check for the length of the tag attrs property :

from BeautifulSoup import BeautifulSoup

html = '<td valign="top">.....</td>\
        <td width="580" valign="top">.......</td>\
        <td>.....</td>'

soup = BeautifulSoup(html)
results = soup.findAll("td", {"valign" : "top"})

for result in results :
    if len(result.attrs) == 1 :
        print result

That returns :

<td valign="top">.....</td>
like image 162
Loïc G. Avatar answered Oct 11 '22 12:10

Loïc G.


You can use lambda functions in findAll as explained in documentation. So that in your case to search for td tag with only valign = "top" use following:

td_tag_list = soup.findAll(
                lambda tag:tag.name == "td" and
                len(tag.attrs) == 1 and
                tag["valign"] == "top")
like image 62
Yogesh Avatar answered Oct 11 '22 12:10

Yogesh


if you want to only search with attribute name with any value

from bs4 import BeautifulSoup
import re

soup= BeautifulSoup(html.text,'lxml')
results = soup.findAll("td", {"valign" : re.compile(r".*")})

as per Steve Lorimer better to pass True instead of regex

results = soup.findAll("td", {"valign" : True})
like image 55
Amr Avatar answered Oct 11 '22 12:10

Amr