
How to get a nested element in beautiful soup


I am struggling with the syntax required to grab some hrefs in a td. The table, tr and td elements don't have any classes or ids.

If I wanted to grab the anchor in this example, what would I need?

<tr><td><a>...

Thanks

asked Jun 29 '09 at 14:06 by joepour




2 Answers

As per the docs, you first make a parse tree:

    import BeautifulSoup

    html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
    soup = BeautifulSoup.BeautifulSoup(html)

and then you search in it, for example for <a> tags whose immediate parent is a <td>:

    for ana in soup.findAll('a'):
        if ana.parent.name == 'td':
            print ana["href"]
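
The snippets in this answer target the original BeautifulSoup 3 module on Python 2. As a rough sketch of the same parent check on Python 3, assuming the newer bs4 package (installed as beautifulsoup4) is available, something like the following should work. The sample table and the name direct_hrefs are made up for illustration and are not part of the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one anchor directly inside a <td>,
# and one anchor wrapped in a <p> inside a <td>.
html = """
<table>
  <tr><td><a href="foo">first</a></td></tr>
  <tr><td><p><a href="bar">nested deeper</a></p></td></tr>
</table>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that have an href and whose immediate parent is a <td>.
direct_hrefs = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a.parent.name == "td"
]
print(direct_hrefs)
```

The second anchor is skipped because its immediate parent is the <p>, not the <td>. Dropping the parent check would collect every anchor in the document.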
answered Nov 06 '22 at 04:11 by Alex Martelli


Something like this?

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)
    anchors = [td.find('a') for td in soup.findAll('td')]

That should find the first "a" inside each "td" in the html you provide. You can tweak td.find to be more specific or else use findAll if you have several links inside each td.
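
To make the difference between the two calls concrete, here is a small sketch, assuming Python 3 and the newer bs4 package rather than the BeautifulSoup 3 import used above (in bs4, findAll is spelled find_all). The sample HTML and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical row: a cell with two links, a cell with none, a cell with one.
html = """
<table>
  <tr>
    <td><a href="one">1</a> <a href="two">2</a></td>
    <td>no link here</td>
    <td><a href="three">3</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first <a> in each cell, or None if there is none.
first_per_cell = [td.find("a") for td in soup.find_all("td")]

# find_all() returns every <a> in each cell; cells without links add nothing.
all_links = [
    a["href"]
    for td in soup.find_all("td")
    for a in td.find_all("a", href=True)
]
print(all_links)
```

Note that first_per_cell has one entry per td, including a None for the cell without a link, while all_links is a flat list of hrefs.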

UPDATE: re Daniele's comment, if you want to make sure there are no None values in the list, you could modify the list comprehension as follows:

    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)
    anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a]

This just adds a check that td.find('a') actually returned an element.
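
As an aside that goes beyond what this answer uses: in the newer bs4 package, a CSS selector can express "anchor inside a td" directly, and it never yields None entries. The sketch below assumes Python 3 with bs4 installed; the sample HTML and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a direct child link, an empty cell, and a wrapped link.
html = """
<table>
  <tr><td><a href="foo">first</a></td><td>empty</td></tr>
  <tr><td><span><a href="bar">wrapped</a></span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "td > a[href]" matches anchors that are direct children of a <td>.
child_hrefs = [a["href"] for a in soup.select("td > a[href]")]

# "td a[href]" matches anchors anywhere below a <td>.
descendant_hrefs = [a["href"] for a in soup.select("td a[href]")]

print(child_hrefs)
print(descendant_hrefs)
```

Which selector fits depends on whether the links sit directly in the cell or inside further markup such as a span.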

answered Nov 06 '22 at 04:11 by John Montgomery