Can anyone help me traverse an HTML tree with Beautiful Soup?
I'm trying to parse HTML output, gather each value, and then insert it into a table named Tld with Python/Django.
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
I only want to parse the value of the href attribute of <a>, so only this part:
https://billing.anapp.com/
of:
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
I currently have:
for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})
The problem is that the find_all call above doesn't make it far enough to reach the <a> element.
Any help is much appreciated. Thank you.
Beautifulsoup: Get the attribute value of an element
1. Find all by ul tag.
2. Iterate over the result.
3. Get the class value of each element.
In the following example, we'll get the href attribute value.
To get href with Python BeautifulSoup, we can use the find_all method:
    from bs4 import BeautifulSoup
    html = '''<a href="some_url">next</a> <span class="class"><a href="another_url">later</a></span>'''
    soup = BeautifulSoup(html, 'html.parser')
    # href=True keeps only the <a> tags that actually have an href attribute
    for a in soup.find_all('a', href=True):
        print(a['href'])
Beautiful Soup, a Python library for web scraping, exposes the attributes of each tag it parses. Web scraping is the process of extracting data from a website with automated tools instead of by hand. A tag may have any number of attributes. For example, the tag <b class="active"> has an attribute "class" whose value is "active".
In the find_all example, we create the soup object by calling the BeautifulSoup class with the HTML string. Then we get the a elements that have an href attribute by calling find_all with 'a' and href set to True.
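To make the description of tag attributes concrete, here is a minimal sketch of the usual ways to read them. The markup is a hypothetical one-line snippet, and "html.parser" is assumed as the parser so that no extra dependency is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = '<b class="active" id="first">bold text</b>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("b")

# tag.attrs is a dict holding every attribute on the tag.
attrs = tag.attrs

# class is multi-valued in HTML, so Beautiful Soup returns it as a list.
classes = tag["class"]

# tag.get() returns None instead of raising KeyError for a missing attribute.
tag_id = tag.get("id")
missing = tag.get("href")

print(attrs)
print(classes, tag_id, missing)
```

Indexing with tag["href"] raises KeyError when the attribute is absent, so tag.get("href") is the safer choice when some tags may lack it.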
from bs4 import BeautifulSoup
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""
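# name the parser explicitly so Beautiful Soup does not have to guess one
bs = BeautifulSoup(html, "html.parser")
# every <a> that sits anywhere inside an <h3 class="r">
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])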
prints:
https://billing.anapp.com/
h3.r a is a CSS selector. You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself does not support it), or find on elements. The selector h3.r a looks for every h3 with class r and returns the a elements inside them. It could be a more complicated example like #an_id table tr.the_tr_class td.the_td_class, which finds the td elements with class the_td_class that belong to a tr with class the_tr_class, inside a table, inside the element with id an_id.
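The longer selector can be tried on a small document. This is only a sketch: the markup is made up for illustration, and its id and class names simply mirror the ones in the selector above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the id and class names mirror the selector in the text.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">outside #an_id</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "somewhere inside", at any depth.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
texts = [td.get_text() for td in cells]
print(texts)
```

Only the first cell satisfies every part of the selector; the others fail on the td class, the tr class, or the enclosing id.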
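The loop below will also give you the same result. find_all returns a list-like ResultSet of bs4.element.Tag objects, and each Tag has its own find_all, so the calls can be nested. find_all also has a recursive argument, but I'm not sure you can do this in one line with it. I personally prefer CSS selectors because they are easy and clean.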
for elm in bs.find_all('h3',attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
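On the question of doing it in one line: one possible approach, sketched here, is to fold the nested loops into a list comprehension. The HTML is a trimmed copy of the question's markup (the onmousedown attribute is dropped and the div is closed), and the variable names are placeholders.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, for illustration only.
html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Nested find_all calls collapsed into a single list comprehension.
hrefs_find = [
    a["href"]
    for h3 in soup.find_all("h3", class_="r")
    for a in h3.find_all("a", href=True)
]

# The same result with a CSS selector; [href] skips anchors without that attribute.
hrefs_select = [a["href"] for a in soup.select("h3.r a[href]")]

print(hrefs_find)
print(hrefs_select)
```

Collecting the values into a list rather than printing them makes it easier to hand them to whatever stores them afterwards, such as the Tld table mentioned in the question.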