
Parse href attribute value from element with Beautifulsoup and Mechanize

Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output, gather each value, and then insert it into a table named Tld with Python/Django.

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

I want to extract only the value of the href attribute of the <a> element, so only this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

The problem is that the find_all call above doesn't reach far enough down to get to the <a> element.

Any help is much appreciated. Thank you.

asked Nov 14 '13 16:11 by CodeTalk



1 Answer

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

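# Name the parser explicitly so bs4 does not warn or pick a different parser per machine
bs = BeautifulSoup(html, "html.parser")
# "h3.r a" selects every <a> nested inside an <h3 class="r">
elms = bs.select("h3.r a")
for a_elm in elms:
    print(a_elm.attrs["href"])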

prints:

https://billing.anapp.com/

h3.r a is a CSS selector.

You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself does not support XPath), or find on the elements. The selector h3.r a matches every h3 element with class r and returns the a elements inside them. A selector can be more complicated, for example #an_id table tr.the_tr_class td.the_td_class, which finds, inside the element with the given id, the td elements with class the_td_class that belong to a tr with class the_tr_class inside a table.
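As a rough illustration of the longer selector, here is a minimal sketch. The HTML below is invented purely for the demonstration (the ids and class names are the placeholder ones from the selector above), and it assumes the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Made-up markup: only one cell satisfies every part of the selector.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">outside the id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Each space in the selector means "somewhere inside", so every step narrows the search.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
matched_text = [td.get_text() for td in cells]
print(matched_text)
```

Only the first cell is returned; the others fail on the td class, the tr class, or the enclosing id.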

The following will also give you the same result. find_all returns a list of bs4.element.Tag objects. find_all also has a recursive argument, but I'm not sure whether you can do this in one line with it. I personally prefer CSS selectors because they are easy and clean.

for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
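
To fit this into the loop from the question, one option is to wrap the extraction in a small helper. This is only a sketch: extract_result_links is a hypothetical name, the sample HTML is a trimmed copy of the question's markup plus two invented counter-examples, and the a[href] part of the selector is an addition that skips anchors without an href.

```python
from bs4 import BeautifulSoup


def extract_result_links(html):
    """Return the href of every <a> found inside an <h3 class="r">."""
    soup = BeautifulSoup(html, "html.parser")
    # a[href] only matches anchors that actually carry an href attribute,
    # so the lookup a["href"] cannot raise KeyError.
    return [a["href"] for a in soup.select("h3.r a[href]")]


sample = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
</h3>
</div>
<h3 class="r"><a name="no-link">no href here</a></h3>
<h3 class="other"><a href="https://example.com/">different class</a></h3>
"""

links = extract_result_links(sample)
print(links)
```

Inside the question's loop this would be called as extract_result_links(mb.response().read()) for each url. Saving the values into the Tld table is left out here because the model's fields are not shown in the question.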

answered Sep 27 '22 22:09 by Foo Bar User