Parse href attribute value from element with Beautifulsoup and Mechanize

Tags:

python

parsing

html-parsing

beautifulsoup

django

Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output, gather each value, and then insert it into a table named Tld with Python/Django.

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

I only want to extract the value of the href attribute of the <a> element, so only this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

The problem is that the find_all call above doesn't reach far enough to get to the <a> element.

Any help is much appreciated. Thank you.

asked Nov 14 '13 16:11

CodeTalk

1 Answer

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

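# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
bs = BeautifulSoup(html, "html.parser")
# "h3.r a" matches every <a> nested inside an <h3 class="r">.
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])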

prints:

https://billing.anapp.com/
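To fit the loop from the question, the same selector can be wrapped in a small helper. This is only a sketch: `extract_hrefs` and the `pages` dictionary are made-up names, and the inline strings stand in for what `mb.response().read()` would return for each URL.

```python
from bs4 import BeautifulSoup


def extract_hrefs(page_source):
    """Return the href of every <a> found inside an <h3 class="r">."""
    soup = BeautifulSoup(page_source, "html.parser")
    # a[href] skips anchors that carry no href attribute at all.
    return [a["href"] for a in soup.select("h3.r a[href]")]


# Stand-in for the pages fetched in the question's loop; with mechanize,
# each value would come from mb.response().read() instead.
pages = {
    "page-1": '<h3 class="r"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>',
    "page-2": (
        '<h3 class="r"><a href="https://example.com/">Example</a></h3>'
        '<h3><a href="https://example.org/">h3 without class r</a></h3>'
    ),
}

collected = []
for name, source in pages.items():
    collected.extend(extract_hrefs(source))

print(collected)
```

Each collected value could then be saved through the Tld model. Its field names are not shown in the question, so that step is left out here.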

h3.r a is a CSS selector.

You can use CSS selectors (I prefer them) or search within elements with find_all. XPath is another option in general, but Beautiful Soup itself does not support it, so you would need a library such as lxml for that. The selector h3.r a looks for every h3 with class r and returns the a elements inside them. A selector can be more complicated, for example #an_id table tr.the_tr_class td.the_td_class, which finds the td elements with class the_td_class that belong to a tr with class the_tr_class, inside a table, all under the element with the id an_id.
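As a rough illustration of the longer selector, here is a sketch against invented markup. The ids and class names come from the selector above; the table contents are hypothetical and exist only to show which cells match and which do not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented only to exercise the selector from the text.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match 1</td>
      <td class="other">skipped: wrong td class</td>
    </tr>
    <tr class="other_row">
      <td class="the_td_class">skipped: wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">skipped: outside #an_id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
selector = "#an_id table tr.the_tr_class td.the_td_class"
# Spaces in the selector mean "descendant of", so every step narrows the search.
cells = [td.get_text(strip=True) for td in soup.select(selector)]
print(cells)
```

Only the first cell satisfies every part of the selector; the others fail on the td class, the tr class, or the enclosing id.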

The following also gives the same result. find_all returns a list-like ResultSet of bs4.element.Tag objects. find_all also has a recursive parameter, but I'm not sure whether you can do this in one line with it. I personally prefer CSS selectors because they are easy and clean.

for elm in bs.find_all('h3',attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
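If a single expression is wanted, the nested loops can be folded into a list comprehension. This is a sketch rather than the only way to do it: the extra anchor without an href is added here purely to show that `href=True` filters such elements out, and the long onmousedown attribute is dropped for brevity.

```python
from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/">Billing: Portal Home</a>
<a name="no-href">anchor without an href (added for illustration)</a>
</h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_="r" is shorthand for attrs={"class": "r"};
# href=True keeps only <a> elements that actually have an href attribute.
hrefs = [
    a["href"]
    for h3 in soup.find_all("h3", class_="r")
    for a in h3.find_all("a", href=True)
]
print(hrefs)
```

Filtering on `href=True` avoids a KeyError from `a["href"]` when a page contains anchors that have no href attribute.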

answered Sep 27 '22 22:09

Foo Bar User
