I am using Python's Beautiful Soup to get the contents of:
<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>
My code is as follows:
html_doc="""<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follows:
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi', u'\n']
How can I get only the result, as a single string in this form: abc,def,ghi?
I would also like to understand the output I am getting.
If you just strip the items in breadcrum, you end up with empty strings in your list. You can either strip them as shaktimaan suggested and then drop the empty strings with
breadcrum = filter(None, breadcrum)
Or you can remove the line breaks beforehand (in html_doc), so that no newline-only text nodes are created:
html_doc = html_doc.replace('\n', '').replace('\r', '')
Either way, to get your string output, do something like this:
','.join(breadcrum)
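Putting the strip, filter, and join steps together, here is a minimal sketch using the html_doc from the question. The 'html.parser' argument and the name texts are my assumptions, not part of the original code; string=True is the newer spelling of text=True in Beautiful Soup 4.4 and later.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""

# Parser choice is an assumption; the question does not name one
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Every text node under the div, including the newline-only ones
breadcrum = path.find_all(string=True)

# Strip each item, then drop the strings that became empty
texts = [item.strip() for item in breadcrum]
texts = list(filter(None, texts))

result = ",".join(texts)
print(result)  # abc,def,ghi
```

Wrapping filter in list() keeps the result a real list on Python 3, where filter returns a lazy iterator.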
You could do this:
breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip()
condition drops the items that would be empty after stripping the newline characters.
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
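As an alternative that none of the answers here mention, Beautiful Soup also exposes a stripped_strings generator on every tag, which does the stripping and the skipping of whitespace-only strings in one step. A short sketch, again assuming the html_doc from the question and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""

# Parser choice is an assumption; the question does not name one
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that consist only of whitespace
result = ",".join(path.stripped_strings)
print(result)  # abc,def,ghi
```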
EDIT
Although the above gives you what you want, as others in the thread have pointed out, this is not the right way to use BS to extract anchor texts. Once you have the div you are interested in, you should use it to get its children and then read the anchor text, like this:
path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)
And then do a ','.join(data)
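The anchor-based approach can be sketched as a complete script. This assumes the html_doc from the question and the built-in 'html.parser'; it uses get_text(strip=True) instead of .text so that the leading space inside each anchor in the question's HTML is removed.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
<a href="#"> abc</a>
<a href="#"> def</a>
<a href="#"> ghi</a>
</div>"""

# Parser choice is an assumption; the question does not name one
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Only the <a> tags are visited, so the newline-only text nodes
# between them never enter the list
data = [a.get_text(strip=True) for a in path.find_all("a")]

result = ",".join(data)
print(result)  # abc,def,ghi
```

Selecting the anchors explicitly also keeps the result stable if other text is later added inside the div.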