I am learning BeautifulSoup and have a webpage that has a body something like this:
html:
<div>
  <table>
    <tr>
      <td>
        <div>
          this is div text
          <a name='abc'>this is anchor text</a>
        </div>
      </td>
    </tr>
  </table>
</div>
Expected result:
tag    text                   parents
===    =====                  =======
div    ""                     ""
table  ""                     div
...
div    this is div text       div.table.tr.td
a      this is anchor text    div.table.tr.td.a
I am able to get this result, but the problem is that for the div I also get the anchor text, as shown below:
div    this is div text this is anchor text    div.table.tr.td
a      this is anchor text                     div.table.tr.td.a
Below is my code:
from bs4 import BeautifulSoup

f = open("C:/abc.html", encoding="utf8")
soup = BeautifulSoup(f, "lxml")
f.close()

for tag in soup.find_all():
    # Build the dotted chain of ancestor names, outermost first
    allparent = ""
    for parenttags in tag.findParents():
        allparent = parenttags.name + "." + allparent
    if allparent != "":
        allparent = allparent[:-1]  # drop the trailing "."
    print(tag.name+"', '"+tag.text+"','"+allparent)
You are looking for tag.find(text=True).
If tag is your <div>foo<span>bar</span></div>:
tag.find(text=True) will output foo
tag.text will output foobar
So, in your case, just replace
print(tag.name+"', '"+tag.text+"','"+allparent)
by
print(tag.name+"', '"+tag.find(text=True)+"','"+allparent)
Or better,
print('"{}", "{}", "{}"'.format(tag.name, tag.find(text=True), allparent))
Isn't that sexier?!
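One caveat worth checking: tag.find(text=True) searches all descendants and returns the first string it finds, so for a tag with no text of its own it can return text from a nested tag, or None if there is no text at all (which would break the string concatenation version). Recent BeautifulSoup versions also accept string=True as the newer name for text=True. Below is a minimal sketch, not the only way to do it, that collects only the strings sitting directly inside each tag. The helper names own_text and parent_path are made up for this example, and it uses the built-in "html.parser" instead of "lxml" so that no html/body wrapper tags are added.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup modelled on the question (assumed, not the real abc.html)
HTML = """<div><table><tr><td><div>
this is div text
<a name='abc'>this is anchor text</a>
</div></td></tr></table></div>"""


def own_text(tag):
    """Return only the text placed directly inside `tag`, skipping child tags."""
    parts = [s.strip() for s in tag.children if isinstance(s, NavigableString)]
    return " ".join(p for p in parts if p)


def parent_path(tag):
    """Return ancestor names joined by dots, outermost first."""
    # The BeautifulSoup object itself is named "[document]"; leave it out.
    names = [p.name for p in tag.parents if p.name != "[document]"]
    return ".".join(reversed(names))


soup = BeautifulSoup(HTML, "html.parser")
rows = [(t.name, own_text(t), parent_path(t)) for t in soup.find_all(True)]

for name, text, parents in rows:
    print('"{}", "{}", "{}"'.format(name, text, parents))
```

With this approach the inner div reports only "this is div text", and tags without direct text report an empty string instead of None. Note that the anchor's parent chain comes out as div.table.tr.td.div, since its immediate parent is the inner div.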