 

Beautiful Soup: looping over elements and getting each element's own text (if any) and its parents

I am learning BeautifulSoup and have a webpage that has a body something like this:

html:

<div>
 <table>
 <tr>
  <td>
   <div>
     this is div text
     <a name='abc'>this is anchor text</a>
   </div>
  </td>
 </tr>
</table>
</div>

Expected result:

tag     text                   parents
===     =====                  =======
div     ""                     ""
table   ""                     div
...
div       this is div text     div.table.tr.td
a         this is anchor text  div.table.tr.td.a

I am able to get this result, but the problem is that for the div I also get the anchor text, as shown below:

div       this is div text this is anchor text     div.table.tr.td
a         this is anchor text                      div.table.tr.td.a

Below is my code

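from bs4 import BeautifulSoup

# Parse the saved page; the with-block closes the file automatically
with open("C:/abc.html", encoding="utf8") as f:
    soup = BeautifulSoup(f, "lxml")

for tag in soup.find_all():
    # Build the dotted ancestor path, outermost ancestor first
    allparent = ""
    for parenttags in tag.find_parents():
        allparent = parenttags.name + "." + allparent
    if allparent != "":
        allparent = allparent[:-1]  # drop the trailing dot
    # tag.text also includes the text of every descendant
    print(tag.name + "', '" + tag.text + "','" + allparent)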
शेखर asked Jan 23 '26 00:01


1 Answer

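You are looking for tag.find(text=True), which returns the first text node found inside the tag (or None if the tag contains no text at all).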

If tag is your <div>foo<span>bar</span></div>:

  • tag.find(text=True) will output foo
  • tag.text will output foobar (the text of all descendants, concatenated).
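A quick sketch of the difference, in case it helps. It assumes bs4 is installed and uses the built-in "html.parser" so lxml is not needed. It also uses string=True, which is the newer spelling of text=True in recent Beautiful Soup releases. Note that find() searches descendants by default, so for a tag with no text of its own it returns a child's text unless recursive=False is passed.

```python
from bs4 import BeautifulSoup

# The example tag from the bullets above
tag = BeautifulSoup("<div>foo<span>bar</span></div>", "html.parser").div

first_text = tag.find(string=True)  # first text node inside the tag
all_text = tag.text                 # text of the tag and all its descendants

# A tag with no text of its own: find() falls through to the child's text
nested = BeautifulSoup("<div><span>bar</span></div>", "html.parser").div
nested_first = nested.find(string=True)
# recursive=False restricts the search to direct children only
direct_only = nested.find(string=True, recursive=False)

print(repr(first_text), repr(all_text), repr(nested_first), repr(direct_only))
```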

So, in your case, just replace

print(tag.name+"', '"+tag.text+"','"+allparent)

by

print(tag.name+"', '"+tag.find(text=True)+"','"+allparent)

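Or better, using str.format, which also copes with the None that find() returns for a tag containing no text (the + concatenation above would raise a TypeError there):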

print('"{}", "{}", "{}"'.format(tag.name, tag.find(text=True), allparent))

Isn't that sexier?!
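If you want strictly the element's own text (and an empty string for wrappers like table or tr), one possible variation is to collect only the direct text children. This is a rough sketch under some assumptions: own_text and parent_path are helper names made up for this example, the HTML is the snippet from the question inlined as a string, and "html.parser" is used instead of "lxml" so that no html/body wrappers are added to the parent path.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined for the example
html = """<div><table><tr><td><div>
  this is div text
  <a name='abc'>this is anchor text</a>
</div></td></tr></table></div>"""


def own_text(tag):
    """Return only the text sitting directly inside tag, not in its children."""
    # recursive=False limits the search to direct children of the tag
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


def parent_path(tag):
    """Return the ancestors of tag as a dotted path, outermost first."""
    # tag.parents yields the nearest ancestor first and ends with the
    # BeautifulSoup object itself, whose name is "[document]"
    names = [p.name for p in tag.parents if p.name != "[document]"]
    return ".".join(reversed(names))


soup = BeautifulSoup(html, "html.parser")
rows = [(tag.name, own_text(tag), parent_path(tag)) for tag in soup.find_all(True)]

for row in rows:
    print('"{}", "{}", "{}"'.format(*row))
```

With this markup the inner div row carries only "this is div text", and the anchor's parent path ends in the inner div.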

Arount answered Jan 26 '26 01:01


