
Beautiful Soup tag inside tag

I'm trying to add a new link as an unordered list element.

But I can't add a tag inside another with Beautiful Soup.

from bs4 import BeautifulSoup

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

a = soup.select_one("id[class=pr]")
ntag1 = soup.new_tag("a", href="hm/test")
ntag1.string = 'TEST'
... (part with problem)
a.insert_after(ntag2)

ntag1 must stay inside "<li>", so I tried

ntag2 = ntag1.new_tag('li')
TypeError: 'NoneType' object is not callable

and with wrap():

ntag2 = ntag1.wrap('li')
ValueError: Cannot replace one element with another when theelement to be replaced is not part of a tree.

Original HTML

<id class="pr">
    </id>
    <li>
     <a href="pr/protocol">
      protocol
     </a>

Desired HTML output

<id class="pr">
</id>
<li>
 <a href="hm/test">
  TEST
 </a>
</li>
<li>
 <a href="pr/protocol">
  protocol
 </a>
</li>
Joao Vitorino asked Feb 07 '18 19:02


1 Answer

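You get the NoneType error because ntag2 = ntag1.new_tag('li') calls a method that a Tag object doesn't have: new_tag() belongs to the BeautifulSoup object. On a Tag, an unknown attribute such as .new_tag is treated as a search for a child tag with that name, which returns None, and calling None raises the TypeError.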
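
A minimal sketch of that lookup, in case it helps to see it happen. The one-tag document and the variable names `link` and `caught` are made up for illustration, and html.parser is used only so the snippet needs nothing beyond bs4:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p></p>", "html.parser")

# new_tag() lives on the BeautifulSoup object, so this works.
link = soup.new_tag("a", href="hm/test")

# On a Tag, .new_tag is looked up as a child tag called <new_tag>.
# There is no such child, so the attribute evaluates to None.
print(link.new_tag)

caught = None
try:
    link.new_tag("li")  # same as calling None("li")
except TypeError as exc:
    caught = exc
    print(type(exc).__name__, exc)
```

So the fix is simply to call new_tag() on soup rather than on the tag.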

The "Cannot replace one element with another ..." error comes from the fact that the tag you created has no association with the tree yet. It has no parent, and it must have one if you are trying to wrap it, because wrap() works by replacing the element in its parent with the wrapper.
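
As a rough illustration, assuming a flattened version of the question's markup (with the `id` tag written as a `div`) and html.parser instead of lxml: once the new anchor has been inserted into the tree it has a parent, and wrap() goes through. Note that wrap() is handed a Tag here rather than the string 'li'.

```python
from bs4 import BeautifulSoup

# Flattened stand-in for the question's HTML (an assumption for this sketch).
html = '<div class="pr"></div><li><a href="pr/protocol">protocol</a></li>'
soup = BeautifulSoup(html, "html.parser")

anchor = soup.new_tag("a", href="hm/test")
anchor.string = "TEST"
print(anchor.parent)  # the freshly created tag is detached from the tree

# Attach the anchor first, then wrap it in a new <li>.
soup.select_one("div[class=pr]").insert_after(anchor)
anchor.wrap(soup.new_tag("li"))

print(anchor.parent.name)
print([a["href"] for a in soup.find_all("a")])
```

This works, but it takes two steps where building the li first and appending the anchor takes one.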

It would make more sense to create the parent li and just append the anchor child:

from bs4 import BeautifulSoup

html = """<div class="pr">
</div>
<li>
 <a href="pr/protocol">
  protocol
 </a>
 </li>"""

soup = BeautifulSoup(html, "lxml")

a = soup.select_one("div[class=pr]")

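# Li parent
# Note: new_tag() stores class_ literally as an attribute named "class_";
# pass attrs={"class": "parent"} if you want a real class attribute.
parent = soup.new_tag("li", class_="parent")
# Child anchor
child = soup.new_tag("a", href="hm/test", class_="child")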
child.string = 'TEST'
# Append child to parent
parent.append(child)
# Insert parent
a.insert_after(parent)
print(soup.prettify())

That gives you the output you want, apart from the HTML not being valid (the li elements have no enclosing ul).
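
One caveat about the class_ keyword used above: new_tag() does not translate it the way find_all() does, so it ends up as an attribute literally named class_. If you want real class attributes, a sketch along these lines should do it, passing them through the attrs dictionary (the class names are just the ones from the snippet above, and html.parser is used here only to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="pr"></div>', "html.parser")

# attrs takes a plain dict, so the reserved word "class" is not a problem.
parent = soup.new_tag("li", attrs={"class": "parent"})
child = soup.new_tag("a", href="hm/test", attrs={"class": "child"})
child.string = "TEST"

parent.append(child)
soup.select_one("div[class=pr]").insert_after(parent)
print(soup.prettify())
```

If you don't need the classes at all, just drop them and call soup.new_tag("li").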

If you have an actual ul that comes right after a certain element, e.g.

html = """<div class="pr">
    </div>
    <ul>
        <li>
          <a href="pr/protocol">
          protocol
          </a>
         </li>
     </ul>
     """

Set a's CSS selector to "div[class=pr] + ul" and insert the parent as the first child:

a = soup.select_one("div[class=pr] + ul")
.....
a.insert(0, parent)
print(soup.prettify())

Which would give you:

<html>
 <body>
  <div class="pr">
  </div>
  <ul>
   <li class_="parent">
    <a class_="child" href="hm/test">
     TEST
    </a>
   </li>
   <li>
    <a href="pr/protocol">
     protocol
    </a>
   </li>
  </ul>
 </body>
</html>
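
For reference, here is the ul variant as one self-contained sketch. It assumes a compact version of the HTML above, leaves out the classes, and uses html.parser, so prettify() would not add the html and body wrappers that lxml adds:

```python
from bs4 import BeautifulSoup

html = '<div class="pr"></div><ul><li><a href="pr/protocol">protocol</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# The ul that directly follows the div with class "pr".
ul = soup.select_one("div[class=pr] + ul")

parent = soup.new_tag("li")
child = soup.new_tag("a", href="hm/test")
child.string = "TEST"
parent.append(child)

# Position 0 makes the new li the first child of the ul.
ul.insert(0, parent)

links = [a["href"] for a in ul.find_all("a")]
print(links)  # ['hm/test', 'pr/protocol']
```

Use ul.append(parent) instead if the new item should go at the end of the list.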

Or, if you wanted to wrap an existing tag:

from bs4 import BeautifulSoup, Tag

html = """<div class="pr">
    </div>
     <a href="pr/protocol">
          protocol
     """

soup = BeautifulSoup(html, "lxml")

a = soup.select_one("div[class=pr] + a")
a.wrap(Tag(name="div"))
print(soup.prettify())

Which would wrap the existing anchor:

<html>
 <body>
  <div class="pr">
  </div>
  <div>
   <a href="pr/protocol">
    protocol
   </a>
  </div>
 </body>
</html>
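
The wrapper can also be created with soup.new_tag(), which is the way the Beautiful Soup documentation builds new tags, and wrap() returns the new wrapper in case you need to keep working with it. A short sketch, assuming a compact version of the HTML above and wrapping in an li rather than a div:

```python
from bs4 import BeautifulSoup

html = '<div class="pr"></div><a href="pr/protocol">protocol</a>'
soup = BeautifulSoup(html, "html.parser")

anchor = soup.select_one("div[class=pr] + a")

# wrap() puts the anchor inside the new tag and returns that tag.
wrapper = anchor.wrap(soup.new_tag("li"))

print(wrapper.name, anchor.parent is wrapper)  # li True
```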
Padraic Cunningham answered Sep 17 '22 05:09