
Remove HTML block in Python

I'd like to know if there's a library or some method in Python to remove an element from an HTML document. For example:

I have this document:

<html>
      <head>
          ...
      </head>
      <body>
          <div>
           ...
          </div>
      </body>
</html>

I want to remove the <div></div> block, along with its contents, from the document so that the result looks like this:

<html>
    <head>
     ...
    </head>
    <body>
    </body>
</html>
JefersonM asked Apr 08 '26 23:04


1 Answer

You don't need a library for a simple case like this. Just use the built-in string methods.

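def removeOneTag(text, tag):
    start = text.find("<" + tag + ">")
    end = text.find("</" + tag + ">", start)  # first closing tag after the opening tag
    return text if start == -1 or end == -1 else text[:start] + text[end + len(tag) + 3:]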

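This removes everything from the first opening tag through its closing tag, including the tags themselves. It only matches a tag written exactly as <tag>, with no attributes, and it does not handle nested blocks. So your input in the example would be something like...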

x = """<html>
    <head>
      ...
    </head>
    <body>
       <div>
         ...
       </div>
    </body>
</html>"""
print(removeOneTag(x, "div"))

Then, if you want to remove ALL blocks with that tag...

tag = "div"
while "<" + tag + ">" in x and (y := removeOneTag(x, tag)) != x:  # stop once nothing changes
    x = y
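
The string search above only finds a tag written exactly as <div>, so it misses tags with attributes and can leave stray closing tags behind when blocks are nested. As a rough sketch of a parser-based alternative, the following uses only the standard library's html.parser module. The names TagStripper and remove_tag_blocks are hypothetical and not part of the original answer; the sketch assumes reasonably well-formed input and re-emits everything outside the removed blocks as it was parsed.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emit the input, skipping every <tag>...</tag> block."""

    def __init__(self, tag):
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()  # the parser reports tag names in lowercase
        self.depth = 0          # > 0 while inside a block being removed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1     # entering a (possibly nested) block to drop
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: keep them unless they are the target.
        if tag != self.tag and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            if self.depth > 0:
                self.depth -= 1
        elif self.depth == 0:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append("&#" + name + ";")

    def handle_comment(self, data):
        if self.depth == 0:
            self.out.append("<!--" + data + "-->")

    def handle_decl(self, decl):
        if self.depth == 0:
            self.out.append("<!" + decl + ">")


def remove_tag_blocks(html, tag):
    """Return html with every block of the given tag removed, contents included."""
    parser = TagStripper(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


doc = "<html><body><div class='a'>x<div>y</div></div><p>keep</p></body></html>"
print(remove_tag_blocks(doc, "div"))
```

Tracking a depth counter is what lets nested blocks of the same tag be removed as one unit. If a third-party dependency is acceptable, Beautiful Soup's decompose() method on a tag removes the tag and its contents in a single call.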
Wso answered Apr 11 '26 15:04


