Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.
Remove HTML tags from a string in Python using the BeautifulSoup module. Like lxml, BeautifulSoup provides functions for processing HTML text. To remove the tags, pass the string to the BeautifulSoup() constructor and call get_text() on the resulting object. get_text() returns only the text content, with HTML entities already decoded.
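A minimal sketch of that approach, assuming the bs4 package is installed. The helper name strip_tags and the sample string are made up for illustration. The "html.parser" backend is used only because it needs no extra install.

```python
from bs4 import BeautifulSoup


def strip_tags(html):
    """Return the text content of an HTML string (hypothetical helper name)."""
    # "html.parser" selects Python's built-in parser as the backend.
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes; entities are decoded during parsing.
    return soup.get_text()


sample = "<p>Fish &amp; <b>chips</b> &lt; 10</p>"
print(strip_tags(sample))  # Fish & chips < 10
```

get_text() also accepts separator and strip arguments if the pieces of text should be joined with spaces or trimmed.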
To scrape a webpage, first send an HTTP GET request to its URL; the server responds with the HTML content. In Python this can be done with the requests library. Then parse the response with BeautifulSoup and keep the extracted data in a structure such as a dict or list.
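The steps above could look roughly like this, assuming requests and bs4 are installed. The function names fetch_html and extract_links are hypothetical, and collecting links is only one example of data you might store.

```python
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with an HTTP GET request and return its HTML text."""
    import requests  # third-party; imported here so the parsing step works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Parse HTML and return a list of dicts, one per <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


# Parsing a literal string, so no network access is needed for this demo.
demo = '<ul><li><a href="/a">First</a></li><li><a href="/b"> Second </a></li></ul>'
print(extract_links(demo))
```

For a live page, call extract_links(fetch_html(url)) with the URL you want to scrape.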
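Use lxml, which is the best XML/HTML library for Python. Its text_content() method returns an element's text with the tags removed and the entities resolved: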
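import lxml.html

t = lxml.html.fromstring("...")  # parse the HTML string into an element
t.text_content()  # text of the element and all its descendants, without tags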
And if you just want to sanitize the HTML, look at the lxml.html.clean module (recent lxml releases distribute it as the separate lxml_html_clean package).
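If neither third-party library is available, the standard library's html.parser can do a rough version of the same job. This is not one of the approaches described above, just a sketch of an alternative. The names TextExtractor and strip_tags_stdlib are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags_stdlib(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags_stdlib("<p>Fish &amp; <b>chips</b></p>"))  # Fish & chips
```

Note that this sketch also keeps the contents of script and style elements, since they arrive as ordinary text data. A real page would need extra handling for those.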