
Filter out HTML tags and resolve entities in Python

Tags: python, html

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

akraut asked Sep 01 '08 05:09



People also ask

How do you remove HTML tags in Python?

Like the lxml module, the BeautifulSoup module provides functions for processing text data. To remove HTML tags from a string with BeautifulSoup, pass the string to the BeautifulSoup() constructor and call the get_text() method on the result.

How do I get data from HTML to Python?

Send an HTTP GET request to the URL of the web page you want to scrape; the server responds with the HTML content. This can be done with Python's Requests library. Then parse the response with BeautifulSoup and store the extracted data in a data structure such as a dict or a list.


1 Answer

Use lxml, which is the best XML/HTML library for Python.

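import lxml.html

t = lxml.html.fromstring("...")  # parse the HTML string into an element tree
text = t.text_content()  # text of the tree: tags dropped, entities resolved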

And if you just want to sanitize the HTML, look at the lxml.html.clean module (newer lxml releases ship it separately as the lxml_html_clean package).
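
If installing lxml is not an option, something similar can probably be done with the standard library alone. The following is only a minimal sketch of a different technique (html.parser instead of lxml), not the approach recommended above; the names TextExtractor and strip_tags are made up for illustration. Note that it keeps the contents of script and style elements as plain text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper name)."""

    def __init__(self):
        # convert_charrefs=True makes the parser resolve entities such as
        # &amp; or &eacute; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with tags removed and entities resolved."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Fish &amp; <b>chips</b> caf&eacute;</p>"))
# prints: Fish & chips café
```

No regular expressions are involved: the parser does the tokenizing, and the subclass only decides which pieces to keep.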

Peter Hoffmann answered Sep 19 '22 07:09
