Using a regex, you can strip everything inside <>:
import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, then you might want to write the regex as
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
This link contains more details on this.
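As a quick check of the extended pattern above, here is a minimal sketch that runs it on a made-up sample string (the sample input is an assumption, not from the original answer). Note that matched entities are removed rather than decoded, and that the pattern is case-sensitive, so an uppercase hex entity is left alone:

```python
import re

# Pattern from the answer above: tags, plus named, decimal and hex entities.
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    # Replace every match (tag or entity) with the empty string.
    return CLEANR.sub('', raw_html)

# Made-up input for illustration only.
print(cleanhtml('<p>Price:&nbsp;<b>10</b></p>'))  # prints: Price:10

# Uppercase hex digits are not covered by [0-9a-f], so this is unchanged.
print(cleanhtml('&#xA9;'))  # prints: &#xA9;
```

If uppercase entities matter for your input, compiling with re.IGNORECASE would be one way to cover them.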
You could also use the additional package BeautifulSoup to extract all the raw text.
You will need to explicitly set a parser when calling BeautifulSoup. I recommend "lxml", as mentioned in alternative answers: it is much more robust than the default one (html.parser, i.e. the one available without additional install).
from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text
However, this approach still depends on external libraries, so I recommend the first solution.
EDIT: To use lxml you need to pip install lxml.
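If the extra install is the concern, the standard library's html.parser module (the default parser mentioned above) can also be used directly, without BeautifulSoup. The following is only a rough sketch of that alternative, not part of the original answer; the names TextExtractor and strip_tags are made up for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):  # hypothetical helper name
    """Collects only the text nodes it sees while parsing."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)

def strip_tags(raw_html):  # hypothetical helper name
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b></p>'))  # prints: Hello world
```

Unlike the regex, this decodes entities instead of deleting them. It also keeps the contents of script and style elements, so those would need extra handling.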
Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:
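import xml.etree.ElementTree

def remove_tags(text):
    # fromstring() expects well-formed XML and raises ParseError otherwise
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())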
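One caveat worth illustrating: xml.etree is an XML parser, so it only accepts well-formed input with a single root element. A small sketch with made-up sample strings shows both the working case and the failure mode on typical loose HTML:

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # Parse the string as XML and join all text nodes in document order.
    return ''.join(ET.fromstring(text).itertext())

# Well-formed markup works as expected.
print(remove_tags('<p>Hello <b>world</b>!</p>'))  # prints: Hello world!

# Typical HTML with an unclosed <br> is not well-formed XML.
try:
    remove_tags('<p>line one<br>line two</p>')
except ET.ParseError as exc:
    print('not well-formed XML:', exc)
```

For real-world HTML it is therefore safer to wrap the call in a try/except, or to fall back to another method when parsing fails.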
Note that the regex below isn't perfect: something like <a title=">"> would break it. However, it's about the closest you'd get in non-library Python without a really complex function:
import re
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub('', text)
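To make the limitation concrete, here is a short sketch that runs the same pattern on a normal snippet and on the problematic attribute (both inputs are made up for illustration). The pattern stops at the first '>' it sees, even when that character sits inside a quoted attribute value:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Ordinary markup is stripped cleanly.
print(remove_tags('<p>Hello <b>world</b></p>'))  # prints: Hello world

# The '>' inside the attribute ends the match early, leaving debris behind.
print(remove_tags('<a title=">">link</a>'))  # prints: ">link
```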
However, as lvc mentions, xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:
import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
There's a simple way to do this in any C-like language. The style is not Pythonic, but it works with pure Python:
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw
You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
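For readers who prefer text to video, the sketch below repeats the function with comments on each state transition and runs it on a few made-up inputs. It is only an illustration of the two-flag state machine described above, not a full HTML parser:

```python
def remove_html_markup(s):
    tag = False    # True while we are between '<' and '>'
    quote = False  # True while we are inside a quoted attribute value
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True             # entering a tag
        elif c == '>' and not quote:
            tag = False            # leaving a tag
        elif (c == '"' or c == "'") and tag:
            quote = not quote      # toggle quoting, but only inside a tag
        elif not tag:
            out = out + c          # ordinary text: keep it

    return out

print(remove_html_markup('<p>Hello <b>world</b></p>'))  # prints: Hello world
print(remove_html_markup('<a title=">">link</a>'))      # prints: link
print(remove_html_markup('He said "hi"'))               # prints: He said "hi"
```

Because a single flag tracks both quote characters, an attribute that mixes them, such as title="it's", would confuse the machine. Tracking which quote character opened the value would be one way to handle that.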
PS: If you're interested in the class (about smart debugging with Python), here is the link: https://www.udacity.com/course/software-debugging--cs259. It's free!