I'm looking to take an HTML page and extract just the plain text on that page. Does anyone know of a good way to do that in Python?
I want to strip out literally everything and be left with just the text of the articles and whatever other text is between tags. JS, CSS, etc. should all be gone.
Thanks!
The first answer here doesn't remove the contents of style or script elements if they are embedded in the page (rather than linked). This might get closer:
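import re

def stripTags(text):
    # DOTALL lets '.' span newlines (inline scripts and styles are usually multi-line);
    # IGNORECASE also catches tags written as <SCRIPT> or <Style>.
    flags = re.DOTALL | re.IGNORECASE
    scripts = re.compile(r'<script.*?</script\s*>', flags)
    css = re.compile(r'<style.*?</style\s*>', flags)
    tags = re.compile(r'<.*?>', flags)
    # Remove script and style blocks first, then any remaining tags.
    text = scripts.sub('', text)
    text = css.sub('', text)
    text = tags.sub('', text)
    return text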
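Regular expressions can trip over things like a ">" inside an attribute value or a "<" inside a script, so here is a rough alternative sketch using the standard library's html.parser module instead. The names TextExtractor and html_to_text are made up for this example, and joining text nodes with a single space is an assumption that may split words wrapped in inline tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

Calling html_to_text(open("my_source.html").read()) should give one whitespace-normalized string; entities such as &amp;amp; are decoded because convert_charrefs is on.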
You could try the rather excellent Beautiful Soup:
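from bs4 import BeautifulSoup  # Beautiful Soup 4 (the beautifulsoup4 package)

# Read the whole file; the with-block closes it automatically.
with open("my_source.html", "r") as f:
    s = f.read()
# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(s, "html.parser")
txt = soup.body.get_text()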
But be warned: what you get back from any parsing attempt will be subject to 'mistakes'. Bad HTML, bad parsing and just general unexpected output. If your source documents are well known and well presented you should be ok, or able to at least work around idiosyncrasies in them, but if it's just general stuff found "out on the internet" then expect all kinds of weird and wonderful outliers.
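One concrete case of unexpected output: taking the text of the body will also return the source of any script or style elements that sit inside it. A minimal sketch of one way around that, assuming Beautiful Soup 4 is installed as the bs4 package (visible_text is a made-up helper name for this example):

```python
from bs4 import BeautifulSoup


def visible_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code, not prose.
    for element in soup(["script", "style"]):
        element.decompose()
    # separator=" " keeps adjacent blocks apart; strip=True trims each piece.
    return soup.get_text(separator=" ", strip=True)
```

This still won't hide everything a browser would hide (elements hidden by CSS, for example), so treat it as a starting point rather than a guarantee.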