
Best way to strip out everything but text from a webpage?

Tags: python

I'm looking to take an HTML page and extract just the plain text on that page. Does anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with just the text of the articles and whatever other text is between tags. JS, CSS, etc. should all be gone.

Thanks!

James asked Jun 04 '10 21:06



2 Answers

The first answer here doesn't remove the contents of <script> or <style> blocks when the JavaScript or CSS is embedded in the page rather than linked. This might get closer:

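import re

def stripTags(text):
  # DOTALL lets '.' match newlines, so multi-line script/style blocks are caught;
  # IGNORECASE also matches upper-case tags such as <SCRIPT>.
  flags = re.DOTALL | re.IGNORECASE
  scripts = re.compile(r'<script.*?</script>', flags)
  css = re.compile(r'<style.*?</style>', flags)
  tags = re.compile(r'<.*?>', flags)

  # Remove script and style blocks (including their bodies) first,
  # then strip whatever tags remain.
  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)

  return text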
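
A regex pass like the one above still leaves HTML entities (for example &amp;) and ragged whitespace in the result. Here is a minimal sketch of a clean-up step that could follow it, using only the standard library. The name tidy_text is made up for illustration and is not part of the answer's code.

```python
import html
import re

def tidy_text(text):
    """Hypothetical post-processing step for text that has already had its tags removed."""
    # Turn character references such as &amp; or &#39; back into plain characters.
    text = html.unescape(text)
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r'\s+', ' ', text).strip()
```

Something like tidy_text(stripTags(page_source)) would then give one whitespace-normalised string, where page_source stands for whatever HTML string you loaded.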
g.d.d.c answered Sep 22 '22 10:09


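You could try the rather excellent Beautiful Soup:

# Assumes Beautiful Soup 4 (the bs4 package); the older BeautifulSoup 3 module only runs on Python 2.
from bs4 import BeautifulSoup

# The with block closes the file even if reading fails.
with open("my_source.html", "r") as f:
    s = f.read()
soup = BeautifulSoup(s, "html.parser")
txt = soup.body.get_text()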

But be warned: what you get back from any parsing attempt will be subject to 'mistakes': bad HTML, bad parsing, and generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's just general stuff found "out on the internet", expect all kinds of weird and wonderful outliers.
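
If installing a third-party parser isn't an option, the standard library's html.parser module can do a similar job, and it doesn't raise on unclosed tags. The following is only a rough sketch: the names TextExtractor and html_to_text are made up for illustration, and it assumes that skipping <script> and <style> contents is all the filtering you need.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse whitespace. Note that this
        # also puts a space where an inline tag splits a single word.
        return " ".join(" ".join(self._chunks).split())

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

For real-world pages you would still want to spot-check the output, for the reasons given above.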

pycruft answered Sep 20 '22 10:09