Best way to strip out everything but text from a webpage?

Tags: python

I'm looking to take an HTML page and extract just the plain text on that page. Does anyone know of a good way to do that in Python?

I want to strip out literally everything and be left with just the text of the articles and whatever other text sits between tags. JS, CSS, etc. should all be gone.

Thanks!

asked Jun 04 '10 21:06

James

2 Answers

The first answer here doesn't remove the body of CSS or JavaScript tags if they are in the page (not linked). This might get closer:

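import re

def stripTags(text):
  # DOTALL lets '.' match newlines, so multi-line <script>/<style> bodies are removed;
  # IGNORECASE also catches upper-case tags such as <SCRIPT>.
  flags = re.DOTALL | re.IGNORECASE
  scripts = re.compile(r'<script.*?</script>', flags)
  css = re.compile(r'<style.*?</style>', flags)
  tags = re.compile(r'<.*?>', flags)

  # Drop script and style blocks first, then any remaining tags.
  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)

  return text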
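
Text stripped this way still contains HTML entities (such as `&amp;`) and whatever whitespace the markup had. A minimal follow-up sketch, assuming Python 3 and using only the standard library; the name `tidy_text` is hypothetical and not part of the answer above:

```python
import html
import re

def tidy_text(stripped):
    """Decode entities and collapse whitespace in text whose tags are already removed."""
    # Turn entities such as &amp; and &nbsp; into the characters they stand for.
    decoded = html.unescape(stripped)
    # Collapse runs of spaces, tabs and non-breaking spaces within each line.
    lines = (re.sub(r'[ \t\xa0]+', ' ', line).strip() for line in decoded.splitlines())
    # Drop lines that end up empty.
    return '\n'.join(line for line in lines if line)
```

Calling `tidy_text(stripTags(page_source))` would chain the two steps, where `page_source` is assumed to hold the raw HTML as a string.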

answered Sep 22 '22 10:09

g.d.d.c

You could try the rather excellent Beautiful Soup:

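# Assumes Beautiful Soup 4 (the bs4 package); the original snippet used the older BeautifulSoup 3 module.
from bs4 import BeautifulSoup

with open("my_source.html", "r") as f:
    s = f.read()
soup = BeautifulSoup(s, "html.parser")
txt = soup.body.get_text()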

But be warned: what you get back from any parsing attempt will be subject to 'mistakes': bad HTML, bad parsing and generally unexpected output. If your source documents are well known and well formed you should be OK, or at least able to work around their idiosyncrasies, but if it's general stuff found "out on the internet" then expect all kinds of weird and wonderful outliers.
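
If installing a third-party package is not an option, a parser-based extractor can also be sketched with the standard library's `html.parser`. This is only an illustration under the assumption of Python 3; the names `TextExtractor` and `extract_text` are made up for this example, and the same caveats about messy real-world HTML apply:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes while ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Joining the chunks with a single space is a simplification; joining with newlines per block-level element would preserve more of the page's layout.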

answered Sep 20 '22 10:09

pycruft
