I am trying to use BeautifulSoup to get text from web pages.
Below is the script I've written to do so. It takes two arguments: the first is the input HTML or XML file, the second the output file.
import sys
from bs4 import BeautifulSoup

def stripTags(s):
    # Let BeautifulSoup pick a parser and return all text in the document.
    return BeautifulSoup(s).get_text()

def stripTagsFromFile(inFile, outFile):
    open(outFile, 'w').write(stripTags(open(inFile).read()).encode("utf-8"))

def main(argv):
    if len(argv) != 3:
        print 'Usage:\t\t', argv[0], 'input.html output.txt'
        return 1
    stripTagsFromFile(argv[1], argv[2])
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))
Unfortunately, for many web pages, for example http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location, I get something like this (I'm showing only the first few lines):
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
Education Manager Job In London With Caleeda | Great Jobs In Teaching
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-15255540-21']);
_gaq.push(['_trackPageview']);
_gaq.push(['_trackPageLoadTime']);
Is there anything wrong with my script? I tried passing 'xml' as the second argument to BeautifulSoup's constructor, as well as 'html5lib' and 'lxml', but it doesn't help. Is there an alternative to BeautifulSoup that would work better for this task? All I want is to extract the text that would be rendered in a browser for this web page.
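For example, one of the variants I tried looked like this (the output was the same):

def stripTags(s):
    # Same as above, but with an explicit parser instead of the default.
    return BeautifulSoup(s, 'lxml').get_text()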
Any help will be much appreciated.
nltk's clean_html() is quite good at this!
Assuming that you already have your html stored in a variable html, like

html = urllib.urlopen(address).read()
then just use
import nltk
clean_text = nltk.clean_html(html)
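Putting those pieces together, a minimal runnable sketch (Python 2, and an nltk 2.x version where clean_html still exists; the URL is just the page from the question):

import urllib
import nltk

# Fetch the page and strip it down to its visible text.
address = "http://www.greatjobsinteaching.co.uk/career/134112/Education-Manager-Location"
html = urllib.urlopen(address).read()
clean_text = nltk.clean_html(html)
print clean_text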
UPDATE

Support for clean_html and clean_url will be dropped in future versions of nltk. Please use BeautifulSoup for now... it's very unfortunate.
An example of how to achieve this is on this page:
BeatifulSoup4 get_text still has javascript
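In short, the trick described there is to remove the <script> and <style> elements before calling get_text(). A minimal sketch, assuming bs4 and Python 2 as in the question (the tag list and the whitespace handling are my own choices):

from bs4 import BeautifulSoup

def visible_text(html):
    soup = BeautifulSoup(html)
    # <script> and <style> contents are never rendered as text in a browser,
    # so drop those elements from the tree before extracting text.
    for element in soup(["script", "style"]):
        element.extract()
    # Strip blank lines left behind by the removed elements.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)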