
Python strategy for extracting text from malformed html pages


I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts, which makes this difficult. I'm also on a shared hosting environment, so I can install any Python library, but I can't install arbitrary software on the server.

pyparsing and html2text.py also did not seem to work for malformed html pages.

Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html

My current implementation is approximately the following:

# Try using BeautifulSoup 3.0.7a
import BeautifulSoup
from BeautifulSoup import Comment

soup = BeautifulSoup.BeautifulSoup(s)  # s holds the raw HTML string
# Remove HTML comments so they don't end up in the extracted text
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
# Collect every text node under <body> and join them
body = soup.body(text=True)
text = ''.join(body)
# If BeautifulSoup can't handle it,
# alter the html by finding the 1st instance of "<body" and replacing everything before it with "<html><head></head>",
# then try BeautifulSoup again with the new html
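
The fallback described in the comments above (cut everything before the first "<body" and substitute a minimal head) could look roughly like this. It is only a sketch: the function name `trim_to_body` is hypothetical, and the case-insensitive match is an assumption, since malformed pages often write `<BODY>`.

```python
import re


def trim_to_body(html):
    """Replace everything before the first "<body" with "<html><head></head>"."""
    # Case-insensitive search for the start of the body tag (assumption:
    # the page may spell it <BODY> or <Body>)
    match = re.search(r"<body", html, re.IGNORECASE)
    if match is None:
        # No body tag found: return the input unchanged
        return html
    return "<html><head></head>" + html[match.start():]


broken = "<script>var x = '<';</scrip <BODY><p>Hello</p></body></html>"
fixed = trim_to_body(broken)
print(fixed)
```

The result can then be handed back to the parser for a second attempt.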

If BeautifulSoup still does not work, I fall back on a heuristic: I look at the first and last characters of each line to see whether it looks like a line of code (#, <, ;), then take a sample of the line and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
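
A minimal sketch of that line heuristic, under several assumptions: the tiny `COMMON_WORDS` set stands in for a real English word list, the 0.3 threshold and 20-token sample are arbitrary, and treating a suspicious first or last character as sufficient on its own is one possible reading of the description. All names here are hypothetical.

```python
# Hypothetical stand-in for a real English word list (e.g. a dictionary file)
COMMON_WORDS = {"the", "a", "of", "to", "and", "in", "is", "said",
                "on", "for", "that", "was"}
# Characters named in the question as hints that a line is code
CODE_EDGE_CHARS = "#<;"


def looks_like_code(line, words=COMMON_WORDS, min_ratio=0.3, sample_size=20):
    """Guess whether a line is code rather than English text."""
    stripped = line.strip()
    if not stripped:
        return False
    # Step 1: check the first and last characters
    if stripped[0] in CODE_EDGE_CHARS or stripped[-1] in CODE_EDGE_CHARS:
        return True
    # Step 2: sample the leading tokens and count words or numbers
    tokens = stripped.split()[:sample_size]
    hits = 0
    for token in tokens:
        cleaned = token.strip(".,!?\"'()").lower()
        if cleaned.isdigit() or cleaned in words:
            hits += 1
    # Too few recognisable tokens: treat the line as code
    return hits / len(tokens) < min_ratio


print(looks_like_code("var x = document.getElementById('a');"))
print(looks_like_code("The president said that the economy is improving."))
```

With a realistic word list the threshold would need tuning against sample pages.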

I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning machines), and of course write it as well.

Any advice, tools, or strategies would be most welcome. I also realize that the last part is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains a small amount of actual English text.

Johnny4000 asked Oct 23 '09 18:10


1 Answer

Try not to laugh, but:

class TextFormatter:
    def __init__(self,lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx, 

I hope you've got lynx!
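
The snippet above is cut off, so what follows is not the answer's original code, only a sketch of how piping HTML through lynx might look in Python 3. The option set (`-dump`, `-stdin`, and the two charset options) and the helper name `build_lynx_command` are assumptions; check them against the lynx installed on your host.

```python
from subprocess import PIPE, Popen


def build_lynx_command(lynx="/usr/bin/lynx"):
    """Build the lynx argument list (the flags are an assumed choice)."""
    # -dump prints the rendered page as plain text; -stdin reads HTML from stdin.
    # The charset options assume the input and output are UTF-8.
    return [lynx, "-assume_charset=UTF-8", "-display_charset=UTF-8",
            "-dump", "-stdin"]


def html2text(unicode_html_source, lynx="/usr/bin/lynx"):
    """Expects str; returns str. Requires lynx to exist at the given path."""
    process = Popen(build_lynx_command(lynx), stdin=PIPE, stdout=PIPE)
    output, _ = process.communicate(input=unicode_html_source.encode("utf-8"))
    # Replace undecodable bytes instead of failing on odd output
    return output.decode("utf-8", errors="replace")
```

Calling `html2text("<p>Hello</p>")` would then return lynx's rendering of the page, assuming the binary is available on the shared host.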

Jonathan Feinberg answered Oct 05 '22 14:10
