Identifying large bodies of text via BeautifulSoup or other python based extractors

Given some random news article, I want to write a web crawler that finds the largest body of text on the page and extracts it. The intention is to extract the actual news article on the page.

The original plan was to use a ~~BeautifulSoup findAll(True)~~ command (which means extract all HTML tags) and to sort the tags by the length of their .getText() value. EDIT: don't use this for HTML work; use the lxml library instead. It's Python based and much faster than BeautifulSoup.

But this won't work for most pages, like the one I listed as an example, because the large body of text is split across many smaller tags, such as individual paragraph tags.
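One common workaround for the split-paragraph problem is to score containers instead of individual tags: add up the text of every `<p>` per enclosing block and keep the block with the most paragraph text. The sketch below is only an illustration of that idea. It uses the standard library's `html.parser` so it runs without extra installs, and the names `ParagraphGrouper`, `CONTAINERS` and `largest_text_block` are made up for this example. It assumes reasonably well-formed HTML, since a stray closing tag will throw off the container tracking.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ParagraphGrouper(HTMLParser):
    """Collect the text of <p> tags, grouped by their nearest open container."""

    # Hypothetical choice of tags that count as "containers" for grouping.
    CONTAINERS = {"div", "article", "section", "td", "main", "body"}

    def __init__(self):
        super().__init__()
        self.stack = []                  # ids of the containers currently open
        self.counter = 0                 # source of unique container ids
        self.in_p = False
        self.current = []                # text pieces of the paragraph being read
        self.texts = defaultdict(list)   # container id -> list of paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.counter += 1
            self.stack.append(self.counter)
        elif tag == "p":
            self._close_p()              # a new <p> implicitly ends an unclosed one
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_p()
        elif tag in self.CONTAINERS and self.stack:
            self._close_p()
            self.stack.pop()

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)

    def _close_p(self):
        if self.in_p:
            text = " ".join("".join(self.current).split())
            if text:
                key = self.stack[-1] if self.stack else 0
                self.texts[key].append(text)
        self.in_p = False
        self.current = []


def largest_text_block(html):
    """Return the paragraphs of the container holding the most paragraph text."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        return ""
    best = max(parser.texts.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)
```

Navigation bars and footers usually hold a few short paragraphs at most, so the article container tends to win. The same grouping can be written with BeautifulSoup or lxml by walking from each `<p>` to its parent element.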

Does anyone have any experience with this? Any help with something like this would be amazing.

At the moment I'm using BeautifulSoup along with Python, but I'm willing to explore other possibilities.

EDIT: I came back to this question a few months later (wow, I sounded like an idiot ^) and solved this with a combination of libraries and my own code.

Here are some deadly helpful Python libraries for the task, sorted by how much each one helped me:

#1 goose library: fast, powerful, consistent.
#2 readability library: content is passable; slower on average than goose but faster than boilerpipe.
#3 python-boilerpipe: slower and hard to install. That is no fault of the boilerpipe library itself (originally in Java), but of the fact that this library is built on top of another library in Java, which contributes IO time, errors, etc.
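A ranked list like this can be used as a fallback chain: try the preferred extractor first and move to the next one when it fails or returns too little text. The sketch below is one possible arrangement under stated assumptions, not the exact code used here. It assumes the maintained Python 3 fork `goose3` (with its `Goose().extract(raw_html=...)` call and the `cleaned_text` attribute); `extract_with_goose`, `first_successful` and the `min_chars` threshold are hypothetical names and values chosen for illustration.

```python
def extract_with_goose(html):
    """Return the article text goose finds in raw HTML (assumes goose3 is installed)."""
    from goose3 import Goose  # imported lazily so this module loads without goose3

    article = Goose().extract(raw_html=html)
    return article.cleaned_text


def first_successful(extractors, html, min_chars=200):
    """Try extractors in preference order; return the first result of usable length.

    `extractors` is a list of callables taking HTML and returning text.
    `min_chars` is an arbitrary cutoff for "this looks like an article".
    """
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            continue  # a missing library or a parse failure moves on to the next one
        if text and len(text.strip()) >= min_chars:
            return text.strip()
    return ""
```

Calling `first_successful([extract_with_goose, some_other_extractor], html)` would then fall through to the second extractor only when goose comes back empty or raises.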

I may release benchmarks if there is interest.

Indirectly related libraries. You should probably install them and read their docs:

NLTK text processing library: this is too good not to install. It provides text analysis tools along with HTML tools (like cleanup, etc.).
lxml HTML/XML parser: mentioned above. This beats BeautifulSoup in every aspect but usability. It's a bit harder to learn, but the results are worth it. HTML parsing takes much less time, and the difference is very noticeable.
python webscraper library: I think the value of this code isn't the lib itself, but using the lib as a reference manual to build your own crawlers/extractors. It's very nicely coded and documented!
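The speed difference between parsers is easy to check on your own pages. Here is a rough timing harness, offered as a sketch only: `parse_with_bs4`, `parse_with_lxml`, `best_time` and `compare` are hypothetical helper names, the repeat counts are arbitrary, and it assumes `beautifulsoup4` and `lxml` are installed (a parser whose import fails is simply skipped). No results are shown because they depend on the page and the machine.

```python
import timeit


def parse_with_bs4(html):
    from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

    # Uses BeautifulSoup's pure-Python backend, not its optional lxml backend.
    return BeautifulSoup(html, "html.parser")


def parse_with_lxml(html):
    import lxml.html  # assumes lxml is installed

    return lxml.html.fromstring(html)


def best_time(func, html, repeat=5, number=20):
    """Return the fastest average seconds per call over several timing runs."""
    timer = timeit.Timer(lambda: func(html))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def compare(parsers, html):
    """Map each parser name to its best per-call time; skip parsers that fail to import."""
    results = {}
    for name, func in parsers.items():
        try:
            results[name] = best_time(func, html)
        except ImportError:
            continue
    return results
```

Usage would look like `compare({"bs4": parse_with_bs4, "lxml": parse_with_lxml}, html)`, with `html` being the source of a real page you care about.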

A lot of the value and power in using Python, a rather slow language, comes from its open source libraries. They are especially awesome when combined, and everyone should take advantage of them to solve whatever problems they may have!

The goose library gets lots of solid maintenance. They just added Arabic support, which is great!

508

asked Jan 04 '13 20:01

Lucas Ou-Yang

1 Answer

You might look at the python-readability package which does exactly this for you.
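A minimal sketch of what that looks like, assuming the package is installed under its current PyPI name `readability-lxml` (imported as `readability`). `Document(html).summary()` returns the cleaned article as an HTML fragment, so the sketch also strips the tags with the standard library. The helper names `html_to_text` and `extract_with_readability` are made up for this example.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keep only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Strip tags from an HTML fragment and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def extract_with_readability(html):
    """Return (title, article_text) for a page's HTML source."""
    from readability import Document  # provided by the readability-lxml package

    doc = Document(html)
    return doc.title(), html_to_text(doc.summary())
```

You fetch the page yourself (for example with `urllib.request` or `requests`) and pass the HTML string in; the library does not need the URL.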

108

answered Oct 05 '22 02:10

Kyle Maxwell


Tags: python, beautifulsoup, web-crawler
