
How to find out the summarized text of a given URL in python / Django? [closed]

Tags: python, django

How to find out the summarized text for a given URL?

What do I mean by summarized text?

Merck $41.1 Billion Schering-Plough Bid Seeks Science

Link Description

Merck & Co.’s $41.1 billion purchase of Schering-Plough Corp. adds experimental drugs for blood clots, infections and schizophrenia and allows the companies to speed research on biotechnology drugs.

For the URL above, the few lines of text beneath the link are the summary text.
That is, a short two- to three-line description of the page, which we usually obtain by fetching the page, examining its content, and then working out a short description from the HTML markup.

Is there a good algorithm that does this? (or)
Are there any good libraries in Python/Django that do this?

Rama Vadakattu asked Mar 09 '09 15:03


2 Answers

I had the same need. Lemur has summarization capabilities, but I found it buggy to the point of being unusable. Over the weekend I used nltk to code up a summarize module in Python: https://github.com/thavelick/summarize

I took the algorithm from the Java library Classifier4J (http://classifier4j.sourceforge.net/) but used nltk and Python wherever possible.

Here is the basic usage:

>>> import summarize

A SimpleSummarizer (currently the only summarizer) makes a summary by using sentences with the most frequent words:

>>> ss = summarize.SimpleSummarizer()
>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text.'

You can specify as many sentences in the summary as you like:

>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries. A Summariser is really cool. I don't think there are any other python summarisers."
>>> ss.summarize(input, 2)
"NLTK is a python library for working human-written text.  I don't think there are any other python summarisers."

Unlike the original algorithm from Classifier4J, this summarizer works correctly with punctuation other than periods:

>>> input = "NLTK is a python library for working human-written text! Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text!'
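
For readers who want to see the idea without installing anything, here is a minimal sketch of a frequency-based summarizer of the kind described above. It is not the code from the summarize module: it uses only the standard library instead of nltk, the names `frequency_summary`, `split_sentences` and `STOP_WORDS` are made up for illustration, and the stop-word list and sentence splitter are deliberately naive. Because of those differences it may pick different sentences than the module does.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "i", "in", "is", "of", "that", "the", "to"}


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequency_summary(text, num_sentences, top_n=100):
    """Pick sentences containing the most frequent words, kept in original order."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    frequent = [w for w, _ in Counter(words).most_common(top_n)]
    sentences = split_sentences(text)

    chosen = set()
    for word in frequent:
        # Take the first sentence that mentions this frequent word.
        pattern = r"\b%s\b" % re.escape(word)
        for index, sentence in enumerate(sentences):
            if re.search(pattern, sentence.lower()):
                chosen.add(index)
                break
        if len(chosen) >= num_sentences:
            break

    return " ".join(sentences[i] for i in sorted(chosen))


if __name__ == "__main__":
    sample = ("NLTK is a python library for working human-written text! "
              "Summarize is a package that uses NLTK to create summaries.")
    print(frequency_summary(sample, 1))
```

The most frequent word here is "NLTK", so the first sentence that mentions it is selected. Splitting on `!` and `?` as well as `.` is what lets the sketch handle punctuation other than periods.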

UPDATE

I've now (finally!) released this under the Apache 2.0 license, the same license as nltk, and put the module up on github (see above). Any contributions or suggestions are welcome.

Tristan Havelick answered Oct 04 '22 01:10


Text summarization is a fairly complicated topic. If you have a need to do this in a serious way, you may wish to look at projects like Lemur (http://www.lemurproject.org/).

However, what I suspect you really want here is a text abstract. If you know which part of the document contains the body text, locate it using an HTML parsing library like BeautifulSoup, strip out the HTML, and take the first sentence or the first N characters (whichever suits best). Sort of a poor cousin's abstract-generator :-)
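
A rough sketch of that approach might look like the following. To keep it runnable without extra installs, it swaps BeautifulSoup for the standard library's `html.parser`. It also assumes the body text lives in `<p>` elements, which will not hold for every page. The names `abstract_from_html` and `_ParagraphText` and the `max_chars` parameter are made up for illustration.

```python
import re
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collect text inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
            self.chunks.append(" ")  # keep adjacent paragraphs from running together

    def handle_data(self, data):
        if self._in_p:
            self.chunks.append(data)


def abstract_from_html(html, max_chars=200):
    """Return the first sentence of the paragraph text, capped at max_chars."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())  # collapse whitespace

    # Naive sentence end: '.', '!' or '?' followed by whitespace or end of text.
    # Abbreviations such as "Corp." will cut the sentence short.
    match = re.search(r".+?[.!?](?=\s|$)", text)
    first = match.group(0) if match else text
    if len(first) > max_chars:
        first = first[:max_chars].rstrip() + "..."
    return first


if __name__ == "__main__":
    page = ("<html><body><div id='nav'>Home | About</div>"
            "<p>Text summarization is <b>hard</b>. It takes work.</p></body></html>")
    print(abstract_from_html(page))
```

The HTML string could come from something like `urllib.request.urlopen(url).read()`, decoded with the page's encoding. Fetching is left out here so the sketch stays testable offline.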

Jarret Hardie answered Oct 04 '22 01:10