How do I get started with a project on Text Summarization using NLP?

My final year engineering project requires me to build an application using Java or Python which summarizes a text document using Natural Language Processing. How do I even begin with the programming of such an application?

Based on some research, I've concluded that extraction-based summarization is my best bet, since it is less complex than abstraction-based algorithms. Even so, it would be really helpful if someone could point me in the right direction.

Hamza Moiyadi asked Jun 21 '16 08:06



2 Answers

Text summarization is still an open problem in NLP.

I would start by asking yourself what the purpose of the summary is:

  • A summary that discriminates a document from other documents
  • A summary that mines only the frequent patterns
  • A summary that covers all the topics in the document
  • etc

The purpose will influence the way you generate the summary.

As a start, you could use the NLTK framework in Python to extract basic elements from a text. For example, you can extract the most frequent words or the most frequent N-grams (sequences of N adjacent words) from the text.
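To make the frequency idea concrete, here is a minimal sketch using only the Python standard library. The regex tokenizer and the helper names (`tokenize`, `top_words`, `top_ngrams`) are my own assumptions for illustration; NLTK offers `nltk.FreqDist` and `nltk.ngrams` for the same counting, along with much better tokenizers.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple regex tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def top_words(text, k=3):
    """Return the k most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(k)

def top_ngrams(text, n=2, k=3):
    """Return the k most frequent n-grams (tuples of n adjacent words)."""
    tokens = tokenize(text)
    # Zipping the token list against shifted copies of itself yields every run of n adjacent tokens.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

sample = "the cat sat on the mat. the cat ate the fish."
print(top_words(sample, 2))      # [('the', 4), ('cat', 2)]
print(top_ngrams(sample, 2, 1))  # [(('the', 'cat'), 2)]
```

Note that this sketch counts n-grams across sentence boundaries and does not remove stop words such as "the"; in a real summarizer you would probably want to handle both.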

Another simple way to extract the most relevant sentences is TF-IDF, which stands for term frequency-inverse document frequency. Basically, this function gives higher scores to words that appear frequently in one document but rarely in the other documents; sentences can then be ranked by the scores of the words they contain.
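One possible way to turn TF-IDF into an extractive summarizer is sketched below, again with only the standard library. Everything here is an assumption made for illustration: the naive sentence splitter, the smoothed IDF formula, scoring a sentence by the average weight of its words, and the helper names (`tfidf_weights`, `summarize`). It is a sketch of the idea, not a reference implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tfidf_weights(document, corpus):
    """Weight each word of `document` by TF-IDF against `corpus` (a list of texts)."""
    tf = Counter(tokenize(document))
    total = sum(tf.values())
    doc_sets = [set(tokenize(d)) for d in corpus]
    weights = {}
    for word, count in tf.items():
        df = sum(1 for words in doc_sets if word in words)
        # Smoothed IDF (an assumption of this sketch): avoids division by zero
        # and gives 0 to words that occur in every document of the corpus.
        idf = math.log((1 + len(corpus)) / (1 + df))
        weights[word] = (count / total) * idf
    return weights

def summarize(document, corpus, k=1):
    """Return the k highest-scoring sentences of `document`, in their original order."""
    weights = tfidf_weights(document, corpus)
    sentences = split_sentences(document)

    def score(sentence):
        words = tokenize(sentence)
        # Average word weight, so long sentences are not favoured just for their length.
        return sum(weights.get(w, 0.0) for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

corpus = [
    "The weather is nice today. Solar panels convert sunlight into electricity.",
    "The weather is cold today. The match is tonight.",
    "The weather is mild today. The train is late.",
]
print(summarize(corpus[0], corpus, k=1))
# ['Solar panels convert sunlight into electricity.']
```

The sentence about the weather loses because most of its words occur in every document of the corpus and therefore get an IDF of zero, while the words of the second sentence are specific to the first document.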

Some Python libraries that you can use:

  • scikit-learn, which has more advanced features.
  • The gensim library, which has a text summarization tutorial (also in Python).
  • Dato, which has a text analysis module as well.

Some helpful resources:

  • This book: Foundations of Statistical Natural Language Processing
  • There is also a Coursera course that you can enroll in to understand the basics of text mining: https://www.coursera.org/learn/text-mining
  • Also this Coursera course from Stanford University (TF-IDF is explained in one of the videos): https://class.coursera.org/nlp/lecture/preview

Hope this helps.

sel answered Nov 10 '22 20:11


These days, using neural networks to summarize text is considered state of the art.

Here is an article worth reading: A Neural Attention Model for Sentence Summarization, http://www.aclweb.org/anthology/D15-1044

aerin answered Nov 10 '22 19:11