My final year engineering project requires me to build an application using Java or Python which summarizes a text document using Natural Language Processing. How do I even begin with the programming of such an application?
Based on some research, I've noted that extraction-based summarization will be the best bet for me, since it isn't as complex as abstraction-based algorithms. Even then, it'd be really helpful if someone could point me in the right direction on how to go about this.
Text summarization using the frequency method
In this method we first tokenize the text, then count the frequency of every word and store the word/frequency pairs in a dictionary. Sentences that contain more of the high-frequency words are the ones kept in the final summary.
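The steps above can be sketched in plain Python (the function name and the naive regex tokenization are my own choices; a real project would use a proper tokenizer):

```python
from collections import Counter
import re

def frequency_summarize(text, num_sentences=2):
    """Score sentences by the document-wide frequency of their words."""
    # Naive sentence split on ., !, ? -- replace with a real tokenizer later.
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)  # word -> frequency over the whole document

    # Score = sum of word frequencies, normalized by sentence length
    # so that long sentences don't automatically win.
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Emit the selected sentences in their original order.
    return ". ".join(s for s in sentences if s in top) + "."

text = ("NLP is a field of AI. NLP helps machines read text. "
        "Cats are animals. Summarization with NLP selects key sentences.")
print(frequency_summarize(text))
```

In practice you would also drop stop words ("the", "is", ...) before counting, or they will dominate the frequency table.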
LSA (Latent Semantic Analysis)
Latent Semantic Analysis is an unsupervised learning algorithm that can be used for extractive text summarization.
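A minimal sketch of the LSA idea, assuming you build a term-sentence matrix and use NumPy's SVD (the function name and the simple "take the top singular vector" selection rule are my own simplifications; published LSA summarizers use more refined selection schemes):

```python
import re
import numpy as np

def lsa_summarize(sentences, num_sentences=2):
    """Pick sentences with the largest weight on the top latent topic."""
    # Term-sentence count matrix A (rows: terms, columns: sentences).
    vocab = sorted({w for s in sentences for w in re.findall(r"[a-z']+", s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in re.findall(r"[a-z']+", s.lower()):
            A[index[w], j] += 1.0

    # SVD: A = U @ diag(S) @ Vt.  Row k of Vt gives each sentence's
    # weight on latent topic k.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Simplest rule: rank sentences by |weight| on the strongest topic.
    scores = np.abs(Vt[0])
    top = set(np.argsort(scores)[::-1][:num_sentences])
    return [s for j, s in enumerate(sentences) if j in top]
```

The SVD step is what distinguishes LSA from plain frequency counting: it groups words that co-occur into latent topics, so a sentence can score highly even without repeating the exact top words.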
Text summarization is still an open problem in NLP.
I guess that you might start by asking yourself what the purpose of the summary is, because this will influence the way you generate it.
But as a start you could use the NLTK framework in Python to extract basic elements from a text. For example, you can extract the most frequent words, or the most frequent N-grams (N adjacent words), from the text.
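NLTK provides this via `nltk.FreqDist` and `nltk.ngrams`; the idea itself needs only the standard library, as in this sketch (the function name and regex tokenization are my own):

```python
from collections import Counter
import re

def top_terms(text, n=1, k=3):
    """Return the k most frequent n-grams (n adjacent words) in text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Sliding window of n consecutive words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

text = "the cat sat on the mat and the cat slept"
print(top_terms(text, n=1, k=2))  # most frequent single words
print(top_terms(text, n=2, k=2))  # most frequent bigrams
```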
Also, a simple way to extract the most relevant sentences is to use TF-IDF, which stands for term frequency, inverse document frequency. Basically, this scoring gives higher weight to terms that appear frequently in one document but rarely across the rest of the corpus, so sentences containing such terms are good summary candidates.
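As a sketch, you can treat each sentence as its own "document" and rank sentences by the average TF-IDF of their words (the function name is my own; scikit-learn's `TfidfVectorizer` does the weighting for you in a real project):

```python
import math
import re
from collections import Counter

def tfidf_rank(sentences):
    """Rank sentences by the average TF-IDF weight of their words."""
    docs = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(docs)
    # df[w]: number of sentences that contain word w.
    df = Counter(w for d in docs for w in set(d))
    scored = []
    for s, d in zip(sentences, docs):
        tf = Counter(d)
        # idf = log(n / df) is high for words that appear in few sentences,
        # and zero for words that appear in every sentence.
        total = sum((tf[w] / len(d)) * math.log(n / df[w]) for w in tf)
        scored.append((s, total / max(len(tf), 1)))
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

Here the sentence with the rarest vocabulary scores highest; words shared by all sentences contribute nothing.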
Some Python libraries that you can use:
Some helpful resources:
Hope this helps.
These days, using neural networks to summarize text is considered the state of the art.
Here is an article worth reading: A Neural Attention Model for Sentence Summarization (http://www.aclweb.org/anthology/D15-1044).