
Text summarization: how to choose the right n-gram size

I am working on text summarization. Using the nltk library, I am able to extract unigrams, bigrams, and trigrams and order them by frequency.
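For reference, here is a minimal sketch of the frequency-based extraction described above. It uses only the standard library so it runs without nltk; `nltk.ngrams(tokens, n)` yields the same kind of tuples. The regex tokenizer and the names `tokenize`, `ngrams` and `ngram_frequencies` are assumptions made for illustration, not part of my actual code.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simple lowercase alphanumeric tokenizer stands in
    # for whatever tokenizer is really used (e.g. nltk.word_tokenize).
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_frequencies(text, max_n=3):
    # Map each n-gram size (1..max_n) to a Counter of n-gram tuples.
    tokens = tokenize(text)
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


if __name__ == "__main__":
    freqs = ngram_frequencies(
        "LexRank is an algorithm essentially identical to TextRank, "
        "and LexRank simply focused on summarization."
    )
    for n, counter in freqs.items():
        # Most frequent n-grams of each size, ordered by raw count.
        print(n, counter.most_common(3))
```

Ordering by raw count like this is exactly what does not work well for me: frequent unigrams tend to crowd out the more informative bigrams and trigrams.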

As I am very new to this area (NLP), I was wondering whether there is a statistical model that would let me automatically choose the right N-gram size. By size I mean the length of the N-gram: one word (unigram), two words (bigram), or three words (trigram).

For example, let's say I have this text that I want to summarize, and as a summary I will keep just the 5 most relevant N-grams:

"A more principled way to estimate sentence importance is using random walks 
and eigenvector centrality. LexRank[5] is an algorithm essentially identical 
to TextRank, and both use this approach for document summarization. The two 
methods were developed by different groups at the same time, and LexRank 
simply focused on summarization, but could just as easily be used for
keyphrase extraction or any other NLP ranking task." (Wikipedia)

Then as output I want to have: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task".

In other words, my question is: how can I infer that a unigram is more relevant than a bigram or trigram, or vice versa? (Using frequency alone as the measure of an N-gram's relevance does not give me the results I want.)

Can anyone point me to a research paper, an algorithm, or a course where such a method has already been used or explained?

Thank you in advance.

sel asked Jan 21 '15 16:01


1 Answer

If you have a corpus, you can try topic modeling techniques (such as the Biterm Topic Model) to infer the terms most relevant to a given topic; the terms can be n-grams as well as single words. This gives you a probabilistic estimate of relevance, which is worth trying since, as you mentioned, simply counting frequencies did not yield good results.
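To make the idea a bit more concrete: the Biterm Topic Model works on biterms, i.e. unordered pairs of words that co-occur in the same short context, collected over the whole corpus. The sketch below shows only that extraction step, not the topic inference itself. The function names and the choice to treat each document as a single context are assumptions for illustration; for longer documents a sliding window is commonly used instead.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    # Every unordered pair of token positions in one short context.
    # Sorting each pair makes ("a", "b") and ("b", "a") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    # Aggregate biterm counts over all documents (each a list of tokens).
    counts = Counter()
    for tokens in docs:
        counts.update(biterms(tokens))
    return counts


if __name__ == "__main__":
    # Tokens may themselves be n-grams joined into one term,
    # e.g. "random_walks" (an illustrative convention, not a requirement).
    docs = [
        ["random_walks", "eigenvector", "centrality"],
        ["lexrank", "textrank", "summarization"],
        ["lexrank", "summarization", "keyphrase_extraction"],
    ]
    print(corpus_biterm_counts(docs).most_common(3))
```

A topic model would then be fitted on these biterms, and the highest-probability terms per topic become your candidate keyphrases.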

Of course, this approach assumes lemmatization and stop word removal as preprocessing steps.
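A rough sketch of that preprocessing, assuming a tiny hand-written stop word list and a pluggable `lemmatize` callable (both hypothetical placeholders). In practice you would likely use nltk's stop word corpus and a real lemmatizer such as `nltk.stem.WordNetLemmatizer`, which require the corresponding nltk data to be downloaded.

```python
import re

# Assumption: a deliberately small stop word list for illustration only.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for",
    "in", "is", "of", "or", "the", "to", "were",
}


def preprocess(text, lemmatize=lambda word: word):
    # Lowercase, tokenize, drop stop words, then lemmatize what is left.
    # The default lemmatize is the identity; pass a real lemmatizer's
    # method here (e.g. WordNetLemmatizer().lemmatize) if available.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lemmatize(token) for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The two methods were developed by different groups"))
```

The cleaned token list is what you would then turn into n-grams and feed to the topic model.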

Felipe Martins Melo answered Sep 27 '22 23:09