I am working on summarizing texts. Using the nltk library, I am able to extract unigrams, bigrams, and trigrams and order them by frequency.
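Here is roughly what I am doing so far (a minimal sketch; the sample text and the `most_common(5)` cutoff are just placeholders):

```python
from collections import Counter

import nltk
from nltk.util import ngrams

nltk.download("punkt")  # tokenizer model used by nltk.word_tokenize

# Placeholder text; in practice this is the document to summarize.
text = ("LexRank is an algorithm essentially identical to TextRank, "
        "and both use this approach for document summarization.")

# Lowercase and keep alphabetic tokens only.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Count unigrams, bigrams, and trigrams, then order each by frequency.
for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    print(n, counts.most_common(5))
```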
As I am very new to this area (NLP), I was wondering whether I can use a statistical model that will allow me to automatically choose the right size of n-gram (by size I mean the length of the n-gram: one word for a unigram, two words for a bigram, or three words for a trigram).
For example, let's say I have this text that I want to summarize, and as a summary I will keep just the 5 most relevant n-grams:
"A more principled way to estimate sentence importance is using random walks
and eigenvector centrality. LexRank[5] is an algorithm essentially identical
to TextRank, and both use this approach for document summarization. The two
methods were developed by different groups at the same time, and LexRank
simply focused on summarization, but could just as easily be used for
keyphrase extraction or any other NLP ranking task." (Wikipedia)
Then as an output I want to have: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task".
In other words, my question is: how can I infer that a unigram will be more relevant than a bigram or trigram? (Using just frequency as a measure of the relevance of an n-gram will not give me the results that I want.)
Can anyone point me to a research paper, an algorithm, or a course where such a method has already been used or explained?
Thank you in advance.
Considering that you have a corpus, you can try using topic modeling techniques (such as the Biterm topic model) to help you infer the terms most relevant to a given topic, where your terms can also be n-grams. This would be a probabilistic approximation, since, as you mentioned, simply counting frequencies did not yield good results.
Of course, this approach assumes that lemmatization and stopword removal have been applied first.
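For illustration, here is a minimal sketch of that pipeline. It uses gensim's LDA as a stand-in for Biterm (which is not available in nltk or gensim), and gensim's Phrases to merge frequent word pairs into single n-gram tokens; the toy corpus and all parameter values are placeholders you would tune on real data.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Toy corpus: replace with your own documents.
docs = [
    "LexRank is an algorithm used for document summarization.",
    "TextRank uses random walks for document summarization.",
    "Random walks can also rank keyphrases for extraction.",
]

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Tokenize, lowercase, keep alphabetic tokens, remove stopwords, lemmatize.
tokenized = [
    [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(doc.lower())
     if t.isalpha() and t not in stop]
    for doc in docs
]

# Merge frequent word pairs into single tokens such as "document_summarization",
# so the topic model can rank bigrams alongside unigrams. The thresholds are
# set low only because this toy corpus is tiny.
bigrams = Phrases(tokenized, min_count=1, threshold=1)
tokenized = [bigrams[doc] for doc in tokenized]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

# The highest-probability terms per topic are the candidate "most relevant" n-grams.
for topic_id in range(lda.num_topics):
    print(topic_id, [term for term, _ in lda.show_topic(topic_id, topn=5)])
```

On a real corpus you would raise `min_count` and `threshold` so that only genuinely frequent collocations (e.g. "random walks") are merged, and pick `num_topics` by inspecting topic coherence.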