I am working on summarizing texts. Using the nltk library, I am able to extract unigrams, bigrams, and trigrams and order them by frequency.
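Here is roughly what I am doing so far (a minimal sketch; the sample text and the `most_common(5)` cutoff are just placeholders):

```python
from collections import Counter

import nltk
from nltk.util import ngrams

nltk.download("punkt")  # tokenizer model used by nltk.word_tokenize

# Placeholder text; in practice this is the document to summarize.
text = ("LexRank is an algorithm essentially identical to TextRank, "
        "and both use this approach for document summarization.")

# Lowercase and keep alphabetic tokens only.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Count unigrams, bigrams, and trigrams, then order each by frequency.
for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    print(n, counts.most_common(5))
```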
As I am very new to this area (NLP), I was wondering whether I can use a statistical model that will allow me to automatically choose the right size of n-gram (by size I mean the length of the n-gram: one word for a unigram, two words for a bigram, or three words for a trigram).
For example, let's say I have this text that I want to summarize, and as a summary I will keep just the 5 most relevant n-grams:
"A more principled way to estimate sentence importance is using random walks
and eigenvector centrality. LexRank[5] is an algorithm essentially identical
to TextRank, and both use this approach for document summarization. The two
methods were developed by different groups at the same time, and LexRank
simply focused on summarization, but could just as easily be used for
keyphrase extraction or any other NLP ranking task." (Wikipedia)
Then as an output I want to have: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task".
In other words, my question is: how can I infer that a unigram will be more relevant than a bigram or trigram? (Using just frequency as a measure of the relevance of an n-gram will not give me the results that I want.)
Can anyone point me to a research paper, an algorithm, or a course where such a method has already been used or explained?
Thank you in advance.
Considering that you have a corpus, you can try using topic modeling techniques (such as the Biterm topic model) to help you infer the terms most relevant to a given topic, where your terms can also be n-grams. This would be a probabilistic approximation, since, as you mentioned, simply counting frequencies did not yield good results.
Of course, this approach assumes that lemmatization and stopword removal have been applied first.
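For illustration, here is a minimal sketch of that pipeline. It uses gensim's LDA as a stand-in for Biterm (which is not available in nltk or gensim), and gensim's Phrases to merge frequent word pairs into single n-gram tokens; the toy corpus and all parameter values are placeholders you would tune on real data.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Toy corpus: replace with your own documents.
docs = [
    "LexRank is an algorithm used for document summarization.",
    "TextRank uses random walks for document summarization.",
    "Random walks can also rank keyphrases for extraction.",
]

stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

# Tokenize, lowercase, keep alphabetic tokens, remove stopwords, lemmatize.
tokenized = [
    [lemmatizer.lemmatize(t) for t in nltk.word_tokenize(doc.lower())
     if t.isalpha() and t not in stop]
    for doc in docs
]

# Merge frequent word pairs into single tokens such as "document_summarization",
# so the topic model can rank bigrams alongside unigrams. The thresholds are
# set low only because this toy corpus is tiny.
bigrams = Phrases(tokenized, min_count=1, threshold=1)
tokenized = [bigrams[doc] for doc in tokenized]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

# The highest-probability terms per topic are the candidate "most relevant" n-grams.
for topic_id in range(lda.num_topics):
    print(topic_id, [term for term, _ in lda.show_topic(topic_id, topn=5)])
```

On a real corpus you would raise `min_count` and `threshold` so that only genuinely frequent collocations (e.g. "random walks") are merged, and pick `num_topics` by inspecting topic coherence.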