I've read a paper that uses n-gram counts as features for a classifier, and I was wondering what exactly this means.
Example text: "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"
I can create unigrams, bigrams, trigrams, etc. out of this text, where I first have to define on which "level" to create these n-grams. The "level" can be character, syllable, word, ...
So creating unigrams out of the sentence above would simply create a list of all words?
Creating bigrams would result in word pairs, bringing together words that follow each other?
So if the paper talks about ngram counts, it simply creates unigrams, bigrams, trigrams, etc. out of the text, and counts how often which ngram occurs?
Is there an existing method in Python's nltk package, or do I have to implement a version of my own?
N-grams are contiguous sequences of words, symbols, or tokens in a document. In technical terms, they can be defined as neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram). Well, that wasn't very interesting or exciting. True, but we still have to look at the probability used with n-grams, which is quite interesting.
n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
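For the probability side mentioned above, here's a minimal sketch (the toy sentence and variable names are purely illustrative) that estimates P(next word | previous word) from bigram counts using a maximum-likelihood estimate:

from collections import Counter
from nltk import bigrams

# toy corpus, purely illustrative
tokens = "i write on medium i write a medium blog post".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

# maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
p = bigram_counts[("i", "write")] / unigram_counts["i"]
print(p)  # 1.0 here, since "i" is always followed by "write" in this toy corpus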
I found my old code, maybe it's useful.
import nltk
from nltk import bigrams
from nltk import trigrams
text="""Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam tempus vitae. Morbi justo mauris,
congue sit amet imperdiet ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam"""
# split the text into tokens
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]  # lowercased tokens, i.e. the unigrams
bi_tokens = list(bigrams(tokens))
tri_tokens = list(trigrams(tokens))

# print trigram counts
print([(item, tri_tokens.count(item)) for item in sorted(set(tri_tokens))])
>>>
[(('adipiscing', 'elit.', 'nullam'), 2), (('amet', 'consectetur', 'adipiscing'), 2),(('amet', 'imperdiet', 'ipsum'), 1), (('congue', 'sit', 'amet'), 1), (('consectetur', 'adipiscing', 'elit.'), 2), (('diam', 'tempus', 'vitae.'), 1), (('dolor', 'sit', 'amet'), 2), (('elit.', 'nullam', 'ornare'), 2), (('imperdiet', 'ipsum', 'dolor'), 1), (('ipsum', 'dolor', 'sit'), 2), (('justo', 'mauris', 'congue'), 1), (('lacus', 'quis', 'pellentesque'), 2), (('lorem', 'ipsum', 'dolor'), 1), (('mauris', 'congue', 'sit'), 1), (('morbi', 'justo', 'mauris'), 1), (('nullam', 'ornare', 'tempor'), 2), (('ornare', 'tempor', 'lacus'), 2), (('pellentesque', 'diam', 'tempus'), 1), (('quis', 'pellentesque', 'diam'), 2), (('sit', 'amet', 'consectetur'), 2), (('sit', 'amet', 'imperdiet'), 1), (('tempor', 'lacus', 'quis'), 2), (('tempus', 'vitae.', 'morbi'), 1), (('vitae.', 'morbi', 'justo'), 1)]
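To the question about an existing method: nltk.util.ngrams generates n-grams for arbitrary n, and nltk.FreqDist (a Counter subclass) counts them in one pass, so you don't need to call count() per item. A minimal sketch, reusing the tokens from the snippet above:

from nltk import FreqDist
from nltk.util import ngrams

# count trigrams in one pass; change the 3 for any other n
tri_counts = FreqDist(ngrams(tokens, 3))
print(tri_counts.most_common(5))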
When you count n-grams, it's better to use a hash table (dictionary) rather than calling count() repeatedly. For the above example:
unigrams = {}
for token in tokens:
    if token not in unigrams:
        unigrams[token] = 1
    else:
        unigrams[token] += 1
This gives you time complexity O(n).
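Equivalently, collections.Counter from the standard library does the same O(n) counting and works for bigrams and trigrams as well:

from collections import Counter
from nltk import bigrams, trigrams

unigram_counts = Counter(tokens)            # same counts as the loop above
bigram_counts = Counter(bigrams(tokens))    # Counter consumes the generator directly
trigram_counts = Counter(trigrams(tokens))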