I've read a paper that uses n-gram counts as features for a classifier, and I was wondering what exactly this means.
Example text: "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam"
I can create unigrams, bigrams, trigrams, etc. out of this text, where I first have to define on which "level" to create these n-grams. The "level" can be character, syllable, word, ...
So creating unigrams out of the sentence above would simply create a list of all words?
Creating bigrams would result in word pairs, bringing together words that follow each other?
So if the paper talks about ngram counts, it simply creates unigrams, bigrams, trigrams, etc. out of the text, and counts how often which ngram occurs?
Is there an existing method in Python's nltk package, or do I have to implement a version of my own?
N-grams are contiguous sequences of words, symbols, or tokens in a document. In technical terms, they can be defined as neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.
An N-gram means a sequence of N words. So for example, “Medium blog” is a 2-gram (a bigram), “A Medium blog post” is a 4-gram, and “Write on Medium” is a 3-gram (trigram). Well, that wasn't very interesting or exciting. True, but we still have to look at the probability used with n-grams, which is quite interesting.
n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
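For the probability side mentioned above, here's a minimal sketch (the toy sentence and variable names are purely illustrative) that estimates P(next word | previous word) from bigram counts using a maximum-likelihood estimate:

from collections import Counter
from nltk import bigrams

# toy corpus, purely illustrative
tokens = "i write on medium i write a medium blog post".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

# maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
p = bigram_counts[("i", "write")] / unigram_counts["i"]
print(p)  # 1.0 here, since "i" is always followed by "write" in this toy corpus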
I found my old code, maybe it's useful.
import nltk
from nltk import bigrams
from nltk import trigrams
text="""Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam tempus vitae. Morbi justo mauris,
congue sit amet imperdiet ipsum dolor sit amet, consectetur adipiscing elit. Nullam ornare
tempor lacus, quis pellentesque diam"""
# split the text into tokens
tokens = nltk.word_tokenize(text)
tokens = [token.lower() for token in tokens if len(token) > 1]  # lowercased tokens, i.e. the unigrams
bi_tokens = list(bigrams(tokens))
tri_tokens = list(trigrams(tokens))

# print trigram counts
print([(item, tri_tokens.count(item)) for item in sorted(set(tri_tokens))])
>>>
[(('adipiscing', 'elit.', 'nullam'), 2), (('amet', 'consectetur', 'adipiscing'), 2),(('amet', 'imperdiet', 'ipsum'), 1), (('congue', 'sit', 'amet'), 1), (('consectetur', 'adipiscing', 'elit.'), 2), (('diam', 'tempus', 'vitae.'), 1), (('dolor', 'sit', 'amet'), 2), (('elit.', 'nullam', 'ornare'), 2), (('imperdiet', 'ipsum', 'dolor'), 1), (('ipsum', 'dolor', 'sit'), 2), (('justo', 'mauris', 'congue'), 1), (('lacus', 'quis', 'pellentesque'), 2), (('lorem', 'ipsum', 'dolor'), 1), (('mauris', 'congue', 'sit'), 1), (('morbi', 'justo', 'mauris'), 1), (('nullam', 'ornare', 'tempor'), 2), (('ornare', 'tempor', 'lacus'), 2), (('pellentesque', 'diam', 'tempus'), 1), (('quis', 'pellentesque', 'diam'), 2), (('sit', 'amet', 'consectetur'), 2), (('sit', 'amet', 'imperdiet'), 1), (('tempor', 'lacus', 'quis'), 2), (('tempus', 'vitae.', 'morbi'), 1), (('vitae.', 'morbi', 'justo'), 1)]
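To the question about an existing method: nltk.util.ngrams generates n-grams for arbitrary n, and nltk.FreqDist (a Counter subclass) counts them in one pass, so you don't need to call count() per item. A minimal sketch, reusing the tokens from the snippet above:

from nltk import FreqDist
from nltk.util import ngrams

# count trigrams in one pass; change the 3 for any other n
tri_counts = FreqDist(ngrams(tokens, 3))
print(tri_counts.most_common(5))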
When you count n-grams, it's better to use a hash table (dictionary) rather than calling count() repeatedly. For the above example:
unigrams = {}
for token in tokens:
    if token not in unigrams:
        unigrams[token] = 1
    else:
        unigrams[token] += 1
This gives you time complexity O(n).
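Equivalently, collections.Counter from the standard library does the same O(n) counting and works for bigrams and trigrams as well:

from collections import Counter
from nltk import bigrams, trigrams

unigram_counts = Counter(tokens)            # same counts as the loop above
bigram_counts = Counter(bigrams(tokens))    # Counter consumes the generator directly
trigram_counts = Counter(trigrams(tokens))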