
What exactly is an n-gram?

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:

Sentence: "I live in NY."

Word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"

Character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram parts, you drop the duplicates and add a counter for each part, giving the frequency:

Word-level bigrams: [1, 1, 1, 1, 1]
Character-level bigrams: [2, 1, 1, ...]
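The extraction-and-counting step described in that example can be sketched in a few lines of Python (a minimal illustration, not LingPipe's actual API; the "#" boundary marker and the naive tokenization are simplifying assumptions):

```python
from collections import Counter

def word_bigrams(sentence):
    # Pad with '#' to mark the sentence boundaries, as in the example above.
    tokens = ["#"] + sentence.strip(".").split() + ["#"]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams = word_bigrams("I live in NY.")
print(bigrams)           # ['# I', 'I live', 'live in', 'in NY', 'NY #']
print(Counter(bigrams))  # every bigram occurs once in this sentence
```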

Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that, as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12, but without stating why.

What is a good n-gram value, and how should I take it into account when using a tool like LingPipe?

Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

asked Aug 12 '13 by user2649614



2 Answers

Usually a picture is worth a thousand words.

[image: comparison of n-gram language models]

Source: http://recognize-speech.com/language-model/n-gram-model/comparison

answered Nov 29 '22 by Kamran


N-grams are simply all combinations of adjacent words or letters of length n that you can find in your source text. For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.

You can do the same on the word level. As an example, the hello, world! text contains the following word-level bigrams: # hello, hello world, world #.
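Both of the cases above can be generated by one sliding-window routine. Here is a hedged Python sketch (the function name and the per-word "#" padding convention are my own choices for illustration):

```python
def char_ngrams(word, n=2, boundaries=True):
    # Optionally pad with '#' so boundary n-grams like '#f' and 'x#' appear.
    s = f"#{word}#" if boundaries else word
    # Slide a window of length n over the string.
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("fox"))                    # ['#f', 'fo', 'ox', 'x#']
print(char_ngrams("fox", boundaries=False))  # ['fo', 'ox']
```

The same windowing idea works at the word level if you slide over a list of tokens instead of a string of characters.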

The basic point of n-grams is that they capture the language structure from the statistical point of view, like what letter or word is likely to follow the given one. The longer the n-gram (the higher the n), the more context you have to work with. Optimum length really depends on the application – if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, you may fail to capture the “general knowledge” and only stick to particular cases.
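You can see this trade-off directly by counting n-grams of increasing length on a toy text (a sketch; the sample sentence is made up purely for illustration). As n grows, fewer n-grams repeat, so the statistics become sparser and less general:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Count every window of n adjacent tokens.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

tokens = "the cat sat on the mat the cat ran".split()
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    repeats = sum(1 for c in counts.values() if c > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {repeats} seen more than once")
```

On this tiny text, the unigrams repeat often, one bigram ("the cat") repeats, and every trigram is unique; real corpora show the same pattern at larger scales, which is why very long n-grams mostly memorize particular cases.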

answered Nov 29 '22 by zoul