What exactly is an n Gram?

Tags:

sentiment-analysis

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct:

    Sentence: "I live in NY."

    word level bigrams (2 for n): "# I", "I live", "live in", "in NY", "NY #"
    character level bigrams (2 for n): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

    When you have this array of n-gram-parts, you drop the duplicate ones and add a counter for each part giving the frequency:

    word level bigrams: [1, 1, 1, 1, 1]
    character level bigrams: [2, 1, 1, ...]

Someone in the answer section confirmed this was correct, but unfortunately I'm a bit lost beyond that as I didn't fully understand everything else that was said! I'm using LingPipe and following a tutorial which stated I should choose a value between 7 and 12 - but without stating why.

What is a good value of n for n-grams, and how should I take it into account when using a tool like LingPipe?

Edit: This was the tutorial: http://cavajohn.blogspot.co.uk/2013/05/how-to-sentiment-analysis-of-tweets.html

314

asked Aug 12 '13 17:08

user2649614

2 Answers

A picture is usually worth a thousand words. [Image: https://i.stack.imgur.com/8ARA1.png]

Source: http://recognize-speech.com/language-model/n-gram-model/comparison

129

answered Nov 29 '22 08:11

Kamran

N-grams are simply all contiguous sequences of n adjacent words or letters that you can find in your source text. For example, given the word fox, the 2-grams (or “bigrams”) are fo and ox. You may also count the word boundary: that expands the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.
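To make the character-level case concrete, here is a minimal Python sketch. The function name `char_ngrams` and its parameters are made up for illustration and are not part of LingPipe or any other library. It pads the word with a single boundary marker on each side, as in the fox example above; other conventions (such as padding with n-1 markers) exist too.

```python
def char_ngrams(word, n=2, boundary="#"):
    """Return the character n-grams of one word.

    If `boundary` is a non-empty string, the word is padded with it
    once on each side before the n-grams are extracted.
    """
    padded = f"{boundary}{word}{boundary}" if boundary else word
    # Slide a window of width n over the (padded) word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("fox", boundary=""))  # ['fo', 'ox']
print(char_ngrams("fox"))               # ['#f', 'fo', 'ox', 'x#']
```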

You can do the same on the word level. As an example, the text hello, world! contains the following word-level bigrams (ignoring punctuation): # hello, hello world, world #.
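The word-level version can be sketched the same way. This is only an illustration: `word_ngrams` is a hypothetical helper, and the tokenizer (a regular expression that keeps runs of word characters and drops punctuation) is an assumption. Real toolkits usually ship their own tokenizers.

```python
import re


def word_ngrams(text, n=2, boundary="#"):
    """Return word-level n-grams, with one boundary marker at each end."""
    # Assumed tokenization: runs of word characters, punctuation dropped.
    tokens = re.findall(r"\w+", text)
    padded = [boundary] + tokens + [boundary]
    # Slide a window of n tokens over the padded token list.
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


print(word_ngrams("hello, world!"))   # ['# hello', 'hello world', 'world #']
print(word_ngrams("I live in NY."))   # ['# I', 'I live', 'live in', 'in NY', 'NY #']
```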

The basic point of n-grams is that they capture the structure of the language from a statistical point of view, such as which letter or word is likely to follow a given one. The longer the n-gram (the higher the n), the more context you have to work with. The optimum length really depends on the application: if your n-grams are too short, you may fail to capture important differences. On the other hand, if they are too long, most of them will occur only rarely in your data, so you may fail to capture the “general knowledge” and only stick to particular cases.
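The trade-off can be explored with a small counting experiment. The sketch below is an illustration only: the helper names and the toy sentence are made up, and a real corpus would be far larger. It counts character n-grams with `collections.Counter` and reports what fraction of the distinct n-grams occur exactly once. As n grows, that fraction tends to rise, which is the "only stick to particular cases" effect.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every character n-gram in `text` (no boundary padding)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def singleton_ratio(text, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = ngram_counts(text, n)
    if not counts:  # n is longer than the text
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Toy text, chosen only for illustration.
sample = "the cat sat on the mat and the cat ate the rat"
for n in (1, 2, 4, 8):
    print(n, len(ngram_counts(sample, n)), round(singleton_ratio(sample, n), 2))
```

Running the same loop on your own training data is one rough way to see where n-grams start to become too specific for the amount of text you have.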

answered Nov 29 '22 08:11

zoul
