
How do people use n-grams for sentiment analysis, considering that as n increases, the memory requirement also increases rapidly?

I am trying to do Sentiment Analysis on Tweets using Python.

To begin with, I've implemented an n-grams model. So, let's say our training data is:

I am a good kid

He is a good kid, but he didn't get along with his sister much

Unigrams:

<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much>

Bigrams:

<(i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much)>

Trigrams:

<(i am a), (am a good), (a good kid), .........>

Final feature vector:

<i, am, a, good, kid, he, but, didnt, get, along, with, his, sister, much, (i am), (am a), (a good), (good kid), (he is), (is a), (kid but), (but he), (he didnt), (didnt get), (get along), (along with), (with his), (his sister), (sister much), (i am a), (am a good), (a good kid), .........>

When we do this for a large training set of 8000 or so entries, the dimensionality of the feature vector becomes HUGE, and as a result my computer (16 GB of RAM) crashes.

So, when people mention using "n-grams" as features in the hundreds of papers out there, what are they talking about? Am I doing something wrong?

Do people always do some feature selection for "n-grams"? If so, what kind of feature selection should I look into?

I am using scikit-learn to do this.
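
Roughly, the feature extraction I have in mind looks like the sketch below (illustrative code only, not my exact script; the sentence is just the second training example from above):

    # Simplified sketch of the n-gram extraction described above
    # (illustrative only; the real code loops over ~8000 tweets).
    def extract_ngrams(tokens, n):
        """Return all n-grams of a token list as tuples."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    sentence = "he is a good kid but he didnt get along with his sister much"
    tokens = sentence.split()

    features = []
    for n in (1, 2, 3):                  # unigrams, bigrams, trigrams
        features.extend(extract_ngrams(tokens, n))

    # The final feature vector is the union of all such n-grams over every
    # tweet in the training data -- this is what explodes in size.
    print(len(features), features[:5])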

Asked Nov 09 '14 by P.C.

2 Answers

If you are storing your final feature vectors exactly as you wrote them out, I think there are some improvements to be made.

The memory issue is due to the fact that the features (the text strings) are repeated so many times, and so are the tokens. Consider this process instead:

First of all, all the distinct features are stored (and given an index).

For example,

1--feature1--(i am)

2--feature2--(am a)

...

This generates a so-called feature space.

There might be thousands of features in total, or even more, but that is normal. Each entry can then be rewritten as a series of numbers, such as:

Entry1----- <1, 1, 1, 0, ..., a_n>, where the first 1 means feature1 (i am) occurs once in this entry, and a_n is the number of occurrences of feature n.

Let's assume there are many features and the entries are short, which means each vector contains far too many zeros. We can rewrite the previous vector as follows:

Entry1----{1:1, 2:1, 3:1}, which means the values of features 1, 2 and 3 of Entry1 are 1, and the values of all the other features are zero. Shorter, isn't it?

In the end each entry is represented as a short vector, and you get a big matrix for your corpus. Your corpus might look like this now:

{1:1, 2:1, 3:1}

{2:1, 29:1, 1029:1, 20345:1}

...

16 GB of RAM is more than sufficient for 8000 entries; with this representation you will use far less.
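
For example, here is a minimal sketch of this sparse scheme using scikit-learn's CountVectorizer (I am assuming scikit-learn because the question mentions it; the two example sentences stand in for the real corpus):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "I am a good kid",
        "He is a good kid, but he didn't get along with his sister much",
    ]

    # ngram_range=(1, 3) builds unigrams, bigrams and trigrams in one pass.
    # The result is a scipy.sparse matrix: only non-zero counts are stored,
    # exactly the {feature_index: count} scheme described above.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    X = vectorizer.fit_transform(corpus)

    print(X.shape)                   # (n_entries, n_distinct_features)
    print(vectorizer.vocabulary_)    # the feature space: n-gram -> index

    # Avoid X.toarray() / X.todense() on a big corpus -- that recreates the
    # huge dense matrix and is the usual cause of memory blow-ups.

CountVectorizer builds the feature space (an n-gram-to-index mapping, exposed as vocabulary_) and returns the corpus as a sparse matrix, so only the non-zero counts ever sit in memory.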


Furthermore, if you end up with too many distinct tokens (and therefore far too many features), you can remove features whose frequency is below a threshold, say 3 occurrences, when constructing the feature space. The size of the feature space can easily be cut in half, or even more.
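
With scikit-learn this thresholding is just the min_df parameter (note that min_df counts the number of documents containing a feature rather than its total number of occurrences, but it serves the same purpose; the toy tweets below are only for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    tweets = [
        "this phone is a good phone",
        "such a good phone",
        "not a good phone at all",
        "terrible battery life",
    ]

    # min_df=3 keeps only n-grams that appear in at least 3 documents,
    # which typically shrinks the feature space dramatically.
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=3)
    X = vectorizer.fit_transform(tweets)

    print(X.shape)                   # far fewer columns than without min_df
    print(vectorizer.vocabulary_)    # only the frequent n-grams survive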

Answered Nov 14 '22 by eastdog


As inspectorG4dget said in the comments, you rarely go up to high n-grams, e.g. n=5 or n=6, because you will not have enough training data to make it worthwhile. In other words, almost all of your 6-grams will have an occurrence count of 1. Also, to quote inspectorG4dget's comment:

When these papers talk about n-grams, they're not talking about a scalable n - they're USUALLY talking about a specific n (whose value might be revealed in the results or experiments section)

So, usually memory is not the biggest concern. With a really large corpus you will split it across a cluster, then combine the results at the end. You might split based on how much memory each node in the cluster has, or, if processing a stream, you might stop and upload results (to the central node) each time you fill memory.

There are some optimizations you can do. If the corpus is being held in memory, then each n-gram only needs to be an index to the first occurrence in the corpus; the string does not need to be repeated.
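
As a rough illustration (in Python the dictionary keys still hold references to the token strings rather than fresh copies, so the saving is smaller than in a lower-level language, but the bookkeeping is the same):

    from collections import Counter

    tokens = "he is a clever dog and he is a clever cat".split()
    n = 4

    # Identify each distinct n-gram by the position of its FIRST occurrence;
    # since the tokenised corpus is already in memory, the text can always be
    # recovered as tokens[pos:pos + n] instead of being stored again.
    first_pos = {}        # n-gram -> index of its first occurrence
    counts = Counter()    # index of first occurrence -> count

    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        pos = first_pos.setdefault(gram, i)
        counts[pos] += 1

    for pos, c in counts.items():
        print(" ".join(tokens[pos:pos + n]), "->", c)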

A second optimization, if you don't mind multiple passes, is to use the (n-1)-gram result to skip over sentence parts below your threshold. E.g. if you are only interested in n-grams that occur 3+ times, and if "He is a clever" only had a count of 2 at the 4-gram analysis, then when you discover the 5-gram of "He is a clever dog" you can throw it away, as you know it only occurs once or twice. This is a memory optimization at the expense of extra CPU.
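
A sketch of that multi-pass pruning, with a toy corpus and a threshold of 3 occurrences (the helper and variable names are just illustrative):

    from collections import Counter

    def count_ngrams(tokens, n, allowed_prefixes=None, threshold=3):
        """Count n-grams, skipping any whose leading (n-1)-gram already fell
        below the threshold in the previous pass."""
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if allowed_prefixes is not None and gram[:-1] not in allowed_prefixes:
                continue    # its (n-1)-gram was rare, so this n-gram must be too
            counts[gram] += 1
        return {g: c for g, c in counts.items() if c >= threshold}

    tokens = ("he is a clever dog " * 4 + "he is a clever cat").split()

    kept = None
    for n in range(1, 6):               # passes for unigrams up to 5-grams
        kept = count_ngrams(tokens, n, allowed_prefixes=kept, threshold=3)
        print(n, "->", len(kept), "n-grams kept")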

Answered Nov 14 '22 by Darren Cook