Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python faster alternative to dictionary? [duplicate]

I'm making a simple sentiment mining system using a Naive Bayes classifier.

For training my classifier, I have a text file where each line contains a list of tokens (generated from a tweet), and the associated sentiment (0 for -ve, 4 for positive).

For example:

0 @ switchfoot http : //twitpic.com/2y1zl - Awww , that 's a bummer . You shoulda got David Carr of Third Day to do it . ; D
0 spring break in plain city ... it 's snowing
0 @ alydesigns i was out most of the day so did n't get much done
0 some1 hacked my account on aim now i have to make a new one
0 really do n't feel like getting up today ... but got to study to for tomorrows practical exam ...

Now, what I'm trying to do is for each token, count how many times it occurs in a positive tweet, and how many times it occurs in a negative tweet. I then plan to use these counts for calculating probabilities. I'm using the built-in dictionary for storing these counts. The keys are the tokens and the values are integer arrays of size 2.

The problem is that this code starts off pretty fast, but keeps getting slower and when it has processed around 200 thousand tweets, it gets really slow - around 1 tweet per second. Since my training set has 1.6 million tweets, this is too slow. The code I have is this:

def compute_counts(infile):
    f = open(infile)
    counts = {}
    i = 0
    for line in f:
        i = i + 1
        print(i)
        words = line.split(' ')
        for word in words[1:]:
            word = word.replace('\n', '').replace('\r', '')
            if words[0] == '0':
                if word in counts.keys():
                    counts[word][0] += 1
                else:
                    counts[word] = [1, 0]
            else:
                if word in counts.keys():
                    counts[word][1] += 1
                else:
                    counts[word] = [0, 1]
    return counts

What can I do to make this process faster? A better data structure?

Edit: Not a duplicate, the question is not about something faster than dict in the general case, but in this specific use case.

like image 979
hoodakaushal Avatar asked Sep 14 '14 13:09

hoodakaushal


1 Answers

Don't use if word in counts.keys() If you do that, you end up looking sequentially through the keys, which is what dict is supposed to avoid.

Just put if word in counts.

Or use a defaultdict. https://docs.python.org/2/library/collections.html#collections.defaultdict

like image 114
khelwood Avatar answered Nov 14 '22 20:11

khelwood