How can I count frequency of two items using two columns using Pandas

Question

I have a very large file (5GB), and I need to count the number of occurence using two columns

     a   b   c   d   e
0    2   3   1   5   4
1    2   3   2   5   4
2    1   3   2   5   4
3    2   4   1   5   3
4    2   4   1   5   3

so obviously I have to find

(2,3):2
(1,3):1
(2,4):2

How can I do that in a very fast way.

I used:

df.groupby(['a','b']).count().to_dict()

Let's say that the final result would be

a b freq
2 3 2
1 3 1
2 4 2

Mike Müller · Accepted Answer

Approach for the first version of the question - dictionary as result

If you have high frequencies, i.e. few combinations of a and b, the final dictionary will be small. If you have many of different combinations, you will need lots of RAM.

If you have low frequencies and enough RAM, looks like your approach is good.

Some timings for 5e6 rows and numbers from 0 to 19:

>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)

%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 552 ms per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop

Using a different range of integers, here up to sys.maxsize (9223372036854775807), changes the timings considerably:

import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)), 
                  columns=list('abcde'))


%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 41.3 s per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop`

Solution for the updated question

df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
print(g)

   a  b  feq
0  1  3    1
1  2  3    2
2  2  4    2

It is not much faster though.

For range 0 to 19:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 

1 loops, best of 3: 564 ms per loop

For range 0 to sys.maxsize:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
1 loops, best of 3: 10.2 s per loop

How can I count frequency of two items using two columns using Pandas

Tags:

python

pandas

user3378649

1 Answers

Approach for the first version of the question - dictionary as result

Solution for the updated question

Mike Müller

Recent Activity

Donate For Us

How can I count frequency of two items using two columns using Pandas

Tags:

python

pandas

user3378649

1 Answers

Approach for the first version of the question - dictionary as result

Solution for the updated question

Mike Müller

Related questions

Recent Activity

Donate For Us