Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I count frequency of two items using two columns using Pandas

Tags:

python

pandas

I have a very large file (5GB), and I need to count the number of occurence using two columns

     a   b   c   d   e
0    2   3   1   5   4
1    2   3   2   5   4
2    1   3   2   5   4
3    2   4   1   5   3
4    2   4   1   5   3

so obviously I have to find

(2,3):2
(1,3):1
(2,4):2

How can I do that in a very fast way.

I used:

df.groupby(['a','b']).count().to_dict() 

Let's say that the final result would be

a b freq
2 3 2
1 3 1
2 4 2
like image 914
user3378649 Avatar asked Dec 07 '25 05:12

user3378649


1 Answers

Approach for the first version of the question - dictionary as result

If you have high frequencies, i.e. few combinations of a and b, the final dictionary will be small. If you have many of different combinations, you will need lots of RAM.

If you have low frequencies and enough RAM, looks like your approach is good.

Some timings for 5e6 rows and numbers from 0 to 19:

>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)

%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 552 ms per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop

Using a different range of integers, here up to sys.maxsize (9223372036854775807), changes the timings considerably:

import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)), 
                  columns=list('abcde'))


%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 41.3 s per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop`

Solution for the updated question

df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
print(g)

   a  b  feq
0  1  3    1
1  2  3    2
2  2  4    2

It is not much faster though.

For range 0 to 19:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 

1 loops, best of 3: 564 ms per loop

For range 0 to sys.maxsize:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
1 loops, best of 3: 10.2 s per loop
like image 58
Mike Müller Avatar answered Dec 08 '25 19:12

Mike Müller



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!