Fast way to count occurrences of all values in a pandas DataFrame

Suppose I have the following data:

import pandas as pd
import numpy as np
import random
from string import ascii_uppercase

random.seed(100)

n = 1000000

# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}

df = pd.DataFrame(data)

I want to quickly count the global occurrence of each value in the set of all values in the dataframe.

This works:

from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])

But is very slow:

%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop

I thought this function might speed things up by using some of pandas's horsepower:

def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts

It's definitely faster (it runs in about 75% of the time of the previous approach), but there must be something quicker...

%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop

The results of the above two approaches are identical (with a little modification of the results returned by quick_global_count):

dict(c) == quick_global_count(df).to_dict()[0]
True

What's a quicker way of counting global occurrences of values in a dataframe?

asked Oct 15 '17 by blacksite

1 Answer

Approach #1

Well, the NumPy trick would be to convert to numbers (that's where NumPy shines) and simply let np.bincount do the counting -

# Fill NaNs with '[' (ASCII 91, just past 'Z') so they fall outside the A-Z range
a = df.fillna('[').values.astype(str).view(np.uint8)
# Count every byte value, then keep codes 65-90, i.e. 'A'..'Z' (the trailing slot for '[' is dropped)
count = np.bincount(a.ravel())[65:-1]

This works for single characters: np.bincount(a.ravel()) holds the count for every byte value, and the [65:-1] slice keeps just the codes for 'A' through 'Z'.
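
The positions in that sliced array map back to 'A'..'Z' (ASCII codes 65-90), so a readable mapping is one zip away. A tiny sketch reusing count from above (letter_counts is just an illustrative name):

# Slot i of `count` corresponds to the i-th uppercase letter: slot 0 -> 'A', slot 25 -> 'Z'
letter_counts = dict(zip(ascii_uppercase, count))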

Approach #1S (super-charged)

The previous approach had bottlenecks at the string conversion: astype(str). Also, fillna() was another show-stopper. More trickery was needed to super-charge it by getting around those bottlenecks. Now, astype('S1') can be used upfront to force everything to single-character bytestrings. Single characters stay put, while the NaNs get reduced to just the single character 'n' (the first byte of 'nan'). This lets us skip fillna, as the count for 'n' can simply be skipped later on with indexing.

Hence, the implementation would be -

def app1S(df):
    # Cast to single-byte strings: letters stay put, NaNs collapse to b'n'
    ar = df.values.astype('S1')
    # Reinterpret the bytes as integer codes
    a = ar.view(np.uint8)
    # Keep only the counts for codes 65-90, i.e. 'A'..'Z' ('n' falls outside this range)
    count = np.bincount(a.ravel())[65:65+26]
    return count
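
As a quick sanity check, the result can be zipped with the letters and compared against the Counter c from the question. A rough sketch under that assumption:

# Slot i of app1S's output is the global count of the i-th uppercase letter
res = dict(zip(ascii_uppercase, app1S(df)))
# The Counter `c` from the question keys NaNs as -999, so compare on the letters only
assert all(res[k] == c[k] for k in ascii_uppercase)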

Timings on pandas-0.20.3 and numpy-1.13.3 -

In [3]: # Setup input
   ...: random.seed(100)
   ...: n = 1000000
   ...: data = {letter: [random.choice(list(ascii_uppercase) + 
   ...:         [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
   ...: df = pd.DataFrame(data)
   ...: 

# @Wen's soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop

# @andrew_reece's soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop

# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop

Generic case

We can also use np.unique to cover generic cases (data with more than single characters) -

unq, count = np.unique(df.fillna(-999), return_counts=1)
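
If a plain dict keyed by value is handier, the two arrays zip together directly. One caveat: on Python 3, mixing the integer -999 sentinel with strings makes the sort inside np.unique raise a TypeError, so this small sketch swaps in a string sentinel:

# A string sentinel keeps np.unique's sort happy on object arrays; '-999' then holds the NaN count
unq, count = np.unique(df.fillna('-999').values.ravel(), return_counts=True)
generic_counts = dict(zip(unq, count))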

answered Nov 14 '22 by Divakar