Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
I want to quickly count the global occurrence of each value in the set of all values in the dataframe.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But is very slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
I thought this function might speed things up by using some of pandas's horsepower:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in the dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts
It's definitely faster (about 75% of the runtime of the previous approach), but there must be something quicker...
%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop
The results of the above two approaches are identical (with a little modification of the results returned by quick_global_count):
dict(c) == quick_global_count(df).to_dict()[0]
True
What's a quicker way of counting global occurrences of values in a dataframe?
Approach #1
Well, the NumPy trick would be to convert to numbers (that's where NumPy shines) and simply let bincount do the counting -
a = df.fillna('[').values.astype(str).view(np.uint8)
count = np.bincount(a.ravel())[65:-1]
This works for single characters. np.bincount(a.ravel()) holds the counts for all the characters.
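To make those raw counts easier to read, the 26 entries in that slice can be zipped back onto the letters. A minimal sketch of my own (not part of the original answer), reusing the a computed above:
# Label the A-Z slice of the bincount output with its letters
count = np.bincount(a.ravel())[65:-1]
letter_counts = dict(zip(ascii_uppercase, count.tolist()))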
Approach #1S (super-charged)
The previous approach had a bottleneck at the string conversion, astype(str). Also, the fillna() was another show-stopper. More trickery was needed to super-charge it by getting around those bottlenecks. Now, astype('S1') can be used upfront to force everything to a single character. Single characters stay put, while the NaNs get reduced to just the single character 'n'. This lets us skip fillna, as the count for 'n' can simply be skipped later on with indexing.
Hence, the implementation would be -
def app1S(df):
    ar = df.values.astype('S1')
    a = ar.view(np.uint8)
    count = np.bincount(a.ravel())[65:65+26]
    return count
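If you want the result keyed by letter, or want to sanity-check it against the Counter from the question, something along these lines should work (a sketch of mine, not from the original answer; it assumes c is the Counter built earlier):
counts = app1S(df)
letter_counts = dict(zip(ascii_uppercase, counts.tolist()))
# The NaN bucket (-999) from the Counter has no counterpart here, so skip it
assert all(letter_counts[k] == v for k, v in c.items() if k != -999)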
Timings on pandas-0.20.3 and numpy-1.13.3 -
In [3]: # Setup input
...: random.seed(100)
...: n = 1000000
...: data = {letter: [random.choice(list(ascii_uppercase) +
...: [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
...: df = pd.DataFrame(data)
...:
# @Wen's soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop
# @andrew_reece's soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop
# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop
Generic case
We can also use np.unique to cover generic cases (data with more than single characters) -
unq, count = np.unique(df.fillna(-999), return_counts=1)
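Note that sorting mixed int and str values can be an issue for np.unique on Python 3, so a string sentinel keeps everything comparable, and the result can be folded into a plain dict. A small sketch of my own, not from the original answer:
# Fill NaNs with a string sentinel so all values share one comparable type
unq, count = np.unique(df.fillna('-999'), return_counts=True)
global_counts = dict(zip(unq.tolist(), count.tolist()))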