What is the fastest way (within the limits of sane pythonicity) to count distinct values, across columns of the same dtype, for each row in a DataFrame?
Details: I have a DataFrame of categorical outcomes by subject (in rows) by day (in columns), similar to something generated by the following.
import numpy as np
import pandas as pd
def genSampleData(custCount, dayCount, discreteChoices):
"""generate example dataset"""
np.random.seed(123)
return pd.concat([
pd.DataFrame({'custId':np.array(range(1,int(custCount)+1))}),
pd.DataFrame(
columns = np.array(['day%d' % x for x in range(1,int(dayCount)+1)]),
data = np.random.choice(a=np.array(discreteChoices),
size=(int(custCount), int(dayCount)))
)], axis=1)
For example, if the dataset tells us which drink each customer ordered on each visit to a store, I would like to know the count of distinct drinks per customer.
# notional discrete choice outcome
drinkOptions, drinkIndex = np.unique(['coffee','tea','juice','soda','water'],
                                     return_inverse=True)
# integer-coded discrete choice outcomes
d = genSampleData(2,3, drinkIndex)
d
#   custId  day1  day2  day3
#0       1     1     4     1
#1       2     3     2     1
# Count distinct choices per subject -- this is what I want to do efficiently on larger DF
d.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#0 2
#1 3
# Note: I have coded the choices as `int` rather than `str` to speed up comparisons.
# To reconstruct the choice names, we could do:
# d.iloc[:,1:] = drinkOptions[d.iloc[:,1:]]
What I have tried: The datasets in this use case will have many more subjects than days (example testDf below), so I have tried to find the most efficient row-wise operation:
testDf = genSampleData(100000,3, drinkIndex)
#---- Original attempts ----
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: x.nunique(), axis=1)
# I didn't wait for this to finish -- something more than 5 seconds per loop
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(x.unique()), axis=1)
# Also too slow
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#20 loops, best of 3: 2.07 s per loop
To improve on my original attempt, we note that pandas.DataFrame.apply() accepts the argument raw: "If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance."
This did cut the runtime by more than half:
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
#20 loops, best of 3: 721 ms per loop *best so far*
I was surprised that a pure numpy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr = testDf.iloc[:,1:].values)
#20 loops, best of 3: 1.04 s per loop
Finally, I also tried transposing the data in order to do a column-wise count distinct, which I thought might be more efficient (at least for DataFrame.apply()), but there didn't seem to be a meaningful difference.
%timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
#20 loops, best of 3: 712 ms per loop *best so far*
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=0, arr = testDf.iloc[:,1:].values.T)
# 20 loops, best of 3: 1.13 s per loop
So far my best solution is a strange mix of df.apply and len(np.unique()), but what else should I try?
My understanding is that nunique is optimized for large series. Here, you have only 3 days. Comparing each column against the others seems to be faster:
testDf = genSampleData(100000,3, drinkIndex)
days = testDf.columns[1:]
%timeit testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
10 loops, best of 3: 46.8 ms per loop
%timeit pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
10 loops, best of 3: 47.6 ms per loop
%%timeit
testDf['nunique'] = 1
for col1, col2 in zip(days, days[1:]):
    testDf['nunique'] += ~(testDf[[col2]].values == testDf.loc[:, 'day1':col1].values).any(axis=1)
100 loops, best of 3: 3.83 ms per loop
It loses its edge when you add more columns, of course. For different numbers of columns (in the same order: stack().groupby(), pd.melt().groupby(), and the loop):
10 columns: 143ms, 161ms, 30.9ms
50 columns: 749ms, 968ms, 635ms
100 columns: 1.52s, 2.11s, 2.33s
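To make the trick easier to reuse, here is a minimal sketch of the same pairwise-comparison idea wrapped in a function. The name count_distinct_rowwise and the assumption that every column other than custId is a day column are mine, not part of the answer above:

import numpy as np
import pandas as pd

def count_distinct_rowwise(df, id_col='custId'):
    """Sketch: count distinct values per row across all non-id columns."""
    vals = df.drop(id_col, axis=1).values
    # the first column always contributes exactly one distinct value
    counts = np.ones(len(df), dtype=int)
    for i in range(1, vals.shape[1]):
        # a column adds a new distinct value only if it matches none of the columns to its left
        counts += ~(vals[:, [i]] == vals[:, :i]).any(axis=1)
    return pd.Series(counts, index=df.index)

count_distinct_rowwise(testDf)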
pandas.melt with DataFrame.groupby and groupby.SeriesGroupBy.nunique seems to blow the other solutions away:
%timeit -n20 pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
#20 loops, best of 3: 67.3 ms per loop
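As a hypothetical follow-up (mine, not part of the answer): the melt/groupby pipeline returns a Series indexed by custId, so you can attach the counts back onto the wide frame with a merge, for example:

# illustrative only: join the per-customer distinct count back onto testDf
nuniq = pd.melt(testDf, id_vars='custId').groupby('custId').value.nunique()
result = testDf.merge(nuniq.rename('nunique'), left_on='custId', right_index=True)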
You don't need custId. I'd stack, then groupby:
testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
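A note from me, not the answerer: stack() moves the day columns into a second index level, so groupby(level=0) groups on the original row label. If you would rather have the result keyed by custId, a small variation on the same stack/groupby technique is to set the index first:

# re-key by customer id before stacking (my variation, same technique)
testDf.set_index('custId').stack().groupby(level=0).nunique()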