What is the fastest way (within the limits of sane pythonicity) to count distinct values, across columns of the same dtype, for each row in a DataFrame?
Details: I have a DataFrame of categorical outcomes by subject (in rows) by day (in columns), similar to something generated by the following.
import numpy as np
import pandas as pd
def genSampleData(custCount, dayCount, discreteChoices):
"""generate example dataset"""
np.random.seed(123)
return pd.concat([
pd.DataFrame({'custId':np.array(range(1,int(custCount)+1))}),
pd.DataFrame(
columns = np.array(['day%d' % x for x in range(1,int(dayCount)+1)]),
data = np.random.choice(a=np.array(discreteChoices),
size=(int(custCount), int(dayCount)))
)], axis=1)
For example, if the dataset tells us which drink each customer ordered on each visit to a store, I would like to know the count of distinct drinks per customer.
# notional discrete choice outcome
drinkOptions, drinkIndex = np.unique(['coffee','tea','juice','soda','water'],
                                     return_inverse=True)
# integer-coded discrete choice outcomes
d = genSampleData(2,3, drinkIndex)
d
#   custId  day1  day2  day3
#0       1     1     4     1
#1       2     3     2     1
# Count distinct choices per subject -- this is what I want to do efficiently on larger DF
d.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#0 2
#1 3
# Note: I have coded the choices as `int` rather than `str` to speed up comparisons.
# To reconstruct the choice names, we could do:
# d.iloc[:,1:] = drinkOptions[d.iloc[:,1:]]
What I have tried: The datasets in this use case will have many more subjects than days (example testDf below), so I have tried to find the most efficient row-wise operation:
testDf = genSampleData(100000,3, drinkIndex)
#---- Original attempts ----
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: x.nunique(), axis=1)
# I didn't wait for this to finish -- something more than 5 seconds per loop
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(x.unique()), axis=1)
# Also too slow
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1)
#20 loops, best of 3: 2.07 s per loop
To improve on my original attempt, we note that pandas.DataFrame.apply() accepts the argument raw: "If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance."
This did cut the runtime by more than half:
%timeit -n20 testDf.iloc[:,1:].apply(lambda x: len(np.unique(x)), axis=1, raw=True)
#20 loops, best of 3: 721 ms per loop *best so far*
I was surprised that a pure numpy solution, which would seem to be equivalent to the above with raw=True, was actually a bit slower:
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=1, arr = testDf.iloc[:,1:].values)
#20 loops, best of 3: 1.04 s per loop
Finally, I also tried transposing the data in order to do a column-wise count distinct, which I thought might be more efficient (at least for DataFrame.apply()), but there didn't seem to be a meaningful difference.
%timeit -n20 testDf.iloc[:,1:].T.apply(lambda x: len(np.unique(x)), raw=True)
#20 loops, best of 3: 712 ms per loop *best so far*
%timeit -n20 np.apply_along_axis(lambda x: len(np.unique(x)), axis=0, arr = testDf.iloc[:,1:].values.T)
# 20 loops, best of 3: 1.13 s per loop
So far my best solution is a strange mix of df.apply and len(np.unique()), but what else should I try?
My understanding is that nunique is optimized for large series. Here, you have only 3 days. Comparing each column against the others seems to be faster:
testDf = genSampleData(100000,3, drinkIndex)
days = testDf.columns[1:]
%timeit testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
10 loops, best of 3: 46.8 ms per loop
%timeit pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
10 loops, best of 3: 47.6 ms per loop
%%timeit
testDf['nunique'] = 1
for col1, col2 in zip(days, days[1:]):
    testDf['nunique'] += ~(testDf[[col2]].values == testDf.loc[:, 'day1':col1].values).any(axis=1)
100 loops, best of 3: 3.83 ms per loop
It loses its edge when you add more columns, of course. For different numbers of columns (in the same order: stack().groupby(), pd.melt().groupby(), and the loop):
10 columns: 143ms, 161ms, 30.9ms
50 columns: 749ms, 968ms, 635ms
100 columns: 1.52s, 2.11s, 2.33s
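To make the trick easier to reuse, here is a minimal sketch of the same pairwise-comparison idea wrapped in a function. The name count_distinct_rowwise and the assumption that every column other than custId is a day column are mine, not part of the answer above:

import numpy as np
import pandas as pd

def count_distinct_rowwise(df, id_col='custId'):
    """Sketch: count distinct values per row across all non-id columns."""
    vals = df.drop(id_col, axis=1).values
    # the first column always contributes exactly one distinct value
    counts = np.ones(len(df), dtype=int)
    for i in range(1, vals.shape[1]):
        # a column adds a new distinct value only if it matches none of the columns to its left
        counts += ~(vals[:, [i]] == vals[:, :i]).any(axis=1)
    return pd.Series(counts, index=df.index)

count_distinct_rowwise(testDf)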
pandas.melt with DataFrame.groupby and groupby.SeriesGroupBy.nunique seems to blow the other solutions away:
%timeit -n20 pd.melt(testDf, id_vars ='custId').groupby('custId').value.nunique()
#20 loops, best of 3: 67.3 ms per loop
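As a hypothetical follow-up (mine, not part of the answer): the melt/groupby pipeline returns a Series indexed by custId, so you can attach the counts back onto the wide frame with a merge, for example:

# illustrative only: join the per-customer distinct count back onto testDf
nuniq = pd.melt(testDf, id_vars='custId').groupby('custId').value.nunique()
result = testDf.merge(nuniq.rename('nunique'), left_on='custId', right_index=True)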
You don't need custId. I'd stack, then groupby:
testDf.iloc[:, 1:].stack().groupby(level=0).nunique()
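A note from me, not the answerer: stack() moves the day columns into a second index level, so groupby(level=0) groups on the original row label. If you would rather have the result keyed by custId, a small variation on the same stack/groupby technique is to set the index first:

# re-key by customer id before stacking (my variation, same technique)
testDf.set_index('custId').stack().groupby(level=0).nunique()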