How to count distinct values in a combination of columns while grouping by in pandas?

I have a pandas data frame. I want to group it by one combination of columns and count the distinct values of another combination of columns.

For example I have the following data frame:

   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000

I can group it by columns a and b and count distinct values in the column d:

df.groupby(['a','b'])['d'].nunique().reset_index()

As a result I get:

   a   b  d
0  1  10  1
1  1  20  2

However, I would like to count distinct values of a combination of columns. For example, if I use c and d, then the first group has only one unique combination, (100, 1000), while the second group has two distinct combinations: (100, 1000) and (100, 2000).
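To make the goal concrete, this is the result I expect for the example above (the column name nunique_cd is just a placeholder I chose):

import pandas as pd

expected = pd.DataFrame({'a': [1, 1], 'b': [10, 20], 'nunique_cd': [1, 2]})
print (expected)
   a   b  nunique_cd
0  1  10           1
1  1  20           2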

The following naive "generalization" does not work:

df.groupby(['a','b'])[['c','d']].nunique().reset_index()

because nunique() is not applicable to data frames in this way: it counts distinct values of a single Series, not distinct combinations of columns.
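For completeness, on pandas versions where DataFrameGroupBy.nunique is accepted, the naive call counts each column separately rather than the (c, d) combinations, so it still does not give what I want:

df.groupby(['a','b'])[['c','d']].nunique()
      c  d
a b
1 10  1  1
  20  1  2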

asked Nov 25 '16 by Roman


1 Answer

You can create a combined key by converting the values to strings in a new column e and then use SeriesGroupBy.nunique:

df['e'] = df.c.astype(str) + df.d.astype(str)
df1 = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df1)
   a   b  e
0  1  10  1
1  1  20  2
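Note that plain concatenation without a separator can map different pairs to the same key (for example str(1) + str(10) and str(11) + str(0) both give '110'); if that is a risk with your data, a safer variant of the same idea is:

df['e'] = df.c.astype(str) + '_' + df.d.astype(str)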

You can also work with the Series directly, without creating a new column:

df1 = (df.c.astype(str) + df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df1)
   a   b  f
0  1  10  1
1  1  20  2

Another possible solution is to create tuples:

df1 = (df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df1)
   a   b  f
0  1  10  1
1  1  20  2

Another NumPy-based solution, adapted from this answer:

import numpy as np

def f(x):
    # view each (c, d) row as a single opaque fixed-size value so that
    # np.unique counts unique rows rather than unique scalars
    a = np.ascontiguousarray(x.values)
    rows = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    return len(np.unique(rows))

print (df.groupby(['a','b'])[['c','d']].apply(f))

Timings:

#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N,5)))
df.columns = list('abcde')
print (df)  

In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loop

In [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loop

In [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loop

In [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loop

In [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
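
For reference, the approach timed in In [356] can also be spelled out as a standalone solution; the result column name nunique_cd is only illustrative:

df1 = (df.groupby(['a','b','c','d'])
         .size()
         .groupby(level=['a','b'])
         .size()
         .reset_index(name='nunique_cd'))
print (df1)

It counts the size of every (a, b, c, d) group first, and then counts how many such groups fall under each (a, b).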
answered Sep 28 '22 by jezrael