
Finding count of distinct elements in DataFrame in each column

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

    import pandas as pd
    import numpy as np

    # Generate data.
    NROW = 10000
    NCOL = 100
    df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                      columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:

    col0    9538
    col1    9505
    col2    9524

What would be the most efficient way to do this? The method will be applied to files larger than 1.5 GB.


Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

    %timeit df.apply(lambda x: len(x.unique()))
    10 loops, best of 3: 49.5 ms per loop

    %timeit df.nunique()
    10 loops, best of 3: 59.7 ms per loop

    %timeit df.apply(pd.Series.nunique)
    10 loops, best of 3: 60.3 ms per loop

    %timeit df.T.apply(lambda x: x.nunique(), axis=1)
    10 loops, best of 3: 60.5 ms per loop
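One caveat when comparing these variants: they are only interchangeable when the data has no missing values, as is the case for the random integers generated here. `Series.unique()` keeps `NaN` as an entry, while `nunique()` drops it by default. A minimal sketch of the difference, using a made-up column `s` that is not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one repeated value and one missing value.
s = pd.Series([1.0, 2.0, 2.0, np.nan])

count_via_unique = len(s.unique())        # NaN is kept as a distinct entry
count_via_nunique = s.nunique()           # NaN is dropped by default
count_keep_nan = s.nunique(dropna=False)  # opt in to counting NaN

print(count_via_unique, count_via_nunique, count_keep_nan)
```

If the real files contain missing values, it is worth deciding which of the two counts is wanted before picking the fastest spelling.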

ajknzhol asked May 28 '15 10:05



2 Answers

As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:

    df.nunique()

    a    4
    b    5
    c    1
    dtype: int64

Other legacy options:

You could transpose the df and then use apply to call nunique row-wise:

    In [205]:
    df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
    df

    Out[205]:
       a  b  c
    0  0  1  1
    1  1  2  1
    2  1  3  1
    3  2  4  1
    4  3  5  1

    In [206]:
    df.T.apply(lambda x: x.nunique(), axis=1)

    Out[206]:
    a    4
    b    5
    c    1
    dtype: int64

EDIT

As pointed out by @ajcr the transpose is unnecessary:

    In [208]:
    df.apply(pd.Series.nunique)

    Out[208]:
    a    4
    b    5
    c    1
    dtype: int64
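As a quick sanity check, the three spellings in this answer can be compared on the same small frame. This is only a sketch; the variable names `direct`, `via_apply`, and `via_transpose` are introduced here for illustration.

```python
import pandas as pd

# Same sample frame as in the answer.
df = pd.DataFrame({'a': [0, 1, 1, 2, 3],
                   'b': [1, 2, 3, 4, 5],
                   'c': [1, 1, 1, 1, 1]})

direct = df.nunique()                                      # pandas >= 0.20
via_apply = df.apply(pd.Series.nunique)                    # column-wise apply
via_transpose = df.T.apply(lambda x: x.nunique(), axis=1)  # row-wise on the transpose

# All three give the same per-column distinct counts.
assert direct.to_dict() == via_apply.to_dict() == via_transpose.to_dict()
print(direct.to_dict())
```

Since the results agree, the choice between them comes down to pandas version and readability; `df.nunique()` avoids both the transpose and the explicit `apply`.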
EdChum answered Oct 13 '22 23:10


A pandas Series has a .value_counts() method that provides exactly what you want. Check out the documentation for the method.
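Note that `value_counts()` returns the frequency of each distinct value rather than a single number, so the distinct count is the length of its result. A minimal sketch on a made-up column (the names `s`, `counts`, and `n_distinct` are illustrative only):

```python
import pandas as pd

# Hypothetical sample column.
s = pd.Series([0, 1, 1, 2, 3])

counts = s.value_counts()  # index: distinct values, values: how often each occurs
n_distinct = len(counts)   # number of distinct values in the column

print(counts)
print(n_distinct)
```

For a whole DataFrame this could be applied per column, for example `df.apply(lambda col: len(col.value_counts()))`, though it does more work than `nunique()` because it also tallies the frequencies.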

CaMaDuPe85 answered Oct 13 '22 23:10