Finding count of distinct elements in DataFrame in each column

Tags: python, pandas, numpy

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

    import pandas as pd
    import numpy as np

    # Generate data.
    NROW = 10000
    NCOL = 100
    df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                      columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:

    col0    9538
    col1    9505
    col2    9524

What is the most efficient way to do this? The method will be applied to files larger than 1.5 GB.

Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

    %timeit df.apply(lambda x: len(x.unique()))
    10 loops, best of 3: 49.5 ms per loop

    %timeit df.nunique()
    10 loops, best of 3: 59.7 ms per loop

    %timeit df.apply(pd.Series.nunique)
    10 loops, best of 3: 60.3 ms per loop

    %timeit df.T.apply(lambda x: x.nunique(), axis=1)
    10 loops, best of 3: 60.5 ms per loop
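One caveat when comparing these timings: `len(x.unique())` and `nunique()` are not strictly interchangeable. By default `nunique()` ignores NaN, whereas `unique()` returns NaN as one of the values. A minimal sketch on a small made-up frame (the name `demo` and its data are purely illustrative and not part of the original benchmark):

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the question: column 'a' contains a missing value.
demo = pd.DataFrame({'a': [1.0, 2.0, np.nan, 2.0], 'b': [1, 1, 2, 3]})

via_unique = demo.apply(lambda x: len(x.unique()))  # NaN is counted as a value
via_nunique = demo.nunique()                        # NaN is dropped by default
via_nunique_keep = demo.nunique(dropna=False)       # NaN is counted again

print(via_unique)
print(via_nunique)
print(via_nunique_keep)
```

On data without missing values, such as the random integers above, the two approaches should give the same counts.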


asked May 28 '15 10:05

ajknzhol

2 Answers

As of pandas 0.20 we can use nunique directly on DataFrames, i.e.:

    df.nunique()

    a    4
    b    5
    c    1
    dtype: int64

Other legacy options:

You could transpose the DataFrame and then use apply to call nunique row-wise:

    In [205]:
    df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
    df

    Out[205]:
       a  b  c
    0  0  1  1
    1  1  2  1
    2  1  3  1
    3  2  4  1
    4  3  5  1

    In [206]:
    df.T.apply(lambda x: x.nunique(), axis=1)

    Out[206]:
    a    4
    b    5
    c    1
    dtype: int64

EDIT

As pointed out by @ajcr the transpose is unnecessary:

    In [208]:
    df.apply(pd.Series.nunique)

    Out[208]:
    a    4
    b    5
    c    1
    dtype: int64
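For the files larger than 1.5 GB mentioned in the question, the data may not fit in memory at once. One possible approach, sketched here under the assumption that the data is stored as CSV, is to read it in chunks with `pd.read_csv(..., chunksize=...)` and keep one set of seen values per column. The helper name `nunique_chunked` and the chunk size are hypothetical, and the sets themselves can still grow large when columns have many distinct values.

```python
import io
import pandas as pd

def nunique_chunked(source, chunksize=100_000):
    """Count distinct non-null values per column, reading the CSV in chunks."""
    seen = {}  # column name -> set of values seen so far
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for col in chunk.columns:
            # dropna() mirrors the default behaviour of nunique().
            seen.setdefault(col, set()).update(chunk[col].dropna().unique())
    return pd.Series({col: len(vals) for col, vals in seen.items()})

# Tiny in-memory stand-in for a large file, just to show usage.
csv_text = "a,b,c\n0,1,1\n1,2,1\n1,3,1\n2,4,1\n3,5,1\n"
counts = nunique_chunked(io.StringIO(csv_text), chunksize=2)
print(counts)
```

When the whole file does fit in memory, `df.nunique()` is simpler and avoids the Python-level loop over chunks.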


answered Oct 13 '22 23:10

EdChum

A pandas.Series has a .value_counts() method that returns the number of occurrences of each distinct value, so the length of its result is the number of distinct values. Check out the documentation for the method.
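Note that `value_counts()` works on a single column and returns one entry per distinct value, so getting the per-column count asked for in the question takes one more step. A minimal sketch, reusing the small example frame from the other answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# value_counts() lists how often each distinct value occurs in one column.
per_value = df['a'].value_counts()
print(per_value)

# Its length is the number of distinct (non-null) values in that column.
distinct_per_column = df.apply(lambda col: len(col.value_counts()))
print(distinct_per_column)
```

This does more work than `nunique()` because it also tallies the occurrences, so it is mainly useful when the frequencies themselves are needed.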

answered Oct 13 '22 23:10

CaMaDuPe85
