Different outcome using pandas nunique() and unique()

Tags:

pandas

dataframe

unique

I have a large DataFrame with 10 million rows and I need to find the number of unique values in each column.

I wrote the function below (it needs to return a Series):

def count_unique_values(df):
    return pd.Series(df.nunique())

and I get this output:

Area          210
Item          436
Element         4
Year           53
Unit            2
Value      313640
dtype: int64

The expected result for Value is 313641.

When I just run

df['Value'].unique()

I do get 313641 values. I can't figure out why nunique() returns one fewer for that column only.

asked May 26 '19 05:05

ShayHa

1 Answer

DataFrame.nunique omits missing values because its default parameter is dropna=True, while Series.unique keeps them. A column that contains NaN therefore shows one value fewer with nunique().

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({
        'A':list('abcdef'),
        'D':[np.nan,3,5,5,3,5],
})

print (df)
   A    D
0  a  NaN
1  b  3.0
2  c  5.0
3  d  5.0
4  e  3.0
5  f  5.0

def count_unique_values(df):
    return df.nunique()

print (count_unique_values(df))
A    6
D    2
dtype: int64

print (df['D'].unique())
[nan  3.  5.]

print (df['D'].nunique())
2

print (df['D'].unique())
[nan  3.  5.]

The solution is to add the parameter dropna=False:

print (df['D'].nunique(dropna=False))
3

print (len(df['D'].unique()))
3
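
As a small check of the behaviour described above, here is a sketch using a made-up Series `s` (not from the question). It assumes only float data with two NaN entries, and shows that several NaN values still count as a single extra unique value:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration: two NaN values among the floats.
s = pd.Series([np.nan, 3, 5, np.nan, 3, 5])

without_nan = s.nunique()            # NaN dropped (dropna=True is the default)
with_nan = s.nunique(dropna=False)   # NaN counted once as a distinct value
from_unique = len(s.unique())        # unique() keeps NaN in the returned array

print(without_nan, with_nan, from_unique)
```

So the gap between the two counts is at most one per column, no matter how many cells are missing.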

So in your function:

def count_unique_values(df):
    return df.nunique(dropna=False)

print (count_unique_values(df))
A    6
D    3
dtype: int64
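
To see which columns are affected before changing the function, one option is a side-by-side report. The helper name `unique_count_report` and its column labels are assumptions made for this sketch, not part of pandas:

```python
import numpy as np
import pandas as pd

def unique_count_report(df):
    """Hypothetical helper: compare unique counts with and without NaN."""
    return pd.DataFrame({
        "nunique": df.nunique(),                        # NaN excluded
        "nunique_with_nan": df.nunique(dropna=False),   # NaN included
        "n_missing": df.isna().sum(),                   # number of missing cells
    })

# Same sample frame as in the answer above.
df = pd.DataFrame({"A": list("abcdef"), "D": [np.nan, 3, 5, 5, 3, 5]})
report = unique_count_report(df)
print(report)
```

Any column where the two counts differ has at least one missing value, which would explain a difference like the one seen for Value in the question.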

answered Sep 20 '22 01:09

jezrael
