Different outcome using pandas nunique() and unique()

I have a large DataFrame with 10 million rows and I need to find the number of unique values in each column.

I wrote the function below (it needs to return a Series):

def count_unique_values(df):
    return pd.Series(df.nunique())

and I get this output:

Area          210
Item          436
Element         4
Year           53
Unit            2
Value      313640
dtype: int64

The expected result for Value is 313641.

When I just run

df['Value'].unique()

I do get that number of values. I can't figure out why nunique() gives one fewer for that column only.

ShayHa asked May 26 '19 05:05


1 Answer

DataFrame.nunique omits missing values because its default parameter is dropna=True, whereas Series.unique keeps them.

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'D': [np.nan, 3, 5, 5, 3, 5],
})

print (df)
   A    D
0  a  NaN
1  b  3.0
2  c  5.0
3  d  5.0
4  e  3.0
5  f  5.0

def count_unique_values(df):
    return df.nunique()

print (count_unique_values(df))
A    6
D    2
dtype: int64

print (df['D'].unique())
[nan  3.  5.]

print (df['D'].nunique())
2

The solution is to add the parameter dropna=False:

print (df['D'].nunique(dropna=False))
3

print (len(df['D'].unique()))
3
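
This also explains why the two counts in the question differ by exactly one. As a minimal sketch, assuming a float column with several missing values (the Series `s` below is made up for illustration), every NaN collapses into a single entry, so the gap is 1 no matter how many NaNs the column holds:

```python
import numpy as np
import pandas as pd

# Hypothetical column containing three missing values
s = pd.Series([np.nan, 3.0, np.nan, 5.0, np.nan])

n_without_nan = s.nunique()            # NaN is excluded (dropna=True by default)
n_with_nan = s.nunique(dropna=False)   # NaN is counted once
n_from_unique = len(s.unique())        # unique() keeps a single NaN entry

print(n_without_nan, n_with_nan, n_from_unique)
```

Here the first count is 2 and the other two are 3, even though NaN appears three times.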

So in your function:

def count_unique_values(df):
    return df.nunique(dropna=False)
print (count_unique_values(df))
A    6
D    3
dtype: int64
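
If it helps to see which columns are affected before changing the function, a quick check along these lines might work. It is only a sketch: the small DataFrame stands in for the real 10 million row data, and the names `nan_gap`, `cols_with_nan` and `missing_per_column` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the large DataFrame in the question
df = pd.DataFrame({
    'A': list('abcdef'),
    'D': [np.nan, 3, 5, 5, 3, 5],
})

# Gap between the two counts: 1 where a column contains NaN, otherwise 0
nan_gap = df.nunique(dropna=False) - df.nunique()
cols_with_nan = nan_gap[nan_gap > 0].index.tolist()

# Number of missing values per column, for a direct confirmation
missing_per_column = df.isna().sum()

print(cols_with_nan)
print(missing_per_column)
```

With this sample only column D is reported, which matches the question, where Value was the only column whose count differed.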

jezrael answered Sep 20 '22 01:09