I have a dataframe in pandas:
In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4
I need to count unique cities. I can count unique countries:
In [11]: df['col_a'].nunique()
Out[11]: 3
and I can try to count unique cities
In [12]: df['col_b'].nunique()
Out[12]: 5
but it is wrong, because Paris in the US and Paris in France are different cities. So now I'm doing it like this:
In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']
In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago
In [15]: df['col_a_b'].nunique()
Out[15]: 6
Is there a better way, without creating an additional column?
You can use the nunique() function to count the number of unique values in pandas. Called on a single column it returns one number; called on the whole dataframe, df.nunique() returns a Series of counts. By default it counts along axis=0, i.e. the number of distinct values in each column; with axis=1 it counts the distinct values in each row.
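For example, a minimal sketch on the question's data (the dataframe is rebuilt here from the output shown above):
import pandas as pd

df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
    'col_c': [3, 4, 5, 3, 8, 44, 12],
    'col_d': [4, 5, 6, 3, 9, 4, 4],
})

# Default axis=0: number of distinct values in each column.
print(df.nunique())
# col_a    3
# col_b    5
# col_c    6
# col_d    5
# dtype: int64
Note that this still counts each column separately, which is why the question needs a pair-wise count, as in the answers below.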
Using groupby and ngroups:
df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
Or using a set of tuples:
len(set(zip(df['col_a'], df['col_b'])))
Out[106]: 6
Or the length of the groupby object itself:
len(df.groupby(['col_a', 'col_b']))
Out[105]: 6
You can select col_a and col_b, drop the duplicates, and then check the shape or length of the resulting dataframe:
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6
len(df[['col_a', 'col_b']].drop_duplicates())
# 6
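For illustration, a minimal sketch of the intermediate result that gets counted (the dataframe is rebuilt here from the question's output):
import pandas as pd

df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
})

# drop_duplicates keeps the first occurrence of each (country, city) pair,
# so only the repeated (US, Chicago) row is dropped.
unique_pairs = df[['col_a', 'col_b']].drop_duplicates()
print(unique_pairs)
#     col_a    col_b
# 0  France    Paris
# 1      UK    Londo
# 2      US  Chicago
# 3      UK  Bristol
# 4      US    Paris
# 5      US   London
print(len(unique_pairs))
# 6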
Because groupby ignores NaNs, and may unnecessarily invoke a sorting process, choose your method accordingly if you have NaNs in the columns. Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})
print(df)
# col_a col_b
#0 1.0 2.0
#1 2.0 2.0
#2 2.0 3.0
#3 NaN NaN
#4 1.0 2.0
#5 4.0 NaN
Timing:
df = pd.concat([df] * 1000)
%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop
%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop
%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop
%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop
%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
Result:
df.groupby(['col_a', 'col_b']).ngroups
# 3
len(df[['col_a', 'col_b']].drop_duplicates())
# 5
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5
len(set(zip(df['col_a'],df['col_b'])))
# 2003
len(df.groupby(['col_a', 'col_b']))
# 2003
So the differences are:
Option 1:
df.groupby(['col_a', 'col_b']).ngroups
is fast, and it excludes rows that contain NaNs.
Option 2 & 3:
len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]
Reasonably fast; NaN is counted as a value in its own right, so rows containing NaNs are included (and identical NaN rows are treated as duplicates).
Option 4 & 5:
len(set(zip(df['col_a'],df['col_b'])))
len(df.groupby(['col_a', 'col_b']))
Slow, and they follow the logic that numpy.nan == numpy.nan is False, so different (nan, nan) rows are all counted as distinct.
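As a side note, if you want the groupby-based count to also include rows with NaNs, newer pandas versions (1.1 and later) accept a dropna flag on groupby. A small sketch under that assumption, reusing the NaN example from above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

# dropna=False (pandas 1.1+) keeps NaN keys as their own group, so the count
# matches drop_duplicates (5) instead of silently dropping those rows (3).
print(df.groupby(['col_a', 'col_b']).ngroups)                # 3
print(df.groupby(['col_a', 'col_b'], dropna=False).ngroups)  # 5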