I have dataframe in pandas:
In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4
I need to count unique cities. I can count unique countries:
In [11]: df['col_a'].nunique()
Out[11]: 3
and I can try to count unique cities
In [12]: df['col_b'].nunique()
Out[12]: 5
but that is wrong, because Paris in the US and Paris in France are different cities. So now I'm doing it like this:
In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']
In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago
In [15]: df['col_a_b'].nunique()
Out[15]: 6
Maybe there is a better way? Without creating an additional column.
You can use the nunique() function to count the number of unique values in a pandas DataFrame. Called on a whole DataFrame, df.nunique() returns a Series of counts. By default it works along axis=0, i.e. down the rows, giving the number of distinct values in each column; pass axis=1 to count the distinct values in each row instead.
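To make the axis behaviour concrete, here is a small self-contained sketch using the question's data (note that this counts distinct values per column or per row, not the (country, city) pairs the question actually asks about):
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
    'col_c': [3, 4, 5, 3, 8, 44, 12],
    'col_d': [4, 5, 6, 3, 9, 4, 4],
})

# Distinct values in each column (axis=0, the default)
df.nunique()
# col_a    3
# col_b    5
# col_c    6
# col_d    5
# dtype: int64

# Distinct values in each row instead
df.nunique(axis=1)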
By using ngroups
df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
Or using set
len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6
In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6
You can select col_a and col_b, drop the duplicates, then check the shape/len of the resulting DataFrame:
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6
len(df[['col_a', 'col_b']].drop_duplicates())
# 6
Because groupby ignores NaN, and may unnecessarily invoke a sorting process, choose your method accordingly if you have NaN values in the columns:
Consider a data frame as following:
import numpy as np

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})
print(df)
#   col_a  col_b
#0    1.0    2.0
#1    2.0    2.0
#2    2.0    3.0
#3    NaN    NaN
#4    1.0    2.0
#5    4.0    NaN
Timing:
df = pd.concat([df] * 1000)
%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop
%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop
%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop    
%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop
%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
Result:
df.groupby(['col_a', 'col_b']).ngroups
# 3
len(df[['col_a', 'col_b']].drop_duplicates())
# 5
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5
len(set(zip(df['col_a'],df['col_b'])))
# 2003
len(df.groupby(['col_a', 'col_b']))
# 2003
So the difference:
Option 1:
df.groupby(['col_a', 'col_b']).ngroups
is fast, and it excludes rows that contain NaNs.
Option 2 & 3:
len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]
Reasonably fast; they treat NaN as a regular value, so duplicated NaN rows collapse to one (hence 5).
Option 4 & 5:
len(set(zip(df['col_a'],df['col_b']))) 
len(df.groupby(['col_a', 'col_b'])) 
Slow, and they follow the logic that numpy.nan == numpy.nan is False, so every row containing NaN is treated as distinct (which is why the count balloons to 2003 once the 6-row frame is repeated 1000 times: 3 unique non-NaN pairs plus 2000 NaN-containing rows).
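If you want the speed of ngroups but need the NaN pairs counted as well, newer pandas lets you keep them. A minimal sketch, assuming pandas >= 1.1 (where groupby gained the dropna parameter), using the same NaN example frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

# Default behaviour: NaN keys are dropped, only fully non-NaN pairs are counted
df.groupby(['col_a', 'col_b']).ngroups
# 3

# dropna=False keeps NaN as a group key, matching drop_duplicates()
df.groupby(['col_a', 'col_b'], dropna=False).ngroups
# 5

# sort=False additionally skips the group sorting step if you only need the count
df.groupby(['col_a', 'col_b'], sort=False, dropna=False).ngroups
# 5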