I have a dataframe in pandas:
In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4
I need to count unique cities. I can count unique countries:
In [11]: df['col_a'].nunique()
Out[11]: 3
and I can try to count unique cities
In [12]: df['col_b'].nunique()
Out[12]: 5
but it is wrong, because Paris in the US and Paris in France are different cities. So now I'm doing it like this:
In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']
In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago
In [15]: df['col_a_b'].nunique()
Out[15]: 6
Is there a better way, without creating an additional column?
You can use the nunique() function to count the number of unique values in pandas. Called on a single column it returns one number; called on the whole dataframe, df.nunique() returns a Series of counts. By default it counts along axis=0, i.e. the number of distinct values in each column; with axis=1 it counts the distinct values in each row.
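For example, a minimal sketch on the question's data (the dataframe is rebuilt here from the output shown above):
import pandas as pd

df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
    'col_c': [3, 4, 5, 3, 8, 44, 12],
    'col_d': [4, 5, 6, 3, 9, 4, 4],
})

# Default axis=0: number of distinct values in each column.
print(df.nunique())
# col_a    3
# col_b    5
# col_c    6
# col_d    5
# dtype: int64
Note that this still counts each column separately, which is why the question needs a pair-wise count, as in the answers below.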
Using groupby and ngroups:
df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6
Or using a set of tuples:
len(set(zip(df['col_a'], df['col_b'])))
Out[106]: 6
Or the length of the groupby object itself:
len(df.groupby(['col_a', 'col_b']))
Out[105]: 6
You can select col_a and col_b, drop the duplicates, and then check the shape or length of the resulting dataframe:
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6
len(df[['col_a', 'col_b']].drop_duplicates())
# 6
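For illustration, a minimal sketch of the intermediate result that gets counted (the dataframe is rebuilt here from the question's output):
import pandas as pd

df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
})

# drop_duplicates keeps the first occurrence of each (country, city) pair,
# so only the repeated (US, Chicago) row is dropped.
unique_pairs = df[['col_a', 'col_b']].drop_duplicates()
print(unique_pairs)
#     col_a    col_b
# 0  France    Paris
# 1      UK    Londo
# 2      US  Chicago
# 3      UK  Bristol
# 4      US    Paris
# 5      US   London
print(len(unique_pairs))
# 6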
Because groupby ignores NaNs, and may unnecessarily invoke a sorting process, choose your method accordingly if you have NaNs in the columns. Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})
print(df)
# col_a col_b
#0 1.0 2.0
#1 2.0 2.0
#2 2.0 3.0
#3 NaN NaN
#4 1.0 2.0
#5 4.0 NaN
Timing:
df = pd.concat([df] * 1000)
%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop
%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop
%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop
%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop
%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop
Result:
df.groupby(['col_a', 'col_b']).ngroups
# 3
len(df[['col_a', 'col_b']].drop_duplicates())
# 5
df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5
len(set(zip(df['col_a'],df['col_b'])))
# 2003
len(df.groupby(['col_a', 'col_b']))
# 2003
So the differences are:
Option 1:
df.groupby(['col_a', 'col_b']).ngroups
is fast, and it excludes rows that contain NaNs.
Option 2 & 3:
len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]
Reasonably fast; NaN is counted as a value in its own right, so rows containing NaNs are included (and identical NaN rows are treated as duplicates).
Option 4 & 5:
len(set(zip(df['col_a'],df['col_b'])))
len(df.groupby(['col_a', 'col_b']))
Slow, and they follow the logic that numpy.nan == numpy.nan is False, so different (nan, nan) rows are all counted as distinct.
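As a side note, if you want the groupby-based count to also include rows with NaNs, newer pandas versions (1.1 and later) accept a dropna flag on groupby. A small sketch under that assumption, reusing the NaN example from above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

# dropna=False (pandas 1.1+) keeps NaN keys as their own group, so the count
# matches drop_duplicates (5) instead of silently dropping those rows (3).
print(df.groupby(['col_a', 'col_b']).ngroups)                # 3
print(df.groupby(['col_a', 'col_b'], dropna=False).ngroups)  # 5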