Count unique values per group with Pandas [duplicate]

You need nunique:

counts = df.groupby('domain')['ID'].nunique()

print(counts)
domain
'facebook.com'    1
'google.com'      1
'twitter.com'     2
'vk.com'          3
Name: ID, dtype: int64
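
To try this end to end, here is a minimal sketch. The sample frame is an assumption: it is not shown in the answer and was picked only so that the counts match the output above.

```python
import pandas as pd

# Assumed sample data (not from the original post), chosen to reproduce
# the per-domain counts printed above.
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["'vk.com'", "'vk.com'", "'twitter.com'", "'vk.com'",
               "'facebook.com'", "'vk.com'", "'google.com'",
               "'twitter.com'", "'vk.com'"],
})

# Number of distinct IDs within each domain group.
# The result is a Series indexed by domain.
unique_ids = df.groupby("domain")["ID"].nunique()
print(unique_ids)
```

Assigning the result to a new name keeps the original frame available for further steps.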

If you need to strip the surrounding ' characters from the domain names:

counts = df.ID.groupby([df.domain.str.strip("'")]).nunique()
print(counts)
domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
Name: ID, dtype: int64

Or as Jon Clements commented:

df.groupby(df.domain.str.strip("'"))['ID'].nunique()
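
A small sketch of the same idea, assuming a tiny hypothetical frame: grouping by a cleaned Series (rather than by a column name) leaves the original column untouched.

```python
import pandas as pd

# Hypothetical data with quoted domain names, for illustration only.
df = pd.DataFrame({
    "ID": [123, 456, 456],
    "domain": ["'vk.com'", "'vk.com'", "'google.com'"],
})

# Strip the quote characters, then group by the cleaned values.
clean_domain = df["domain"].str.strip("'")
counts = df.groupby(clean_domain)["ID"].nunique()
print(counts)

# The original column still contains the quoted strings.
print(df["domain"].tolist())
```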

You can retain the column name like this:

result = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
print(result)
    domain  ID
0       fb   1
1      ggl   1
2  twitter   2
3       vk   3

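The difference is that groupby(...)['ID'].nunique() returns a Series indexed by domain, whereas agg() with as_index=False returns a DataFrame in which domain stays a regular column.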
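
The two return types can be compared side by side. This is only a sketch on hypothetical data; reset_index() is shown as one more way to turn the Series into a DataFrame.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "ID": [123, 123, 456, 789],
    "domain": ["vk", "vk", "vk", "twitter"],
})

# Series: the group key becomes the index.
as_series = df.groupby("domain")["ID"].nunique()

# DataFrame: as_index=False keeps the group key as a regular column.
as_frame = df.groupby("domain", as_index=False).agg({"ID": "nunique"})

# Converting the Series afterwards gives the same two columns.
also_frame = as_series.reset_index()

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame)
```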

In general, to count how many times each distinct value occurs in a single column, you can use Series.value_counts:

df.domain.value_counts()

#'vk.com'          5
#'twitter.com'     2
#'facebook.com'    1
#'google.com'      1
#Name: domain, dtype: int64

To see how many unique values a column contains, use Series.nunique:

df.domain.nunique()
# 4
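
The relationship between the two calls, sketched on a small hypothetical Series: value_counts gives one entry per distinct value, so its length equals nunique.

```python
import pandas as pd

# Hypothetical column, for illustration only.
domains = pd.Series(["vk.com", "vk.com", "twitter.com", "vk.com"], name="domain")

# Rows per distinct value, most frequent first.
occurrences = domains.value_counts()

# Number of distinct values.
n_distinct = domains.nunique()

print(occurrences)
print(n_distinct)
```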

To get the distinct values themselves, you can use unique or drop_duplicates. The slight difference between the two is that unique returns a numpy.array, while drop_duplicates returns a pandas.Series:

df.domain.unique()
# array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)

df.domain.drop_duplicates()
#0          'vk.com'
#2     'twitter.com'
#4    'facebook.com'
#6      'google.com'
#Name: domain, dtype: object
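
A short sketch of that difference on hypothetical data. The column is created with dtype=object on purpose, which is an assumption of this example: for other dtypes unique may return a pandas array type instead of a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype column, for illustration only.
domains = pd.Series(
    ["vk.com", "vk.com", "twitter.com", "vk.com", "facebook.com"],
    name="domain",
    dtype=object,
)

# Distinct values in order of first appearance, without index labels.
as_array = domains.unique()

# Same values, but as a Series that keeps the original row labels.
as_series = domains.drop_duplicates()

print(isinstance(as_array, np.ndarray))
print(as_series.index.tolist())
```

The retained index is useful when you need to know where each value first appeared.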

As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby approach in the other answers here, you can also simply drop duplicate rows first and then call value_counts():

import pandas as pd
df.drop_duplicates().domain.value_counts()

# 'vk.com'          3
# 'twitter.com'     2
# 'facebook.com'    1
# 'google.com'      1
# Name: domain, dtype: int64
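
One caveat worth sketching: drop_duplicates() without arguments compares entire rows, so this shortcut gives distinct-ID counts only when the frame holds nothing but the ID and domain columns. The frame below, including its extra visit column, is hypothetical.

```python
import pandas as pd

# Hypothetical frame with an extra column besides ID and domain.
df = pd.DataFrame({
    "ID": [123, 123, 456],
    "domain": ["vk.com", "vk.com", "vk.com"],
    "visit": [1, 2, 1],
})

# Whole-row comparison: all three rows differ, so nothing is dropped.
naive = df.drop_duplicates()["domain"].value_counts()

# Restricting the comparison to (domain, ID) pairs gives distinct IDs per domain.
by_pair = df.drop_duplicates(subset=["domain", "ID"])["domain"].value_counts()

print(naive)
print(by_pair)
```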

Note that value_counts on its own counts the rows per domain, not the distinct IDs:

>>> df.domain.value_counts()
vk.com          5
twitter.com     2
google.com      1
facebook.com    1
Name: domain, dtype: int64

If I understand correctly, you want the number of different IDs for every domain. Then you can try this:

output = df.drop_duplicates()
output.groupby('domain').size()

Output:

domain
facebook.com    1
google.com      1
twitter.com     2
vk.com          3
dtype: int64

You could also use value_counts, which is slightly slower. The fastest, though, is Jezrael's answer using nunique:

%timeit df.drop_duplicates().groupby('domain').size()
1000 loops, best of 3: 939 µs per loop
%timeit df.drop_duplicates().domain.value_counts()
1000 loops, best of 3: 1.1 ms per loop
%timeit df.groupby('domain')['ID'].nunique()
1000 loops, best of 3: 440 µs per loop
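
Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The sample frame is an assumption (it is not part of the answer), and the timings will depend on your machine, pandas version, and data size, so the ranking above may not carry over to larger frames.

```python
import timeit

import pandas as pd

# Assumed sample data (not from the original post).
df = pd.DataFrame({
    "ID": [123, 123, 123, 456, 456, 456, 456, 789, 789],
    "domain": ["vk.com", "vk.com", "twitter.com", "vk.com",
               "facebook.com", "vk.com", "google.com",
               "twitter.com", "vk.com"],
})

candidates = {
    "drop_duplicates + size": lambda: df.drop_duplicates().groupby("domain").size(),
    "drop_duplicates + value_counts": lambda: df.drop_duplicates()["domain"].value_counts(),
    "groupby + nunique": lambda: df.groupby("domain")["ID"].nunique(),
}

# First confirm that all three approaches agree on the counts.
results = {name: fn().sort_index().to_dict() for name, fn in candidates.items()}
assert len({tuple(sorted(r.items())) for r in results.values()}) == 1

# Then time each one; no particular numbers are claimed here.
runs = 200
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=runs)
    print(f"{name}: {seconds / runs * 1e6:.0f} us per call")
```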
