Pandas aggregate count distinct

People also ask

How do you get a distinct count in pandas?

You can count distinct values (the equivalent of SQL COUNT(DISTINCT ...)) in pandas using methods such as DataFrame.groupby(), nunique(), and DataFrame.agg().

What does aggregate function do in pandas?

Similar to SQL, pandas supports multiple aggregate functions. An aggregate function performs a calculation on a set of values (grouped data) and returns a single summary value, so the values of multiple rows are collapsed into one.
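As a minimal sketch of that idea, the snippet below applies several built-in aggregates to one grouped column. The `sales` frame and its column names are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "amount": [10, 20, 5, 5, 15],
})

# Each aggregate collapses all rows of a group into one summary value.
# "count" counts rows, while "nunique" counts distinct values.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count", "nunique"])
print(summary)
```

Here "south" has three rows but only two distinct amounts, which is the difference between a plain count and a distinct count.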

How about either of:

>>> import numpy as np
>>> import pandas as pd
>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date                         
2013-04-01        65        2
2013-04-02        45        1

'nunique' is an option for .agg() since pandas 0.20.0, so:

df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
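For a self-contained version of the call above, here is a sketch that rebuilds the sample frame from the question and runs the string-based aggregation. The second call uses named aggregation (available since pandas 0.25) to rename the output columns; the names `total_duration` and `distinct_users` are my own choice, not from the original answers.

```python
import pandas as pd

# Sample data copied from the question; user_id is kept as strings
# so the leading zeros survive.
df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

# String aliases select pandas' built-in grouped implementations.
result = df.groupby("date").agg({"duration": "sum", "user_id": "nunique"})
print(result)

# Same computation with named aggregation (illustrative column names).
named = df.groupby("date").agg(
    total_duration=("duration", "sum"),
    distinct_users=("user_id", "nunique"),
)
print(named)
```

Note that nunique ignores missing values by default, so a NaN user_id would not be counted as a distinct user.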

Adding to the answers already given: the solution using the string "nunique" seems much faster. Tested here on a ~21M-row dataframe, grouped down to ~2M rows:

%time _=g.agg({"id": lambda x: x.nunique()})
CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
Wall time: 3min 20s

%time _=g.agg({"id": pd.Series.nunique})
CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
Wall time: 3min 18s

%time _=g.agg({"id": "nunique"})
CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
Wall time: 24.4 s
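To try the comparison yourself, a rough sketch along these lines may help. It uses much smaller synthetic data than the test above (the sizes, the seed, and the `timed` helper are assumptions of mine), so absolute timings will differ and the gap may be smaller or larger on your machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data; sizes are arbitrary and far smaller than the ~21M rows above.
rng = np.random.default_rng(0)
n_rows, n_groups = 100_000, 10_000
frame = pd.DataFrame({
    "key": rng.integers(0, n_groups, size=n_rows),
    "id": rng.integers(0, 1_000, size=n_rows),
})
g = frame.groupby("key")


def timed(func):
    """Hypothetical helper: run func once, return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = func()
    return out, time.perf_counter() - start


by_lambda, t_lambda = timed(lambda: g.agg({"id": lambda x: x.nunique()}))
by_method, t_method = timed(lambda: g.agg({"id": pd.Series.nunique}))
by_string, t_string = timed(lambda: g.agg({"id": "nunique"}))

# All three spellings should give the same distinct counts per group.
assert by_lambda["id"].tolist() == by_string["id"].tolist()
assert by_method["id"].tolist() == by_string["id"].tolist()

print(f"lambda:            {t_lambda:.3f} s")
print(f"pd.Series.nunique: {t_method:.3f} s")
print(f"'nunique' string:  {t_string:.3f} s")
```

A single run like this is only indicative; for anything more careful, repeat the measurement several times, for example with %timeit in IPython.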

