I have a DataFrame with two columns and a little over one hundred thousand rows.
In [43]: df.head(10)
Out[43]:
localtime ref
4 2014-04-02 12:00:00.273537 139058754703810577
5 2014-04-02 12:00:02.223501 139058754703810576
6 2014-04-02 12:00:03.518817 139058754703810576
7 2014-04-02 12:00:03.572082 139058754703810576
8 2014-04-02 12:00:03.572444 139058754703810576
9 2014-04-02 12:00:03.572571 139058754703810576
10 2014-04-02 12:00:03.573320 139058754703810576
11 2014-04-02 12:00:09.278517 139058754703810576
14 2014-04-02 12:00:20.942802 139058754703810577
15 2014-04-02 12:01:13.410607 139058754703810576
[10 rows x 2 columns]
In [44]: df.dtypes
Out[44]:
localtime datetime64[ns]
ref int64
dtype: object
In [45]: len(df)
Out[45]: 111743
In [46]: g = df.groupby('ref')
If I request the last element of each group, the function just hangs!
In [47]: %timeit g.last()
I killed it after 6 minutes; top showed the CPU at 100% the entire time.
If I request the localtime column explicitly, it at least returns, though it still seems absurdly slow for how few elements there are.
In [48]: %timeit g['localtime'].last()
1 loops, best of 3: 4.6 s per loop
Is there something I'm missing? This is pandas 0.13.1.
This issue appears with the datetime64 type. Suppose I read directly from a file:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('so.csv')
In [3]: df.dtypes
Out[3]:
localtime object
ref int64
dtype: object
In [4]: %timeit df.groupby('ref').last()
10 loops, best of 3: 28.1 ms per loop
The object type works just fine. However, all hell breaks loose if I cast my timestamp:
In [5]: df.localtime = pd.to_datetime(df.localtime)
In [6]: df.dtypes
Out[6]:
localtime datetime64[ns]
ref int64
dtype: object
In [7]: %timeit df.groupby('ref').last()
This hangs again, just like before. The plot thickens.
Reproducing without a data file, using Jeff's suggestion:
In [70]: rng = pd.date_range('20130101',periods=20,freq='s')
In [71]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))
In [72]: %timeit df.groupby('value').last()
1 loops, best of 3: 332 ms per loop
However, if I change the range of random integers, then the problem occurs again!
In [73]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100000,size=100000)*1000))
In [74]: %timeit df.groupby('value').last()
I simply increased the high parameter of the second randint() call, which means the groupby() will produce far more groups. This reproduces my error without a data file.
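For illustration (my own check, not from the original session), counting the distinct keys shows that only the number of groups changes between the two cases:

import numpy as np
import pandas as pd

rng = pd.date_range('20130101', periods=20, freq='s')

# The fast case: at most 100 distinct group keys.
few = pd.DataFrame(dict(timestamp=rng.take(np.random.randint(0, 20, size=100000)),
                        value=np.random.randint(0, 100, size=100000) * 1000000))

# The slow case: up to 100,000 distinct keys.
many = pd.DataFrame(dict(timestamp=rng.take(np.random.randint(0, 20, size=100000)),
                         value=np.random.randint(0, 100000, size=100000) * 1000))

print(few['value'].nunique())    # roughly 100 groups
print(many['value'].nunique())   # tens of thousands of groups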
Note that if I forgo datetime64 types, then there is no problem:
In [12]: df = pd.DataFrame(dict(timestamp = np.random.randint(0,20,size=100000), value = np.random.randint(0,100000,size=100000)*1000))
In [13]: %timeit df.groupby('value').last()
100 loops, best of 3: 14.4 ms per loop
So the culprit is the poor scaling of last() on datetime64 columns.
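One way to sanity-check that the dtype, not the data volume, is to blame (a sketch of my own, not part of the original post, assuming astype('int64') yields nanoseconds since the epoch for a datetime64[ns] column) is to group on an integer view of the timestamps and convert back afterwards:

# Group on an int64 view of the timestamps instead of the datetime64 column.
df['ts_int'] = df['timestamp'].astype('int64')     # nanoseconds since the epoch (assumed)
last_int = df.groupby('value')['ts_int'].last()    # plain int64, so last() stays fast
last_ts = pd.to_datetime(last_int)                 # back to datetime64 timestamps

This mirrors the observation above that last() is quick as long as no datetime64 column is involved.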
Must be something odd going on... it looks OK in 0.13.1 (and master). Post a link to your file and I'll take a look.
In [3]: rng = date_range('20130101',periods=20,freq='s')
In [4]: df = DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
timestamp 100000 non-null datetime64[ns]
value 100000 non-null int64
dtypes: datetime64[ns](1), int64(1)
In [6]: %timeit df.groupby('value')['timestamp'].last()
100 loops, best of 3: 9.07 ms per loop
In [7]: %timeit df.groupby('value')['timestamp'].tail(1)
100 loops, best of 3: 16.3 ms per loop
Ok here's the explanation:
Using np.random.randint(0,100,size=100000) for value creates 100 groups, while np.random.randint(0,100000,size=100000) creates a lot more (around 63,000 in my example).
.last (in < 0.14) implicitly takes the last of the non-NaN values. This NaN testing is not cheap, and it is done in Python space for each group, so it scales poorly with the number of groups.
tail(1), on the other hand (in < 0.14), does NOT do this check, so performance is much better (it uses a Cython routine to get the results).
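Roughly speaking, last() behaves like the following pure-pandas sketch (semantics only, not the actual implementation, and ignoring all-NaN groups), which is where the per-group cost comes from:

# last(): the last non-NaN value in each group.
last_like = df.groupby('value')['timestamp'].apply(lambda s: s.dropna().iloc[-1])

# tail(1): simply the last row of each group, with no NaN filtering.
tail_like = df.groupby('value')['timestamp'].tail(1)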
In 0.14 these will be the same. Even nth(-1, dropna='any'), which replicates what last() is doing here, is implemented in a way that has much better performance (thanks @Andy Hayden).
Bottom line: use tail(1) in < 0.14.
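For the original ref/localtime frame from the question, a minimal sketch of that workaround (my own illustration; note that tail(1) keeps the original row index, so you may want to re-key the result by the group column, which last() does for you):

# Take the last row of each 'ref' group without the NaN check that last() performs.
tails = df.groupby('ref').tail(1)                       # one row per group, original index kept
last_localtime = tails.set_index('ref')['localtime']    # indexed by group key, like last()

set_index('ref') is safe here because tail(1) leaves exactly one row per group.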