
Performance issues with groupby's last in pandas




I have a DataFrame with two columns and a little over one hundred thousand rows.

In [43]: df.head(10)
                    localtime                 ref
4  2014-04-02 12:00:00.273537  139058754703810577
5  2014-04-02 12:00:02.223501  139058754703810576
6  2014-04-02 12:00:03.518817  139058754703810576
7  2014-04-02 12:00:03.572082  139058754703810576
8  2014-04-02 12:00:03.572444  139058754703810576
9  2014-04-02 12:00:03.572571  139058754703810576
10 2014-04-02 12:00:03.573320  139058754703810576
11 2014-04-02 12:00:09.278517  139058754703810576
14 2014-04-02 12:00:20.942802  139058754703810577
15 2014-04-02 12:01:13.410607  139058754703810576

[10 rows x 2 columns]

In [44]: df.dtypes
localtime    datetime64[ns]
ref                   int64
dtype: object

In [45]: len(df)
Out[45]: 111743

In [46]: g = df.groupby('ref')

If I request the last element from my group, the function just hangs!

In [47]: %timeit g.last()

I killed it after 6 minutes; top shows the CPU at 100% the entire time.

If I request the localtime column explicitly, this will at least return, though it still seems absurdly slow for how few elements there are.

In [48]: %timeit g['localtime'].last()
1 loops, best of 3: 4.6 s per loop

Is there something I'm missing? This is pandas 0.13.1.

This issue appears with the datetime64 type. Suppose I read directly from a file:

In [1]: import pandas as pd

In [2]: df = pd.read_csv('so.csv')

In [3]: df.dtypes
localtime    object
ref           int64
dtype: object

In [4]: %timeit df.groupby('ref').last()
10 loops, best of 3: 28.1 ms per loop

The object type works just fine. However, all hell breaks loose if I cast my timestamp:

In [5]: df.localtime = pd.to_datetime(df.localtime)

In [6]: df.dtypes
localtime    datetime64[ns]
ref                   int64
dtype: object

In [7]: %timeit df.groupby('ref').last()

The plot thickens.

Reproducing without a data file, using Jeff's suggestion:

In [70]: rng = pd.date_range('20130101',periods=20,freq='s')

In [71]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))

In [72]: %timeit df.groupby('value').last()
1 loops, best of 3: 332 ms per loop

However, if I change the range of random integers, then the problem occurs again!

In [73]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100000,size=100000)*1000))

In [74]: %timeit df.groupby('value').last()

I simply increased the high parameter of the second randint(), which means the groupby() will produce many more groups. This reproduces my problem without a data file.

Note that if I forgo datetime64 types, then there is no problem:

In [12]: df = pd.DataFrame(dict(timestamp = np.random.randint(0,20,size=100000), value = np.random.randint(0,100000,size=100000)*1000))

In [13]: %timeit df.groupby('value').last()
100 loops, best of 3: 14.4 ms per loop

So the culprit is how last() scales with the number of groups when a datetime64 column is involved.

chrisaycock asked Apr 03 '14 18:04


1 Answer

Something odd must be going on; this looks OK in 0.13.1 (and master). Post a link to your file and I'll take a look.

In [3]: rng = date_range('20130101',periods=20,freq='s')

In [4]: df = DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
timestamp    100000 non-null datetime64[ns]
value        100000 non-null int64
dtypes: datetime64[ns](1), int64(1)

In [6]: %timeit df.groupby('value')['timestamp'].last()
100 loops, best of 3: 9.07 ms per loop

In [7]: %timeit df.groupby('value')['timestamp'].tail(1)
100 loops, best of 3: 16.3 ms per loop

Ok here's the explanation:

Using np.random.randint(0,100,size=100000) for value creates 100 groups, while np.random.randint(0,100000,size=100000) creates a lot more (about 63000 in my example).
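As a rough check of those group counts, here is a minimal sketch that counts the distinct values each randint call produces. The seed and the variable names (few, many, n_few, n_many) are illustrative assumptions, not part of the original session.

```python
import numpy as np

# Fixed seed so the counts are repeatable (the seed value is arbitrary).
rs = np.random.RandomState(0)

# Same draws as in the two reproductions: a narrow and a wide range of keys.
few = rs.randint(0, 100, size=100000)
many = rs.randint(0, 100000, size=100000)

# Each distinct value becomes one group in groupby('value').
n_few = np.unique(few).size
n_many = np.unique(many).size

print(n_few, n_many)
```

Drawing N values uniformly from N possibilities is expected to hit about N * (1 - 1/e) distinct values, which for N = 100000 is roughly 63000 and is consistent with the figure quoted above.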

.last (in < 0.14) implicitly takes the last of the non-NaN values. This NA testing is not cheap, and it is done in Python space for each group, so it scales poorly as the number of groups grows.

tail(1), on the other hand (in < 0.14), does NOT check for this, so performance is much better (it also uses a Cython routine to get the results).
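The semantic difference between the two calls can be seen on a tiny frame. This is a sketch with made-up data (the key and val columns are hypothetical), and it only illustrates the results, not the timings.

```python
import numpy as np
import pandas as pd

# Group "a" ends with a NaN; group "b" does not.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "val": [1.0, np.nan, 3.0, 4.0],
})

g = df.groupby("key")["val"]

# last() skips NaN: it returns the last non-NaN value per group,
# indexed by the group key.
last_vals = g.last()

# tail(1) returns the last row of each group as-is (NaN included),
# and keeps the original row index instead of the group key.
tail_vals = g.tail(1)

print(last_vals)
print(tail_vals)
```

So the two only agree when the last row of every group is non-NaN, and tail(1) needs re-indexing if the result should be keyed by group.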

In 0.14 these will be the same. Even if you write nth(-1,dropna='any'), which replicates what last is doing here, it is implemented in a way that performs much better (thanks @Andy Hayden).

Bottom line: use tail(1) in pandas < 0.14.
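A sketch of the workaround applied to the many-groups reproduction from the question. The set_index/sort_index step is my own addition to make the tail(1) result line up with what last() returns; it assumes the data has no NaN values, in which case both approaches select the same rows.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)  # arbitrary seed, for repeatability only
rng = pd.date_range("20130101", periods=20, freq="s")

df = pd.DataFrame({
    "timestamp": rng.take(rs.randint(0, 20, size=100000)),
    "value": rs.randint(0, 100000, size=100000) * 1000,
})

# Workaround: take the last row of each group, then index by the group key
# and sort so the layout matches the output of last().
via_tail = df.groupby("value").tail(1).set_index("value").sort_index()

# Reference result; this is the call that was slow in pandas < 0.14.
via_last = df.groupby("value").last()

print(via_tail.head())
```

If the last row of a group can contain NaN, the two results differ, because tail(1) keeps the NaN while last() looks further back for a non-NaN value.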

Jeff answered Oct 25 '22 23:10
