I have a DataFrame with two columns and a little over one hundred thousand rows.
In [43]: df.head(10)
Out[43]:
localtime ref
4 2014-04-02 12:00:00.273537 139058754703810577
5 2014-04-02 12:00:02.223501 139058754703810576
6 2014-04-02 12:00:03.518817 139058754703810576
7 2014-04-02 12:00:03.572082 139058754703810576
8 2014-04-02 12:00:03.572444 139058754703810576
9 2014-04-02 12:00:03.572571 139058754703810576
10 2014-04-02 12:00:03.573320 139058754703810576
11 2014-04-02 12:00:09.278517 139058754703810576
14 2014-04-02 12:00:20.942802 139058754703810577
15 2014-04-02 12:01:13.410607 139058754703810576
[10 rows x 2 columns]
In [44]: df.dtypes
Out[44]:
localtime datetime64[ns]
ref int64
dtype: object
In [45]: len(df)
Out[45]: 111743
In [46]: g = df.groupby('ref')
If I request the last element of each group, the function just hangs!
In [47]: %timeit g.last()
I killed it after 6 minutes; top showed the CPU at 100% the entire time.
If I request the localtime column explicitly, it at least returns, though it still seems absurdly slow for how few elements there are.
In [48]: %timeit g['localtime'].last()
1 loops, best of 3: 4.6 s per loop
Is there something I'm missing? This is pandas 0.13.1.
This issue appears with the datetime64 type. Suppose I read directly from a file:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('so.csv')
In [3]: df.dtypes
Out[3]:
localtime object
ref int64
dtype: object
In [4]: %timeit df.groupby('ref').last()
10 loops, best of 3: 28.1 ms per loop
The object type works just fine. However, all hell breaks loose if I cast my timestamp:
In [5]: df.localtime = pd.to_datetime(df.localtime)
In [6]: df.dtypes
Out[6]:
localtime datetime64[ns]
ref int64
dtype: object
In [7]: %timeit df.groupby('ref').last()
This hangs again, just like before. The plot thickens.
Reproducing without a data file, using Jeff's suggestion:
In [70]: rng = pd.date_range('20130101',periods=20,freq='s')
In [71]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))
In [72]: %timeit df.groupby('value').last()
1 loops, best of 3: 332 ms per loop
However, if I change the range of random integers, then the problem occurs again!
In [73]: df = pd.DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100000,size=100000)*1000))
In [74]: %timeit df.groupby('value').last()
I simply increased the high parameter of the second randint() call, which means the groupby() will produce far more groups. This reproduces my error without a data file.
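For illustration (my own check, not from the original session), counting the distinct keys shows that only the number of groups changes between the two cases:

import numpy as np
import pandas as pd

rng = pd.date_range('20130101', periods=20, freq='s')

# The fast case: at most 100 distinct group keys.
few = pd.DataFrame(dict(timestamp=rng.take(np.random.randint(0, 20, size=100000)),
                        value=np.random.randint(0, 100, size=100000) * 1000000))

# The slow case: up to 100,000 distinct keys.
many = pd.DataFrame(dict(timestamp=rng.take(np.random.randint(0, 20, size=100000)),
                         value=np.random.randint(0, 100000, size=100000) * 1000))

print(few['value'].nunique())    # roughly 100 groups
print(many['value'].nunique())   # tens of thousands of groups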
Note that if I forgo datetime64 types, then there is no problem:
In [12]: df = pd.DataFrame(dict(timestamp = np.random.randint(0,20,size=100000), value = np.random.randint(0,100000,size=100000)*1000))
In [13]: %timeit df.groupby('value').last()
100 loops, best of 3: 14.4 ms per loop
So the culprit is the poor scaling of last() on datetime64 columns.
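One way to sanity-check that the dtype, not the data volume, is to blame (a sketch of my own, not part of the original post, assuming astype('int64') yields nanoseconds since the epoch for a datetime64[ns] column) is to group on an integer view of the timestamps and convert back afterwards:

# Group on an int64 view of the timestamps instead of the datetime64 column.
df['ts_int'] = df['timestamp'].astype('int64')     # nanoseconds since the epoch (assumed)
last_int = df.groupby('value')['ts_int'].last()    # plain int64, so last() stays fast
last_ts = pd.to_datetime(last_int)                 # back to datetime64 timestamps

This mirrors the observation above that last() is quick as long as no datetime64 column is involved.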
Must be something odd going on... it looks OK in 0.13.1 (and master). Post a link to your file and I'll take a look.
In [3]: rng = date_range('20130101',periods=20,freq='s')
In [4]: df = DataFrame(dict(timestamp = rng.take(np.random.randint(0,20,size=100000)), value = np.random.randint(0,100,size=100000)*1000000))
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
timestamp 100000 non-null datetime64[ns]
value 100000 non-null int64
dtypes: datetime64[ns](1), int64(1)
In [6]: %timeit df.groupby('value')['timestamp'].last()
100 loops, best of 3: 9.07 ms per loop
In [7]: %timeit df.groupby('value')['timestamp'].tail(1)
100 loops, best of 3: 16.3 ms per loop
Ok here's the explanation:
Using np.random.randint(0,100,size=100000) for value creates 100 groups, while np.random.randint(0,100000,size=100000) creates a lot more (around 63,000 in my example).
.last (in < 0.14) implicitly takes the last of the non-NaN values. This NaN testing is not cheap, and it is done in Python space for each group, so it scales poorly with the number of groups.
tail(1), on the other hand (in < 0.14), does NOT do this check, so performance is much better (it uses a Cython routine to get the results).
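Roughly speaking, last() behaves like the following pure-pandas sketch (semantics only, not the actual implementation, and ignoring all-NaN groups), which is where the per-group cost comes from:

# last(): the last non-NaN value in each group.
last_like = df.groupby('value')['timestamp'].apply(lambda s: s.dropna().iloc[-1])

# tail(1): simply the last row of each group, with no NaN filtering.
tail_like = df.groupby('value')['timestamp'].tail(1)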
In 0.14 these will be the same. Even nth(-1, dropna='any'), which replicates what last() is doing here, is implemented in a way that has much better performance (thanks @Andy Hayden).
Bottom line: use tail(1) in < 0.14.
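For the original ref/localtime frame from the question, a minimal sketch of that workaround (my own illustration; note that tail(1) keeps the original row index, so you may want to re-key the result by the group column, which last() does for you):

# Take the last row of each 'ref' group without the NaN check that last() performs.
tails = df.groupby('ref').tail(1)                       # one row per group, original index kept
last_localtime = tails.set_index('ref')['localtime']    # indexed by group key, like last()

set_index('ref') is safe here because tail(1) leaves exactly one row per group.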