I have a panel in pandas and am trying to calculate the amount of time that an individual spends in each stage. To give a better sense of this, my dataset is as follows:
group date stage
A 2014-01-01 one
A 2014-01-03 one
A 2014-01-04 one
A 2014-01-05 two
B 2014-01-02 four
B 2014-01-06 five
B 2014-01-10 five
C 2014-01-03 two
C 2014-01-05 two
I'm looking to calculate stage duration to give:
group date stage dur
A 2014-01-01 one 0
A 2014-01-03 one 2
A 2014-01-04 one 3
A 2014-01-05 two 0
B 2014-01-02 four 0
B 2014-01-06 five 0
B 2014-01-10 five 4
C 2014-01-03 two 0
C 2014-01-05 two 2
The method that I'm using below is extremely slow. Any ideas on a quicker method?
df['stage_duration'] = df.groupby(['group', 'stage']).date.apply(lambda y: (y - y.iloc[0])).apply(lambda y: y / np.timedelta64(1, 'D'))
There are several ways to calculate the time difference between two dates in Python using Pandas. The first is to subtract one date from the other. This returns a timedelta such as 0 days 05:00:00 that tells us the number of days, hours, minutes, and seconds between the two dates.
Pandas Timestamp objects can be compared with the usual comparison operators: >, <, ==, <=, >=. The difference between two timestamps is calculated with the '-' operator. A given time can be converted to a pandas timestamp with pandas.Timestamp().
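As a minimal sketch of the points above (the two timestamps are made-up values, chosen only for illustration), subtracting one Timestamp from another gives a Timedelta that can be read off in whole days, in seconds, or as a fractional number of days:

```python
import pandas as pd

# Hypothetical timestamps, not taken from the question's data
start = pd.Timestamp("2014-01-01 08:00")
end = pd.Timestamp("2014-01-03 13:00")

is_later = end > start                    # ordinary comparison operators work
delta = end - start                       # a pandas Timedelta (2 days, 5 hours)

whole_days = delta.days                   # integer day component only
seconds = delta.total_seconds()           # full duration in seconds
frac_days = delta / pd.Timedelta(days=1)  # duration as a float number of days
```

Dividing by a one-day Timedelta is the scalar counterpart of the `/ np.timedelta64(1, 'D')` used in the question.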
The diff() function calculates the difference between consecutive elements of a Series or DataFrame.
The rolling() function provides rolling window calculations over the data in a Series. Syntax: Series.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)
Based on your code (your groupby/apply), it looks like you're working with a 'date' column that is a datetime64 dtype and not an integer dtype in your actual data, despite your example (but maybe I misunderstand what you want, in which case what Andy did would be the best idea). Also, it looks like you want to compute the change in days as measured from the first observation of a given group/stage. I think this is a better set of example data (if I understand your goal correctly):
>>> df
group date stage dur
0 A 2014-01-01 one 0
1 A 2014-01-03 one 2
2 A 2014-01-04 one 3
3 A 2014-01-05 two 0
4 B 2014-01-02 four 0
5 B 2014-01-06 five 0
6 B 2014-01-10 five 4
7 C 2014-01-03 two 0
8 C 2014-01-05 two 2
Given that, you should get some speed-up from just modifying your apply (as Jeff suggests in his comment) by dividing through by the timedelta64 in a vectorized way after the apply (or you could do it in the apply):
>>> df['dur'] = df.groupby(['group','stage']).date.apply(lambda x: x - x.iloc[0])
>>> df['dur'] /= np.timedelta64(1,'D')
>>> df
group date stage dur
0 A 2014-01-01 one 0
1 A 2014-01-03 one 2
2 A 2014-01-04 one 3
3 A 2014-01-05 two 0
4 B 2014-01-02 four 0
5 B 2014-01-06 five 0
6 B 2014-01-10 five 4
7 C 2014-01-03 two 0
8 C 2014-01-05 two 2
But you can also avoid the groupby/apply, given your data is in group, stage, date order. The first date for every ['group','stage'] grouping happens when either the group changes or the stage changes. So I think you can do something like the following:
>>> beg = (df.group != df.group.shift(1)) | (df.stage != df.stage.shift(1))
>>> df['dur'] = (df['date'] - df['date'].where(beg).ffill())/np.timedelta64(1,'D')
>>> df
group date stage dur
0 A 2014-01-01 one 0
1 A 2014-01-03 one 2
2 A 2014-01-04 one 3
3 A 2014-01-05 two 0
4 B 2014-01-02 four 0
5 B 2014-01-06 five 0
6 B 2014-01-10 five 4
7 C 2014-01-03 two 0
8 C 2014-01-05 two 2
Explanation: Note what df['date'].where(beg) creates:
>>> beg = (df.group != df.group.shift(1)) | (df.stage != df.stage.shift(1))
>>> df['date'].where(beg)
0 2014-01-01
1 NaT
2 NaT
3 2014-01-05
4 2014-01-02
5 2014-01-06
6 NaT
7 2014-01-03
8 NaT
And then I ffill the values and take the difference with the 'date' column.
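For anyone who wants to run the shift/where/ffill steps end to end, here is a self-contained sketch. It assumes the sample data from the question, with 'date' parsed to datetime64 and the rows already ordered so that each (group, stage) run is contiguous; the intermediate name `start` is introduced here only for readability.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with 'date' as datetime64
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "date": pd.to_datetime([
        "2014-01-01", "2014-01-03", "2014-01-04", "2014-01-05",
        "2014-01-02", "2014-01-06", "2014-01-10",
        "2014-01-03", "2014-01-05",
    ]),
    "stage": ["one", "one", "one", "two", "four", "five", "five", "two", "two"],
})

# True on the first row of each contiguous (group, stage) run
beg = (df["group"] != df["group"].shift()) | (df["stage"] != df["stage"].shift())

# Keep the date only on those first rows, then carry it forward over the run
start = df["date"].where(beg).ffill()

# Days elapsed since the start of the current run, as floats
df["dur"] = (df["date"] - start) / np.timedelta64(1, "D")
```

If the rows are not already ordered this way, the run detection breaks down, so sorting first (or falling back to a groupby) would be needed.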
Edit: As Andy points out, you could also use transform:
>>> df['dur'] = df.date - df.groupby(['group','stage']).date.transform(lambda x: x.iloc[0])
>>> df['dur'] /= np.timedelta64(1,'D')
>>> df
group date stage dur
0 A 2014-01-01 one 0
1 A 2014-01-03 one 2
2 A 2014-01-04 one 3
3 A 2014-01-05 two 0
4 B 2014-01-02 four 0
5 B 2014-01-06 five 0
6 B 2014-01-10 five 4
7 C 2014-01-03 two 0
8 C 2014-01-05 two 2
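A runnable sketch of the transform variant, again assuming the question's sample data with 'date' as datetime64. Note that it swaps the lambda for the built-in 'first' aggregation, which is my substitution and not what was timed in this answer; it should give the same values while avoiding a Python-level function call per group.

```python
import pandas as pd

# Sample data from the question, with 'date' as datetime64
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "date": pd.to_datetime([
        "2014-01-01", "2014-01-03", "2014-01-04", "2014-01-05",
        "2014-01-02", "2014-01-06", "2014-01-10",
        "2014-01-03", "2014-01-05",
    ]),
    "stage": ["one", "one", "one", "two", "four", "five", "five", "two", "two"],
})

# First date of each (group, stage), broadcast back to every row of the group
first_date = df.groupby(["group", "stage"])["date"].transform("first")

# Whole days elapsed since that first date
df["dur"] = (df["date"] - first_date).dt.days
```

Unlike the shift-based version, this does not depend on the rows of a group being adjacent, only on the first row of each group being its earliest date.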
Speed: I timed the two methods using a similar dataframe with 400,000 observations:
Apply method:
1 loops, best of 3: 18.3 s per loop
Non-apply method:
1 loops, best of 3: 1.64 s per loop
So I think avoiding the apply could give some significant speed-ups.
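To try a comparison like this yourself, something along these lines might serve as a starting point. Everything about the synthetic frame (its size, the number of groups and stages, the random seed) is an assumption for illustration, not the data behind the timings above, and a per-group lambda in transform stands in for the apply-style approach. Actual numbers will vary with the pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 20,000 rows, 500 groups, 5 stages
rng = np.random.default_rng(0)
n = 20_000
bench = pd.DataFrame({
    "group": rng.integers(0, 500, n),
    "stage": rng.integers(0, 5, n),
    "date": pd.Timestamp("2014-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
}).sort_values(["group", "stage", "date"], ignore_index=True)

def with_groupby(frame):
    # Python-level function called once per (group, stage)
    first = frame.groupby(["group", "stage"])["date"].transform(lambda x: x.iloc[0])
    return (frame["date"] - first) / np.timedelta64(1, "D")

def with_shift(frame):
    # Fully vectorized; relies on the frame being sorted as above
    beg = (frame["group"] != frame["group"].shift()) | (frame["stage"] != frame["stage"].shift())
    return (frame["date"] - frame["date"].where(beg).ffill()) / np.timedelta64(1, "D")

res_groupby = with_groupby(bench)
res_shift = with_shift(bench)

t_groupby = timeit.timeit(lambda: with_groupby(bench), number=3)
t_shift = timeit.timeit(lambda: with_shift(bench), number=3)
print(f"groupby + lambda: {t_groupby:.3f}s, shift/where/ffill: {t_shift:.3f}s")
```

Checking that both functions return the same values before timing them is worth the extra line, since the shift-based version silently gives wrong answers on unsorted data.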
I think I'd use diff here:
In [11]: df.groupby('stage')['date'].diff().fillna(0)
Out[11]:
0 0
1 2
2 0
3 0
4 0
5 4
dtype: float64
(Assuming that the stages are contiguous.)
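One thing to watch for, sketched below on the assumption that 'date' is datetime64 and that grouping is by both 'group' and 'stage': diff gives the gap to the previous row, not the time since the first row of the stage. The two only differ once a stage has three or more rows, as group A, stage one does in the question's data. A grouped cumulative sum of the gaps recovers the elapsed time.

```python
import pandas as pd

# Sample data from the question, with 'date' as datetime64
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "date": pd.to_datetime([
        "2014-01-01", "2014-01-03", "2014-01-04", "2014-01-05",
        "2014-01-02", "2014-01-06", "2014-01-10",
        "2014-01-03", "2014-01-05",
    ]),
    "stage": ["one", "one", "one", "two", "four", "five", "five", "two", "two"],
})

# Gap in days to the previous row of the same (group, stage); 0 on first rows
step = df.groupby(["group", "stage"])["date"].diff().dt.days.fillna(0)

# Running total of those gaps within each (group, stage)
since_first = step.groupby([df["group"], df["stage"]]).cumsum()
```

Here `step` is 0, 2, 1 for group A, stage one, while `since_first` is 0, 2, 3, which matches the 'dur' column asked for in the question.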
If you are just subtracting the first in each group, use a transform:
In [21]: df['date'] - df.groupby('stage')['date'].transform(lambda x: x.iloc[0])
Out[21]:
0 0
1 2
2 0
3 0
4 0
5 4
Name: date, dtype: int64
Note: this is probably significantly faster...