I have a pandas dataframe where one column records whether the location value in another column stays the same in the row below it (True) or changes (False). As an example:
2013-02-05 19:45:00 (39.94, -86.159) True
2013-02-05 19:50:00 (39.94, -86.159) True
2013-02-05 19:55:00 (39.94, -86.159) False
2013-02-05 20:00:00 (39.777, -85.995) False
2013-02-05 20:05:00 (39.775, -85.978) True
2013-02-05 20:10:00 (39.775, -85.978) True
2013-02-05 20:15:00 (39.775, -85.978) False
2013-02-05 20:20:00 (39.94, -86.159) True
2013-02-05 20:25:00 (39.94, -86.159) False
So what I want to do is go through this dataframe row by row and find the rows with False. Then I would (maybe) add another column holding the total 'continuous' time spent in that place. The same place can be visited again, as in the example above; in that case it is treated as a separate visit. So, for the above example, something like:
2013-02-05 19:45:00 (39.94, -86.159) True 0
2013-02-05 19:50:00 (39.94, -86.159) True 0
2013-02-05 19:55:00 (39.94, -86.159) False 15
2013-02-05 20:00:00 (39.777, -85.995) False 5
2013-02-05 20:05:00 (39.775, -85.978) True 0
2013-02-05 20:10:00 (39.775, -85.978) True 0
2013-02-05 20:15:00 (39.775, -85.978) False 15
2013-02-05 20:20:00 (39.94, -86.159) True 0
2013-02-05 20:25:00 (39.94, -86.159) False 10
I would then plot a histogram of these 'continuous' times per day using the hist() function. How would I get the second dataframe from the first by iterating through the dataframe? I'm new to Python and pandas, and the real data file is huge, so I need something reasonably efficient.
Here's another take
df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0)
df
date long lat condition group
2/5/2013 19:45:00 39.940 -86.159 True 0
2/5/2013 19:50:00 39.940 -86.159 True 0
2/5/2013 19:55:00 39.940 -86.159 False 0
2/5/2013 20:00:00 39.777 -85.995 False 1
2/5/2013 20:05:00 39.775 -85.978 True 2
2/5/2013 20:10:00 39.775 -85.978 True 2
2/5/2013 20:15:00 39.775 -85.978 False 2
2/5/2013 20:20:00 39.940 -86.159 True 3
2/5/2013 20:25:00 39.940 -86.159 False 3
df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 * len(sdf))
df
date long lat condition group result
2/5/2013 19:45:00 39.940 -86.159 True 0 15
2/5/2013 19:50:00 39.940 -86.159 True 0 15
2/5/2013 19:55:00 39.940 -86.159 False 0 15
2/5/2013 20:00:00 39.777 -85.995 False 1 5
2/5/2013 20:05:00 39.775 -85.978 True 2 15
2/5/2013 20:10:00 39.775 -85.978 True 2 15
2/5/2013 20:15:00 39.775 -85.978 False 2 15
2/5/2013 20:20:00 39.940 -86.159 True 3 10
2/5/2013 20:25:00 39.940 -86.159 False 3 10
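If you want the total only on the closing False row and 0 everywhere else, as in the question's expected output, the same grouping idea can be sketched roughly like this. It is a minimal sketch, assuming a fixed 5-minute sampling interval (as the lambda above does); the small dataframe is rebuilt here just for illustration and the long/lat columns are left out.

```python
import pandas as pd

# Illustrative rebuild of the example data (location columns omitted).
df = pd.DataFrame(
    {
        "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
        "condition": [True, True, False, False, True, True, False, True, False],
    }
)

# A stay ends on each False row, so a new group starts on the row after a False.
group = (~df["condition"]).cumsum().shift(1, fill_value=0)

# Rows per stay times the sampling interval (assumed to be a fixed 5 minutes).
minutes = df.groupby(group)["condition"].transform("size") * 5

# Keep the total only on the closing (False) row of each stay, 0 elsewhere.
df["result"] = minutes.where(~df["condition"], 0)
```

Everything here is vectorized, so it should scale reasonably to a large file; if the sampling interval is not constant, the row count would need to be replaced by a difference of timestamps.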
You will need pandas 0.11-dev. I think this will give you what you are looking for. See this section: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas for more info, as timedeltas are a newer data type that pandas is supporting.
Here's your data (I separated long/lat just for convenience; the key thing is that the condition column is a bool)
In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True)
In [138]: df
Out[138]:
date long lat condition
2013-02-05 19:45:00 39.940 -86.159 True
2013-02-05 19:50:00 39.940 -86.159 True
2013-02-05 19:55:00 39.940 -86.159 False
2013-02-05 20:00:00 39.777 -85.995 False
2013-02-05 20:05:00 39.775 -85.978 True
2013-02-05 20:10:00 39.775 -85.978 True
2013-02-05 20:15:00 39.775 -85.978 False
2013-02-05 20:20:00 39.940 -86.159 True
2013-02-05 20:25:00 39.940 -86.159 False
In [139]: df.dtypes
Out[139]:
date float64
long lat float64
condition bool
dtype: object
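The `data` string is not shown above, and `StringIO.StringIO` only exists on Python 2. On Python 3 the load step might look something like the sketch below. The CSV text and its comma-separated layout are assumptions made up for illustration, so the resulting column names may not match the session output exactly.

```python
import io

import pandas as pd

# Hypothetical CSV text; the original `data` string is not given in the answer.
data = """date,long,lat,condition
2013-02-05 19:45:00,39.940,-86.159,True
2013-02-05 19:50:00,39.940,-86.159,True
2013-02-05 19:55:00,39.940,-86.159,False
2013-02-05 20:00:00,39.777,-85.995,False
2013-02-05 20:05:00,39.775,-85.978,True
2013-02-05 20:10:00,39.775,-85.978,True
2013-02-05 20:15:00,39.775,-85.978,False
2013-02-05 20:20:00,39.940,-86.159,True
2013-02-05 20:25:00,39.940,-86.159,False
"""

# io.StringIO replaces the Python 2 StringIO module; the first column
# becomes a DatetimeIndex because of index_col=0 and parse_dates=True.
df = pd.read_csv(io.StringIO(data), index_col=0, parse_dates=True)
```

The important part is still that `condition` comes out as a bool column and the index is parsed as datetimes.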
Create some date columns from the index (these are datetime64[ns] dtype)
In [140]: df['date'] = df.index
In [141]: df['rdate'] = df.index
Set rdate to NaT in the rows where condition is False (np.nan values are converted to NaT)
In [142]: df.loc[~df['condition'],'rdate'] = np.nan
Forward-fill the NaTs with the previous value
In [143]: df['rdate'] = df['rdate'].ffill()
Subtract rdate from date; this produces a timedelta64[ns] column of the time differences
In [144]: df['diff'] = df['date']-df['rdate']
In [151]: df
Out[151]:
date long lat condition rdate \
2013-02-05 19:45:00 2013-02-05 19:45:00 -86.159 True 2013-02-05 19:45:00
2013-02-05 19:50:00 2013-02-05 19:50:00 -86.159 True 2013-02-05 19:50:00
2013-02-05 19:55:00 2013-02-05 19:55:00 -86.159 False 2013-02-05 19:50:00
2013-02-05 20:00:00 2013-02-05 20:00:00 -85.995 False 2013-02-05 19:50:00
2013-02-05 20:05:00 2013-02-05 20:05:00 -85.978 True 2013-02-05 20:05:00
2013-02-05 20:10:00 2013-02-05 20:10:00 -85.978 True 2013-02-05 20:10:00
2013-02-05 20:15:00 2013-02-05 20:15:00 -85.978 False 2013-02-05 20:10:00
2013-02-05 20:20:00 2013-02-05 20:20:00 -86.159 True 2013-02-05 20:20:00
2013-02-05 20:25:00 2013-02-05 20:25:00 -86.159 False 2013-02-05 20:20:00
diff
2013-02-05 19:45:00 00:00:00
2013-02-05 19:50:00 00:00:00
2013-02-05 19:55:00 00:05:00
2013-02-05 20:00:00 00:10:00
2013-02-05 20:05:00 00:00:00
2013-02-05 20:10:00 00:00:00
2013-02-05 20:15:00 00:05:00
2013-02-05 20:20:00 00:00:00
2013-02-05 20:25:00 00:05:00
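The diff values above are measured from the last True row, so they do not line up with the totals asked for in the question (15, 5, 15, 10). One possible variation on the same ffill idea is to anchor rdate at the first row of each stay instead. This is only a sketch: the `starts` helper and the fixed 5-minute sampling interval are assumptions, and the small frame is rebuilt for illustration.

```python
import pandas as pd

# Illustrative rebuild of the example (location columns omitted).
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
)
df["date"] = df.index

# A stay starts on the first row and on every row that follows a False.
starts = (~df["condition"]).shift(1, fill_value=True).astype(bool)

# Keep the timestamp only on start rows (NaT elsewhere), then carry it forward.
df["rdate"] = df["date"].where(starts).ffill()

# Elapsed time since the start of the stay, plus one sampling interval
# (assumed 5 minutes) so that a single-row stay counts as 5 minutes.
elapsed = df["date"] - df["rdate"] + pd.Timedelta(minutes=5)

# Report the total only on the closing (False) row of each stay.
df["diff"] = elapsed.where(~df["condition"], pd.Timedelta(0))
```

With the example data this puts 15, 5, 15 and 10 minutes on the four False rows and zero on the rest.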
The diff column is now timedelta64[ns], and you want integers in minutes (FYI, this is a little bit clunky right now, as pandas doesn't have a scalar Timedelta type similar to Timestamp for dates)
(Also, you may have to do a shift() on this rdate series before you ffill; I think I am off by 1 somewhere, but this is the idea.)
In [175]: df['diff'].map(lambda x: x.item().seconds/60)
Out[175]:
2013-02-05 19:45:00 0
2013-02-05 19:50:00 0
2013-02-05 19:55:00 5
2013-02-05 20:00:00 10
2013-02-05 20:05:00 0
2013-02-05 20:10:00 0
2013-02-05 20:15:00 5
2013-02-05 20:20:00 0
2013-02-05 20:25:00 5
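On more recent pandas versions the `.item()` trick may not work, since the elements of a timedelta column come back as `Timedelta` scalars. A rough equivalent using the `.dt` accessor is sketched below; the `diff` series is rebuilt from the values shown above just to keep the example self-contained.

```python
import pandas as pd

# Same differences as in the output above, rebuilt as a timedelta64[ns] Series.
diff = pd.Series(pd.to_timedelta([0, 0, 5, 10, 0, 0, 5, 0, 5], unit="min"))

# total_seconds() returns floats; integer-divide by 60 to get whole minutes.
minutes = (diff.dt.total_seconds() // 60).astype(int)
```

Unlike `.seconds`, `total_seconds()` also accounts for differences longer than a day, which could matter on a large file.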