I am new to pandas. I have a dataframe with two columns: dt (a date-time stamp) and value.
Given start and end date-time stamps, is there an easy way to create a new dataframe from the original one that contains only the rows between those two stamps?
dt value
84 7/23/2014 7:00 0.300
85 7/23/2014 7:05 0.300
86 7/23/2014 7:10 0.312
87 7/23/2014 7:15 0.300
88 7/23/2014 7:20 0.300
89 7/23/2014 7:25 0.300
90 7/23/2014 7:30 0.300
91 7/23/2014 7:35 0.300
92 7/23/2014 7:40 0.300
93 7/23/2014 7:45 0.216
94 7/23/2014 7:50 0.204
95 7/23/2014 7:55 0.228
96 7/23/2014 8:00 0.228
97 7/23/2014 8:05 0.228
98 7/23/2014 8:10 0.228
99 7/23/2014 8:15 0.240
100 7/23/2014 8:20 0.228
101 7/23/2014 8:25 0.216
102 7/23/2014 8:30 0.228
103 7/23/2014 8:35 0.324
104 7/23/2014 8:40 0.336
105 7/23/2014 8:45 0.324
106 7/23/2014 8:50 0.324
107 7/23/2014 8:55 0.324
108 7/23/2014 9:00 0.252
109 7/23/2014 9:05 0.252
110 7/23/2014 9:10 0.240
111 7/23/2014 9:15 0.240
112 7/23/2014 9:20 0.252
113 7/23/2014 9:25 0.240
.. ... ...
198 7/23/2014 16:30 0.132
199 7/23/2014 16:35 0.120
200 7/23/2014 16:40 0.180
201 7/23/2014 16:45 0.216
202 7/23/2014 16:50 0.204
203 7/23/2014 16:55 0.192
As long as dt already has datetime dtype you can filter using date strings; if not, you can convert it first:
df['dt'] = pd.to_datetime(df['dt'])
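For the m/d/yyyy h:mm strings shown above, passing an explicit format string (assumed here from the sample data) removes any ambiguity between day-first and month-first parsing and speeds up the conversion. A small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical sample matching the first few rows above
df = pd.DataFrame({
    "dt": ["7/23/2014 7:00", "7/23/2014 7:05", "7/23/2014 7:10"],
    "value": [0.300, 0.300, 0.312],
})

# Explicit format (an assumption based on the sample) avoids
# month-first vs day-first ambiguity
df["dt"] = pd.to_datetime(df["dt"], format="%m/%d/%Y %H:%M")
```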
Then filter:
In [115]:
df[(df['dt'] > '2014-07-23 07:30:00') & (df['dt'] < '2014-07-23 09:00:00')]
Out[115]:
dt value
index
91 2014-07-23 07:35:00 0.300
92 2014-07-23 07:40:00 0.300
93 2014-07-23 07:45:00 0.216
94 2014-07-23 07:50:00 0.204
95 2014-07-23 07:55:00 0.228
96 2014-07-23 08:00:00 0.228
97 2014-07-23 08:05:00 0.228
98 2014-07-23 08:10:00 0.228
99 2014-07-23 08:15:00 0.240
100 2014-07-23 08:20:00 0.228
101 2014-07-23 08:25:00 0.216
102 2014-07-23 08:30:00 0.228
103 2014-07-23 08:35:00 0.324
104 2014-07-23 08:40:00 0.336
105 2014-07-23 08:45:00 0.324
106 2014-07-23 08:50:00 0.324
107 2014-07-23 08:55:00 0.324
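An equivalent, arguably more readable filter uses Series.between, which is inclusive of both endpoints by default. A sketch with synthetic data standing in for the frame above:

```python
import pandas as pd

# Synthetic stand-in for the original data: 5-minute stamps
df = pd.DataFrame({
    "dt": pd.date_range("2014-07-23 07:00", periods=24, freq="5min"),
    "value": [0.3] * 24,
})

# between() is inclusive on both bounds by default;
# pass inclusive="neither" to match the strict < / > filter above
mask = df["dt"].between("2014-07-23 07:30:00", "2014-07-23 09:00:00")
subset = df[mask]
```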
You can also use query:
In [25]: df.query('"2014-07-23 07:55:00" <= dt <= "2014-07-23 08:20:00"')
Out[25]:
dt value
95 2014-07-23 07:55:00 0.228
96 2014-07-23 08:00:00 0.228
97 2014-07-23 08:05:00 0.228
98 2014-07-23 08:10:00 0.228
99 2014-07-23 08:15:00 0.240
100 2014-07-23 08:20:00 0.228
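If dt is set as the index, you can also slice with .loc: label-based slicing on a sorted DatetimeIndex is concise and includes both endpoints. A sketch with synthetic data:

```python
import pandas as pd

# Synthetic stand-in for the original data, with dt as the index
df = pd.DataFrame({
    "dt": pd.date_range("2014-07-23 07:00", periods=24, freq="5min"),
    "value": [0.3] * 24,
}).set_index("dt")

# Label-based slicing on a DatetimeIndex; both endpoints are included
subset = df.loc["2014-07-23 07:55:00":"2014-07-23 08:20:00"]
```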
The answer above is correct, but for people who, like me, stumble upon this question more than 5 years after it was posted, I want to add this remark.
If you want to filter on a sorted column (and timestamps tend to be sorted), it is more efficient to use the searchsorted method of a pandas Series, which runs in O(log n) instead of O(n).
In the example below the difference is well over a factor of 1000. The gap can be made arbitrarily large because of the difference in complexity, of course, but these are the sizes I was using when I stumbled upon this question.
import pandas as pd
import numpy as np
import timeit

N = 500000
M = 200
data = np.hstack([np.arange(0., N).reshape(N, 1), np.random.randn(N, M - 1)])
df = pd.DataFrame(data, columns=["column" + str(i) for i in range(M)])

def return_first(df):
    # Boolean mask: scans the whole column, O(n)
    return df[(df['column0'] > 100.5) & (df['column0'] < 400000.5)]

def return_second(df):
    # Binary search on the sorted column, O(log n)
    t1 = df['column0'].searchsorted(100.5)
    t2 = df['column0'].searchsorted(400000.5)
    return df.loc[t1:t2 - 1]

if __name__ == '__main__':
    t = timeit.timeit(lambda: return_first(df), number=100)
    print(t)
    t = timeit.timeit(lambda: return_second(df), number=100)
    print(t)
results:
59.1751627
0.015401400000001786