I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency is key, but time being more important. Is there some vectorized, built in method to do this in pandas?
You can do this with some aggregation and shifting your time series that should result in more efficient calculations.
First aggregate your data by closingDate
.
g = df.groupby("closingDate")
Next you can shift your data to offset by a day.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last
value will be from the previous minute. Now you can join to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column you can pivot the results. By grouping the closingDate
and then applying the shift you avoid shifting dates across days. If you look at the first observation of each day you'll get a NaN
since the values won't be shifted across separate days.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With