pandas

Question

I have a DataFrame that looks like this:

        closingDate                Time   Last
0        1997-09-09 2018-12-13 00:00:00  1000
1        1997-09-09 2018-12-13 00:01:00  1002      
2        1997-09-09 2018-12-13 00:02:00  1001   
3        1997-09-09 2018-12-13 00:03:00  1005

I want to create a DataFrame with roughly 1440 columns labeled by time of day, where each value is that day's return over the prior minute:

        closingDate            00:00:00   00:01:00   00:02:00
0        1997-09-09 2018-12-13  -0.08        0.02     -0.001    ...
1        1997-09-10 2018-12-13        ...

My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Both time and memory efficiency matter, with time being more important. Is there a vectorized, built-in method to do this in pandas?

vielkind · Accepted Answer

You can do this by grouping your time series and shifting it within each group, which keeps the calculation vectorized.

First, group your data by closingDate.

g = df.groupby("closingDate")

Next, shift the data within each group by one row, which is one minute if every minute of the day has a row.

shifted = g.shift(periods=1)

This creates a new DataFrame in which each row's Last value comes from the previous minute of the same day. You can now join it to your original DataFrame on the index.

df = df.merge(shifted, left_index=True, right_index=True)

This adds the shifted columns to the DataFrame, with the suffix _x on the original columns and _y on the shifted ones, which you can then use to compute the return.

df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
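The steps above can be sketched end to end on a small frame. The data below is made up for illustration (two days of three minutes each), the name merged is a placeholder, and the sketch assumes rows are already sorted by Time within each closingDate.

```python
import pandas as pd

# Toy data, invented for illustration: two days, three minutes each.
df = pd.DataFrame({
    "closingDate": pd.to_datetime(["1997-09-09"] * 3 + ["1997-09-10"] * 3),
    "Time": pd.to_datetime([
        "2018-12-13 00:00:00", "2018-12-13 00:01:00", "2018-12-13 00:02:00",
        "2018-12-14 00:00:00", "2018-12-14 00:01:00", "2018-12-14 00:02:00",
    ]),
    "Last": [1000.0, 1002.0, 1001.0, 2000.0, 2010.0, 2005.0],
})

# Shift by one row within each closingDate, so the first row of a day
# has no predecessor. The grouping column is not part of the result.
shifted = df.groupby("closingDate").shift(periods=1)

# Align on the row index; the overlapping columns (Time, Last) get the
# default _x (original) and _y (shifted) suffixes.
merged = df.merge(shifted, left_index=True, right_index=True)

# Return over the prior minute.
merged["Diff"] = (merged["Last_x"] - merged["Last_y"]) / merged["Last_y"]

print(merged[["closingDate", "Time_x", "Last_x", "Last_y", "Diff"]])
```

The first row of each closingDate has NaN in Diff, because there is no earlier row in that group to compare against.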

You now have all the data you're looking for. If you need each minute to be its own column, you can pivot the results. Because the shift is applied within each closingDate group, values are never shifted across days, so the first observation of each day is NaN.
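A minimal sketch of the pivot step follows, again on made-up data. It shifts only the Last column, which avoids the merge, and uses a helper column named minute (a hypothetical name) holding the time of day as a string. It assumes each (closingDate, minute) pair occurs at most once, since DataFrame.pivot raises an error on duplicates.

```python
import pandas as pd

# Toy data, invented for illustration: two days, three minutes each.
df = pd.DataFrame({
    "closingDate": pd.to_datetime(["1997-09-09"] * 3 + ["1997-09-10"] * 3),
    "Time": pd.to_datetime([
        "2018-12-13 00:00:00", "2018-12-13 00:01:00", "2018-12-13 00:02:00",
        "2018-12-14 00:00:00", "2018-12-14 00:01:00", "2018-12-14 00:02:00",
    ]),
    "Last": [1000.0, 1002.0, 1001.0, 2000.0, 2010.0, 2005.0],
})

# Previous minute's Last within the same closingDate (NaN on the first row).
prev = df.groupby("closingDate")["Last"].shift(1)
df["Diff"] = (df["Last"] - prev) / prev

# Time of day as a string label, used for the wide column names.
df["minute"] = df["Time"].dt.strftime("%H:%M:%S")

# One row per closingDate, one column per minute of the day.
wide = df.pivot(index="closingDate", columns="minute", values="Diff")

print(wide)
```

With a full day of data this gives up to 1440 columns, and the 00:00:00 column is all NaN because each day's first minute has no prior value.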

pandas - efficiently computing minutely returns as columns on intraday data

Tags:

python

Évariste Galois
