Say you have a DataFrame holding a 1-minute time series with a datetime index, 4 columns and 4 million rows. When you try to do something like:
conversion = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
df1 = df.resample('5Min', how=conversion)
It takes an absurd amount of time (20-30 minutes). How can I speed up this process?
pandas 0.18, Python 2.7
Resample hourly data to daily data: to simplify a plot that has many data points because of hourly records, you can aggregate the data for each day using the .resample() method. Temporal resampling takes all of the values for each day and summarizes them.
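As a minimal sketch of that hourly-to-daily aggregation (the frame `hourly`, its `value` column and the three days of data are made up for illustration, not taken from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 72 hourly readings starting at midnight, i.e. 3 full days.
idx = pd.to_datetime("2020-01-01") + pd.to_timedelta(np.arange(72), unit="h")
hourly = pd.DataFrame({"value": np.arange(72, dtype=float)}, index=idx)

# Collapse each calendar day into a single summary value (here, the mean).
daily = hourly["value"].resample("D").mean()
print(daily)
```

Any other reducer (sum, max, first, and so on) can be used in place of mean, depending on what the daily summary should represent.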
resample() is a convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or you must pass datetime-like values via the on or level keyword.
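A small sketch of the on keyword, assuming the timestamps live in an ordinary column rather than in the index (the column names `ts` and `price` are hypothetical):

```python
import pandas as pd

# Hypothetical frame with a default integer index and a timestamp column.
ticks = pd.DataFrame({
    "ts": pd.to_datetime(["2020-01-01 00:00",
                          "2020-01-01 00:03",
                          "2020-01-01 00:07"]),
    "price": [1.0, 2.0, 3.0],
})

# Resample on the "ts" column instead of the index, in 5-minute bins.
binned = ticks.resample("5min", on="ts")["price"].sum()
print(binned)
```

The first two rows fall into the 00:00 bin and the third into the 00:05 bin.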
Operations on a DataFrame in pandas can be slow on large data because pandas generally runs its computations on a single CPU core and does not take advantage of a multi-core CPU.
Resample seems to work quite fast on a dataset of size (4000000, 4):
import numpy as np
import pandas as pd
idx = pd.date_range('1/1/2010', periods=4000000, freq='min')
df = pd.DataFrame(np.random.rand(4000000, 4), columns = ["Open", "High", "Low", "Close"], index = idx)
%timeit df.resample("5Min").agg(conversion)
1 loop, best of 3: 253 ms per loop
With an irregular index and some NaNs:
idx1 = pd.date_range('1/1/1900', periods=10000000, freq='Min')
idx2 = pd.date_range('1/1/1940', periods=10000000, freq='Min')
idx3 = pd.date_range('1/1/1980', periods=10000000, freq='Min')
idx4 = pd.date_range('1/1/2020', periods=10000000, freq='Min')
idx = np.array([np.random.choice(idx1, 1000000), np.random.choice(idx2, 1000000), np.random.choice(idx3, 1000000),
np.random.choice(idx4, 1000000)]).flatten()
np.random.shuffle(idx)
df = pd.DataFrame(np.random.randint(100, size=(4000000, 4)), columns = ["Open", "High", "Low", "Close"], index = idx)
df.loc[np.random.choice(idx, 100000), "Open"] = np.nan
df.loc[np.random.choice(idx, 50000), "High"] = np.nan
df.loc[np.random.choice(idx, 500000), "Low"] = np.nan
df.loc[np.random.choice(idx, 20000), "Close"] = np.nan
%timeit df.resample("5Min").agg(conversion)
1 loop, best of 3: 5.06 s per loop
So it seems that something other than resample is taking the time in your case.
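One way to look for that "something else" is to inspect the frame before resampling. The sketch below is only a guess at likely culprits (columns that are not numeric, or an unsorted index); the helper `describe_for_resample` and the sample data are hypothetical and nothing in the question confirms that either applies:

```python
import numpy as np
import pandas as pd

def describe_for_resample(frame):
    """Summarize properties that may affect how fast resample/agg runs."""
    return {
        "index_type": type(frame.index).__name__,
        "index_sorted": bool(frame.index.is_monotonic_increasing),
        # Non-numeric columns (e.g. prices stored as strings) cannot use
        # the fast numeric aggregation paths.
        "non_numeric_columns": [
            c for c in frame.columns
            if not pd.api.types.is_numeric_dtype(frame[c])
        ],
    }

# Hypothetical sample: "Open" was loaded as text, "Close" as floats.
idx = pd.to_datetime("2010-01-01") + pd.to_timedelta(np.arange(10), unit="min")
sample = pd.DataFrame({"Open": ["1.0"] * 10, "Close": np.ones(10)}, index=idx)

report = describe_for_resample(sample)
print(report)

# If a column turns out to be non-numeric, converting it is one possible fix.
sample["Open"] = pd.to_numeric(sample["Open"])
```

If the report shows a DatetimeIndex and only numeric columns, the time is probably being spent elsewhere, for example in loading or parsing the data.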