Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speed up resample procedure in Pandas?

Say you have a dataframe of 1 minute time series with index, 4 columns and 4 million rows. When you try to do something like:

 conversion = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
 df1 = df.resample('5Min', how=conversion)

It takes an absurd amount of time (20-30 minutes). How can I speed up this process?

Pandas 18, Python 2.7

like image 319
hernanavella Avatar asked May 01 '16 19:05

hernanavella


People also ask

How do you resample hourly data in Python?

Resample Hourly Data to Daily Data To simplify your plot which has a lot of data points due to the hourly records, you can aggregate the data for each day using the . resample() method. To aggregate or temporal resample the data for a time period, you can take all of the values for each day and summarize them.

What is resample (' MS ') in Python?

The resample() function is used to resample time-series data. Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

Does Pandas use Multiplecores?

Operations on data frame using Pandas is slow, as it uses a single-core of CPU to perform the computations, and does not take advantage of a multi-core CPU.


Video Answer


1 Answers

Resample seems to work quite fast on a dataset of size (4000000, 4):

idx = pd.date_range('1/1/2010', periods=4000000, freq='T')
df = pd.DataFrame(np.random.rand(4000000, 4), columns = ["Open", "High", "Low", "Close"], index = idx)
%timeit df.resample("5Min").agg(conversion)
1 loop, best of 3: 253 ms per loop

With an irregular index and some nan's:

idx1 = pd.date_range('1/1/1900', periods=10000000, freq='Min')
idx2 = pd.date_range('1/1/1940', periods=10000000, freq='Min')
idx3 = pd.date_range('1/1/1980', periods=10000000, freq='Min')
idx4 = pd.date_range('1/1/2020', periods=10000000, freq='Min')
idx = np.array([np.random.choice(idx1, 1000000), np.random.choice(idx2, 1000000), np.random.choice(idx3, 1000000), 
                np.random.choice(idx4, 1000000)]).flatten()
np.random.shuffle(idx)
df = pd.DataFrame(np.random.randint(100, size=(4000000, 4)), columns = ["Open", "High", "Low", "Close"], index = idx)
df.loc[np.random.choice(idx, 100000), "Open"] = np.nan
df.loc[np.random.choice(idx, 50000), "High"] = np.nan
df.loc[np.random.choice(idx, 500000), "Low"] = np.nan
df.loc[np.random.choice(idx, 20000), "Close"] = np.nan
%timeit df.resample("5Min").agg(conversion)
1 loop, best of 3: 5.06 s per loop

So it seems like something other than resample is taking time for your case.

like image 139
ayhan Avatar answered Oct 27 '22 05:10

ayhan