Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pickle files loading time

I was just testing which file format suits my problem best and benchmarked a few formats. While doing this I stumbled upon the following, in my opinion, weird time difference when loading files.

I created a random large DataFrame for testing with the following code (since my problem uses time series data its also used in my example):

rows = 5000000
cols = 50

df = pd.DataFrame(
    data=np.random.rand(rows, cols),
    index=pd.date_range(
        start='2014-01-01',
        periods=rows,
        freq='min'
    ),
)

Then I saved the DataFrame using df.to_pickle('test.pkl') and started benchmarking the loading time. Since I am dealing with large datasets I also wanted to know how the loading time scales when only needing part of the time series. I mainly used this to see if other formats that are able to only load part of the file are able to get smaller loading times when doing so. Since pickle files need to be loaded completely I just included a .loc[time_stamp:] to get the same slicing in the end. These are my results:

In [1]: %timeit pd.read_pickle('test.pkl')
Out[1]: 1.94 s ± 182 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: %timeit pd.read_pickle('test.pkl').loc[pd.to_datetime('20200101 00:00'):]
Out[2]: 1.73 s ± 237 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I know the time differences are not large, but I tested this multiple times over the day and the second method (with slicing) is always the faster one. Which gets to my "problem". Why is the loading method using .loc[time_stamp:] faster? I thought both attempts are doing exactly the same thing, namely loading the whole pickle file. The only difference would be using approx. half of the DataFrame in the second approach, which shouldnt make a big difference. What am I missing?

Next question would be: Is this a way to actually save time when using pickle? Just slicing the DataFrame in the line I am loading it?

Hope someone can get some insights into this or maybe testing the same thing and proving me wrong with my observations.

like image 511
JD. Avatar asked May 20 '26 22:05

JD.


1 Answers

I was suspicious of the accuracy of the timeits, so I tried a similar approach, scaling up the DataFrame size a little bit :

rows = 10000000
cols = 100
df = pd.DataFrame(
    data=np.random.rand(rows, cols),
    index=pd.date_range(
        start='2014-01-01',
        periods=rows,
        freq='min'
    ),
)

At first I could replicate your issue :

%timeit pd.read_pickle(F)
2.67 s ± 362 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.read_pickle(F).loc[pd.to_datetime('20200101 00:00'):]
2.51 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

But it turns out we're simply being fooled by (I assume) processor caches ; the faster operation of the two is always the latest ; here I run the slicing one before the direct load :

%timeit pd.read_pickle(F).loc[pd.to_datetime('20200101 00:00'):]
2.87 s ± 879 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.read_pickle(F)
2.64 s ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If we run the timeits a couple times more, we can see the expected result (the slicing one being just a little bit slower) :

%timeit pd.read_pickle(F).loc[pd.to_datetime('20200101 00:00'):]
2.66 s ± 87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.read_pickle(F)
2.63 s ± 70.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 121
user12932966 Avatar answered May 26 '26 02:05

user12932966