
Why is Pandas so madly fast? How to define such functions?

  • I compared the performance of Pandas with a traditional loop. With the same input and output, Pandas computed the result dramatically faster than the loop.

My code:

#df_1h has been imported before

import time

n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))

close = df_1h['close']

start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))

start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))

display(df_1h.tail())

Output:

"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
        open_time       open        high        low         ... ignore  rsi_14  sma_14_pandas   sma_14_loop
16593   1.562980e+12    11707.39    11739.90    11606.04    ... 0.0 51.813151   11646.625714    11646.625714
16594   1.562983e+12    11664.32    11712.61    11625.00    ... 0.0 49.952679   11646.834286    11646.834286
16595   1.562987e+12    11632.64    11686.47    11510.00    ... 0.0 47.583619   11643.321429    11643.321429
16596   1.562990e+12    11582.06    11624.04    11500.00    ... 0.0 48.725262   11644.912857    11644.912857
16597   1.562994e+12    11604.96    11660.00    11588.16    ... 0.0 50.797087   11656.723571    11656.723571
5 rows × 15 columns
  • Pandas is almost 2.5k times faster!

My Questions:

  • Is my code wrong?
  • If my code is correct, why is Pandas so fast?
  • How to define custom functions that run so fast for Pandas?
Thai D. V. asked Jul 13 '19 10:07



1 Answer

As to your three questions:

  1. Your code is correct in the sense that it produces the correct result. As a rule, however, explicitly iterating over the rows of a dataframe is a poor choice for performance. Most often the same result can be achieved far more efficiently with pandas methods (as you demonstrated yourself).
  2. Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
  3. Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array, and only then store the result back to the dataframe or series.
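The pattern from point 3 can be sketched as follows. This is only a minimal illustration: the column names, the random data, and the typical_price helper are made up for the example and are not part of the question's df_1h.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the question's df_1h.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "high": rng.uniform(11000, 12000, 1000),
    "low": rng.uniform(10000, 11000, 1000),
    "close": rng.uniform(10500, 11500, 1000),
})

def typical_price(high, low, close):
    """Illustrative custom indicator that works on plain NumPy arrays."""
    return (high + low + close) / 3.0

# 1. Pull NumPy arrays out of the DataFrame.
high = df["high"].to_numpy()
low = df["low"].to_numpy()
close = df["close"].to_numpy()

# 2. Do all the arithmetic on the arrays.
result = typical_price(high, low, close)

# 3. Store the result back in a single assignment, not element by element.
df["typical_price"] = result
```

The point is that the DataFrame is touched only twice, once to read and once to write, so the per-element pandas overhead is never paid inside a loop.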

Why is the loop algorithm so slow?

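In your loop algorithm, the mean is calculated over 16,500 times, each time adding up 14 elements to find the mean. Each iteration also pays the overhead of a Python-level slice, a method call, and an element assignment into the dataframe. Pandas' rolling method uses a more sophisticated approach, heavily reducing the number of arithmetic operations.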
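To show how the number of arithmetic operations can be reduced, here is a pure-Python sketch of the running-sum idea. It illustrates the general technique only and is not pandas' actual implementation; the function names sma_naive and sma_running are invented for this example.

```python
import math

def sma_naive(values, n):
    """Recompute every window from scratch: about n additions per output."""
    out = [math.nan] * len(values)
    for i in range(n - 1, len(values)):
        out[i] = sum(values[i - n + 1:i + 1]) / n
    return out

def sma_running(values, n):
    """Keep a running sum: one addition and one subtraction per output."""
    out = [math.nan] * len(values)
    window_sum = 0.0
    for i, v in enumerate(values):
        window_sum += v                 # value entering the window
        if i >= n:
            window_sum -= values[i - n]  # value leaving the window
        if i >= n - 1:
            out[i] = window_sum / n
    return out

data = [float(x) for x in range(1, 11)]
print(sma_naive(data, 3)[2:])    # [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(sma_running(data, 3)[2:])  # [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

With the running sum, the work per output no longer depends on the window size, so a 14-element window costs about the same as a 3-element one.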

You can match pandas' performance (and in fact run about 3 times faster) by doing the calculations in numpy, as the following example illustrates:

import pandas as pd
import numpy as np
import time

data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))

start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))

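start = time.time()
df_1h['SMA_14_loop'] = np.nan
# Note: df[col][i] = ... is chained assignment, kept here to match the question.
# It can raise SettingWithCopyWarning and does not update df_1h when pandas
# copy-on-write is enabled; df_1h.loc[i, 'SMA_14_loop'] = ... avoids both.
for i in range(n-1, df_1h.shape[0]):
    df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop:   {}'.format(time.time() - start))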

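def np_sma(a, n=14):
    # Cumulative sum: ret[i] = a[0] + ... + a[i]
    ret = np.cumsum(a)
    # Subtract the sum that ended n positions earlier, leaving the sum of the last n values
    ret[n:] = ret[n:] - ret[:-n]
    # The first n-1 positions have no full window, so pad them with NaN
    return np.append([np.nan]*(n-1), ret[n-1:] / n)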

start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np:     {}'.format(time.time() - start))

assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)

Output:

df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop:   7.605962753295898
np:     0.0010571479797363281
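As a possible alternative to the cumulative-sum trick, the windows can be built explicitly with NumPy's sliding_window_view. This is a sketch under the assumption that NumPy 1.20 or later is installed; np_sma_windows is a hypothetical helper name, the data is random, and no timing is claimed for it here.

```python
import time
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

def np_sma_windows(a, n=14):
    """SMA from a strided view of all length-n windows (hypothetical helper)."""
    # The view has shape (len(a) - n + 1, n) and does not copy the data.
    means = sliding_window_view(a, n).mean(axis=1)
    # Pad the first n-1 positions, which have no full window, with NaN.
    return np.concatenate([np.full(n - 1, np.nan), means])

a = np.random.default_rng(0).uniform(10000, 15000, 16598)

# perf_counter is better suited to short intervals than time.time().
start = time.perf_counter()
result = np_sma_windows(a)
print('np windows: {}'.format(time.perf_counter() - start))
```

This version still adds up 14 elements per window, so it does more arithmetic than the cumulative sum, but it avoids the accumulated rounding error that a long running sum can build up and it extends easily to other window statistics such as the median.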
Stef answered Oct 13 '22 12:10