pct_change and log returns differ from actual values

I'm working on a dataframe with prices. I find that the returns, whether calculated arithmetically or as log returns, differ from the actual return between the first price and the last. As I see it they should be the same, or differ only by a small fraction.

dfset.head()
                       Open   Close    High     Low      Volume
Date_utc                                                       
2017-12-01 00:00:00  432.01  434.56  435.09  432.01  781.788110
2017-12-01 00:05:00  434.25  435.82  436.98  434.25  584.017105
2017-12-01 00:10:00  435.81  435.50  436.39  434.80  494.047392
2017-12-01 00:15:00  435.88  435.10  436.07  434.50  527.840340
2017-12-01 00:20:00  434.51  433.50  434.95  432.98  458.557971


dfset.tail()
                       Open   Close    High     Low       Volume
Date_utc                                                        
2017-12-21 23:40:00  781.41  781.01  783.46  778.12   792.433089
2017-12-21 23:45:00  779.60  784.76  784.90  778.20   657.316066
2017-12-21 23:50:00  784.83  783.42  784.90  782.22   473.108867
2017-12-21 23:55:00  783.40  786.98  787.00  782.62  1492.764405
2017-12-22 00:00:00  786.96  791.93  792.00  786.86  1745.559100

when calculating returns either by:

dfset['Close'].pct_change().sum()
0.694478597676

or using log returns:

np.log(dfset['Close'] / dfset['Close'].shift(1)).sum()
0.60013897914

Actual overall return which I consider to be the correct one:

dfset['Close'].iloc[len(dfset) - 1] / dfset['Close'].iloc[0] - 1
0.822372054492

Any ideas why the arithmetic and log returns are off?

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.1
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0
Asked Feb 27 '18 by OctavianD

1 Answer

I think the three operations are quite different. I will use only the tail of the data to illustrate.

In the first place:

print( dfset['Close'].pct_change()) 

2017-12-21         NaN
2017-12-21    0.004801
2017-12-21   -0.001708
2017-12-21    0.004544
2017-12-22    0.006290
Name: Close, dtype: float64

is equivalent to:

print(dfset['Close'].diff()/dfset['Close'].shift(1))

2017-12-21         NaN
2017-12-21    0.004801
2017-12-21   -0.001708
2017-12-21    0.004544
2017-12-22    0.006290
Name: Close, dtype: float64

So their sums are equal:

print((dfset['Close'].diff()/dfset['Close'].shift(1)).sum())
0.013927992282837915
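This equivalence is easy to check on a small synthetic series (prices made up here, not from the question's CSV):

```python
import numpy as np
import pandas as pd

# hypothetical prices, just to illustrate the identity
close = pd.Series([100.0, 105.0, 102.0, 110.0])

a = close.pct_change()                 # (p_t - p_{t-1}) / p_{t-1}
b = close.diff() / close.shift(1)      # the same ratio, written out

# identical up to floating-point noise; the first element of both is NaN
print(np.allclose(a, b, equal_nan=True))  # True
```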

So I don't see why

np.log(dfset['Close'] / dfset['Close'].shift(1))

should be equal to the pct_change.

print(np.log(dfset['Close'] / dfset['Close'].shift(1)))

2017-12-21         NaN
2017-12-21    0.004790
2017-12-21   -0.001709
2017-12-21    0.004534
2017-12-22    0.006270
Name: Close, dtype: float64

The result is similar because for small changes log(1 + r) ≈ r, so omitting the subtraction of 1 and the exponential barely moves the numbers. But that does not make the two quantities mathematically equal.
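The first-order approximation log(1 + r) ≈ r can be seen numerically (return values chosen for illustration):

```python
import numpy as np

# log1p(r) computes log(1 + r); the gap to r grows as r gets larger
for r in [0.001, 0.01, 0.05]:
    print(r, np.log1p(r), np.log1p(r) - r)
```

For intraday-sized moves like the 0.4–0.6% changes in the question's tail, the two values agree to four or five decimals, which is exactly why the per-row outputs above look so close.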

Normally, to avoid divisions I would take logarithms, subtract them, and then apply the exponential to get back. In any case, to replicate pct_change:

print(np.log((dfset['Close'] / dfset['Close'].shift(1))-1).apply(np.exp))
2017-12-21         NaN
2017-12-21    0.004801
2017-12-21         NaN
2017-12-21    0.004544
2017-12-22    0.006290
Name: Close, dtype: float64

print((np.log(dfset['Close'].diff()) -  np.log(dfset['Close'].shift(1))).apply(np.exp))

2017-12-21         NaN
2017-12-21    0.004801
2017-12-21         NaN
2017-12-21    0.004544
2017-12-22    0.006290
Name: Close, dtype: float64

In any case, using logarithm will return NaN for negative values.
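A minimal illustration (synthetic numbers) of why those rows drop out: np.log of a negative value is NaN, and pandas' sum skips NaNs by default:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -2.0, 3.0])
with np.errstate(invalid='ignore'):   # silence the RuntimeWarning for log(-2)
    logged = np.log(s)

print(logged.isna().sum())  # 1  -- the negative entry became NaN
print(logged.sum())         # log(1) + log(3); the NaN is skipped (skipna=True)
```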

So the sum of the elements differs from pct_change().sum(), because the NaN rows are skipped:

print((np.log(dfset['Close'].diff()) -  np.log(dfset['Close'].shift(1))).apply(np.exp).sum())

0.015635520699169063

Finally, the last expression matches the first (note: instead of using .iloc[len(dfset) - 1] to find the last element, you can simply write .iloc[-1]):

print(dfset['Close'].iloc[-1] / dfset['Close'].iloc[0] - 1)

0.013981895238217135

There is a difference in the 5th decimal between the first approach and this one (about 4% relative to the first, or 5.390295537921995e-05 in absolute terms). This is not a floating-point storage issue: summing simple returns ignores compounding, so the cumulative sum only approximates the total change.
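Note also that the question's summed log returns are consistent with its overall return: log returns telescope, so their sum equals log(last/first), and exponentiating recovers the price ratio (indeed, exp(0.60013897914) − 1 ≈ 0.8224, the question's actual return). A sketch with made-up prices:

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 104.0, 99.0, 120.0])

log_ret_sum = np.log(close / close.shift(1)).sum()
total_return = close.iloc[-1] / close.iloc[0] - 1

# sum of log returns = log(last/first), so exp(sum) - 1 = total simple return
print(np.isclose(np.exp(log_ret_sum) - 1, total_return))  # True
```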

EDITED: PLOTTING THE COMPOUND INTEREST

You explained in the comments that you want to plot the cumulative sum, and that this is what differs from the total change dfset['Close'].iloc[-1] / dfset['Close'].iloc[0] - 1.

The reason behind is that the cumulative sum of percent change in the range of dates is not equal to the percent change between the first element and the last of the interval.

To do so you have to use compound interest, the formula for the total increment when there are successive changes between time steps. This way, using the csv from your comment, you will match the change between the first and the last day by doing:

print(((dfset['Close'].pct_change()+1).cumprod()-1).iloc[-1])

0.8223720544918787

import matplotlib.pyplot as plt
((dfset['Close'].pct_change()+1).cumprod()-1).plot()
plt.show()
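The compounding identity holds for any price series (synthetic here): multiplying the per-step growth factors (1 + r) telescopes to last/first, so the final value of the cumulative product minus 1 equals the total change.

```python
import pandas as pd

close = pd.Series([100.0, 103.0, 98.5, 111.2])

# (1 + r_t) = p_t / p_{t-1}; their cumulative product telescopes to p_t / p_0
compounded = (close.pct_change() + 1).cumprod() - 1
total = close.iloc[-1] / close.iloc[0] - 1

print(abs(compounded.iloc[-1] - total) < 1e-9)  # True
```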

[Plot of the cumulative compound return over the period]

Answered Oct 22 '22 by Mabel Villalba