I'm working with a dataframe of prices. I find that the returns, calculated either arithmetically or as log returns, differ from the actual return between the first price and the last. As I see it, they should be the same or differ only by small fractions.
dfset.head()
Open Close High Low Volume
Date_utc
2017-12-01 00:00:00 432.01 434.56 435.09 432.01 781.788110
2017-12-01 00:05:00 434.25 435.82 436.98 434.25 584.017105
2017-12-01 00:10:00 435.81 435.50 436.39 434.80 494.047392
2017-12-01 00:15:00 435.88 435.10 436.07 434.50 527.840340
2017-12-01 00:20:00 434.51 433.50 434.95 432.98 458.557971
dfset.tail()
Open Close High Low Volume
Date_utc
2017-12-21 23:40:00 781.41 781.01 783.46 778.12 792.433089
2017-12-21 23:45:00 779.60 784.76 784.90 778.20 657.316066
2017-12-21 23:50:00 784.83 783.42 784.90 782.22 473.108867
2017-12-21 23:55:00 783.40 786.98 787.00 782.62 1492.764405
2017-12-22 00:00:00 786.96 791.93 792.00 786.86 1745.559100
When calculating returns either by:
dfset['Close'].pct_change().sum()
0.694478597676
or using log returns:
np.log(dfset['Close'] / dfset['Close'].shift(1)).sum()
0.60013897914
The actual overall return, which I consider to be the correct one:
dfset['Close'].iloc[len(dfset) - 1] / dfset['Close'].iloc[0] - 1
0.822372054492
Any ideas why the arithmetic and log return sums are off?
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.21.1
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.5.0
The three operations are quite different. I will use only the tail rows to illustrate.
In the first place:
print( dfset['Close'].pct_change())
2017-12-21 NaN
2017-12-21 0.004801
2017-12-21 -0.001708
2017-12-21 0.004544
2017-12-22 0.006290
Name: Close, dtype: float64
is equivalent to doing:
print(dfset['Close'].diff()/dfset['Close'].shift(1))
2017-12-21 NaN
2017-12-21 0.004801
2017-12-21 -0.001708
2017-12-21 0.004544
2017-12-22 0.006290
Name: Close, dtype: float64
So their sums are equal:
print((dfset['Close'].diff()/dfset['Close'].shift(1)).sum())
0.013927992282837915
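As a quick sanity check (a minimal sketch, assuming dfset is the dataframe shown above), the two series agree element-wise once the leading NaN is dropped, since both compute (P_t - P_{t-1}) / P_{t-1}:
import numpy as np

a = dfset['Close'].pct_change().dropna()
b = (dfset['Close'].diff() / dfset['Close'].shift(1)).dropna()
print(np.allclose(a, b))  # expected: True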
Then there is no reason for:
np.log(dfset['Close'] / dfset['Close'].shift(1))
to be equal to the pct_change:
print(np.log(dfset['Close'] / dfset['Close'].shift(1)))
2017-12-21 NaN
2017-12-21 0.004790
2017-12-21 -0.001709
2017-12-21 0.004534
2017-12-22 0.006270
Name: Close, dtype: float64
The result is similar even though there is no subtraction of 1 and no exponentiation, but that similarity does not make it mathematically the same quantity.
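The resemblance comes from the approximation log(1 + x) ≈ x for small x, which holds well for these 5-minute bars. A quick check on the first non-NaN return of the tail (the value 0.004801 is taken from the output above):
import numpy as np

x = 0.004801        # the simple return at 23:45
print(np.log1p(x))  # ≈ 0.004790, matching the log-return column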
Normally, to avoid divisions I would take logarithms, subtract them, and then apply the exponential to undo them. In any case, to replicate pct_change:
print(np.log((dfset['Close'] / dfset['Close'].shift(1))-1).apply(np.exp))
2017-12-21 NaN
2017-12-21 0.004801
2017-12-21 NaN
2017-12-21 0.004544
2017-12-22 0.006290
Name: Close, dtype: float64
print((np.log(dfset['Close'].diff()) - np.log(dfset['Close'].shift(1))).apply(np.exp))
2017-12-21 NaN
2017-12-21 0.004801
2017-12-21 NaN
2017-12-21 0.004544
2017-12-22 0.006290
Name: Close, dtype: float64
In any case, the logarithm returns NaN for negative values, so the sum of the elements differs from the pct_change sum:
print((np.log(dfset['Close'].diff()) - np.log(dfset['Close'].shift(1))).apply(np.exp).sum())
0.015635520699169063
Finally, the last approach roughly matches the first (note that instead of using .iloc[len(dfset) - 1] to find the last element, you can simply write .iloc[-1]):
print(dfset['Close'].iloc[-1] / dfset['Close'].iloc[0] - 1)
0.013981895238217135
There is a difference in the 5th decimal place between the first approach and this one (about 0.4% relative to the first, or 5.390295537921995e-05 in absolute terms). This gap is not merely float-precision noise when storing the values: it is the compounding effect of successive returns, addressed in the edit below.
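A two-period toy example makes the gap concrete (hypothetical numbers, not from the dataframe): two consecutive returns of 10% sum to 0.20, but the price actually ends up 21% higher, because the second return also applies to the first period's gain:
r1, r2 = 0.10, 0.10
print(r1 + r2)                   # 0.2, what summing pct_change gives
print((1 + r1) * (1 + r2) - 1)   # 0.21 (up to float rounding), the true two-period return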
EDITED: PLOTTING THE COMPOUND INTEREST
You explained in your comments that you want to plot the cumsum, and that is what differs from the total change dfset['Close'].iloc[-1] / dfset['Close'].iloc[0] - 1.
The reason is that the cumulative sum of the percent changes over the range of dates is not equal to the percent change between the first and the last element of the interval.
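In symbols, writing P_t for the close price and r_t = P_t / P_{t-1} - 1 for the per-period return used throughout, the total return over the window is the product, not the sum:
(1 + r_1) * (1 + r_2) * ... * (1 + r_n) - 1 = P_n / P_0 - 1
because each factor 1 + r_t = P_t / P_{t-1} telescopes against the next.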
To get it you have to compound the returns, i.e. use the compound-interest formula for the total increment when there are successive changes between time steps. This way, using the csv from your comment, you will match the change between the first and the last day by doing:
print(((dfset['Close'].pct_change() + 1).cumprod() - 1).iloc[-1])
0.8223720544918787
import matplotlib.pyplot as plt
((dfset['Close'].pct_change() + 1).cumprod() - 1).plot()
plt.show()
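Equivalently, the log returns are the ones that add up: their sum telescopes to log(P_last / P_first), so exponentiating recovers the same total return, and the cumulative sum can be plotted the same way (a sketch, assuming the same dfset, np and plt as above):
log_ret = np.log(dfset['Close'] / dfset['Close'].shift(1))
print(np.exp(log_ret.sum()) - 1)       # ≈ 0.822372054492, same as iloc[-1] / iloc[0] - 1

(np.exp(log_ret.cumsum()) - 1).plot()  # same curve as the cumprod of (1 + pct_change)
plt.show()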