from datetime import datetime
from pandas_datareader.data import DataReader  # pandas.io.data was removed; use pandas-datareader
from numpy import cumsum, log, polyfit, sqrt, std, subtract
from numpy.random import randn
def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts"""
    # Create the range of lag values
    lags = range(2, 100)
    # Calculate the array of the variances of the lagged differences
    # Here it calculates the variances, but why does it use the
    # standard deviation and then take a square root of it?
    tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    # Use a linear fit to estimate the Hurst Exponent
    poly = polyfit(log(lags), log(tau), 1)
    # Return the Hurst exponent from the polyfit output
    return poly[0]*2.0
# Download the stock price series from Yahoo
aapl = DataReader("AAPL", "yahoo", datetime(2012,1,1), datetime(2015,9,18))
# Call the function
hurst(aapl['Adj Close'])
In this code for estimating the Hurst exponent, when we want to calculate the variance of the lagged differences, why do we still use the standard deviation and then take a square root of it? I have been confused about this for a long time, and I don't understand why others don't share the same confusion. Am I misunderstanding the math behind it? Thanks!
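For reference, here is a self-contained sanity check of the function on synthetic series with known Hurst behaviour, so no Yahoo download is needed. The series constructions and the offset of 2000 are illustrative assumptions, not part of the original code:

```python
import numpy as np

def hurst(ts):
    """Returns the Hurst Exponent of the time series vector ts (same as above)."""
    lags = range(2, 100)
    tau = [np.sqrt(np.std(np.subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    poly = np.polyfit(np.log(lags), np.log(tau), 1)
    return poly[0] * 2.0

np.random.seed(0)
# Random-walk-like log prices: Hurst should be near 0.5
gbm = np.log(np.cumsum(np.random.randn(100000)) + 2000)
# Mean-reverting (white-noise) log prices: Hurst should be near 0.0
mr = np.log(np.random.randn(100000) + 2000)
# Trending log prices: Hurst should approach 1.0
tr = np.log(np.cumsum(np.random.randn(100000) + 1) + 2000)

print("GBM:", hurst(gbm))  # close to 0.5
print("MR: ", hurst(mr))   # close to 0.0
print("TR: ", hurst(tr))   # close to 1.0
```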
All that is going on here is a variation in math notation.
I'll define
d = subtract(ts[lag:], ts[:-lag])
Then it is clear that
np.log(np.std(d)**2) == np.log(np.var(d))
np.log(np.std(d)) == .5*np.log(np.var(d))
Then you have the equivalence
2*np.log(np.sqrt(np.std(d))) == .5*np.log(np.var(d))
The slope returned by polyfit scales proportionally with its y input, so the extra factor of 0.5 introduced by the square root is exactly cancelled by the final multiplication by 2.
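These identities can be verified numerically; here d is just a random stand-in for the lagged differences:

```python
import numpy as np

np.random.seed(1)
d = np.random.randn(10000)  # stand-in for subtract(ts[lag:], ts[:-lag])

# log of the squared std equals log of the variance
assert np.isclose(np.log(np.std(d)**2), np.log(np.var(d)))
# log of the std is half the log of the variance
assert np.isclose(np.log(np.std(d)), 0.5 * np.log(np.var(d)))
# hence twice the log of sqrt(std) is also half the log of the variance
assert np.isclose(2 * np.log(np.sqrt(np.std(d))), 0.5 * np.log(np.var(d)))
print("all identities hold")
```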
I'm just as confused. I don't understand where the sqrt of the std comes from either, and I spent 3 days trying to figure it out. In the end I noticed that QuantStart credits Dr. Tom Starke, who uses slightly different code. Dr. Tom Starke credits Dr. Ernie Chan, and going to his blog, I was able to find enough information to put together my own code from his principles. This doesn't use sqrt, uses variance instead of std, and uses a 2.0 divisor at the end instead of a 2.0 multiplier. In the end, it seems to give the same results as the QuantStart code you posted, but I am able to understand it from first principles, which I guess is important. I put together a Jupyter Notebook which makes it clearer, but I'm not sure if I can post that here, so I will try to explain as best I can. Code is pasted first, then an explanation.
from numpy import subtract, var, log10, polyfit

lags = range(2, 100)

def hurst_ernie_chan(p):
    variancetau = []
    tau = []
    for lag in lags:
        # Write the different lags into a vector to compute a set of tau or lags
        tau.append(lag)
        # Compute the price difference over the given lag;
        # call this pp, the price difference
        pp = subtract(p[lag:], p[:-lag])
        variancetau.append(var(pp))
    # We now have a set of tau (lags) and a corresponding set of variances.
    # print(tau)
    # print(variancetau)
    # Plot the log of those variances against the log of tau and get the slope
    m = polyfit(log10(tau), log10(variancetau), 1)
    hurst = m[0] / 2
    return hurst
Dr. Chan doesn't give any code on that page (I believe he works in MATLAB, not Python, anyway). Hence I needed to put together my own code from the notes he gives in his blog and the answers he gives to questions posed there.
Dr. Chan states that if z is the log price, then volatility, sampled at intervals of τ, is volatility(τ) = √(Var(z(t) - z(t-τ))). To me, another way of describing volatility is standard deviation, so std(τ) = √(Var(z(t) - z(t-τ))).
Since std is just the square root of the variance, squaring both sides gives var(τ) = Var(z(t) - z(t-τ)).
Dr Chan then states: In general, we can write Var(τ) ∝ τ^(2H) where H is the Hurst exponent
Hence (Var(z(t)-z(t-τ))) ∝ τ^(2H)
Taking the log of each side we get log (Var(z(t)-z(t-τ))) ∝ 2H log τ
[ log(Var(z(t) - z(t-τ))) / log τ ] / 2 ∝ H (gives the Hurst exponent), where we know that the term in square brackets on the far left is the slope of a log-log plot of tau against the corresponding set of variances.
If you run that function and compare the answers to the QuantStart function, they should be the same. Not sure if that helped.
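As a sketch of that comparison (both estimators are re-declared here in condensed form so the snippet is self-contained), the two give essentially identical values on a simulated random walk:

```python
import numpy as np

lags = range(2, 100)

def hurst_quantstart(ts):
    # sqrt of std, natural log, slope * 2
    tau = [np.sqrt(np.std(np.subtract(ts[lag:], ts[:-lag]))) for lag in lags]
    return np.polyfit(np.log(lags), np.log(tau), 1)[0] * 2.0

def hurst_ernie_chan(p):
    # plain variance, log10, slope / 2
    variancetau = [np.var(np.subtract(p[lag:], p[:-lag])) for lag in lags]
    return np.polyfit(np.log10(lags), np.log10(variancetau), 1)[0] / 2.0

np.random.seed(42)
walk = np.cumsum(np.random.randn(100000))  # random walk, true H = 0.5
print(hurst_quantstart(walk), hurst_ernie_chan(walk))  # agree, near 0.5
```

The agreement is not a coincidence: rescaling both fit inputs by 1/ln(10) (log10 vs. natural log) leaves the least-squares slope unchanged, and the extra sqrt scales the slope by exactly the factor that the final ×2 vs. ÷2 undoes.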
As per the intuitive definition taken from Ernest Chan's "Algorithmic Trading" (p. 44):
Intuitively speaking, a “stationary” price series means that the prices diffuse from its initial value more slowly than a geometric random walk would.
one would want to check the variance of the time series at increasing lags against the lag(s). This is because for a normal distribution -- and log prices are believed to be normal (to a certain extent) -- the variance of a sum of normal distributions is the sum of the constituents' variances.
As per Ernest Chan's citation, for mean reverting processes the realized variance will be less than theoretically projected.
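That variance-additivity point can be seen directly on a simulated random walk (a sketch of my own, not from Chan's book): the lag-k difference is a sum of k independent unit-variance increments, so its variance should be close to k.

```python
import numpy as np

np.random.seed(7)
steps = np.random.randn(200000)  # i.i.d. normal increments, Var = 1
walk = np.cumsum(steps)          # random walk built from those increments

# Variance of the lag-k difference ~ k * Var(step) = k
for k in (1, 10, 100):
    v = np.var(walk[k:] - walk[:-k])
    print(k, v)  # v close to k
```

For a mean-reverting series the measured variance grows more slowly than this, which is exactly the deficit the Hurst exponent quantifies.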
Putting this in code:
import numpy as np
from scipy.stats import linregress

def hurst(p, l):
    """
    Arguments:
        p: ndarray -- the price series to be tested
        l: list of integers or an integer -- lag(s) to test for mean reversion
    Returns:
        Hurst exponent
    """
    if isinstance(l, int):
        lags = [1, l]
    else:
        lags = l
    assert lags[-1] >= 2, "Lag in prices must be greater than or equal to 2"
    print(f"Price lags of {lags[1:]} are included")
    lp = np.log(p)
    var = [np.var(lp[lag:] - lp[:-lag]) for lag in lags]
    hr = linregress(np.log(lags), np.log(var))[0] / 2
    return hr
The code posted by OP is correct.
The reason for the confusion is that it takes a square root first, and then counters it by multiplying the slope (returned by polyfit) by 2.
For a more detailed explanation, continue reading.
tau is calculated with an "extra" square root, and then its log is taken. log(sqrt(x)) = log(x^0.5) = 0.5*log(x) (this is the key). polyfit therefore conducts the fitting with y multiplied by an "extra" 0.5, so the slope it returns is also multiplied by exactly 0.5. Returning twice that (return poly[0]*2.0) counters the (seemingly) extra 0.5.
Hope this makes it clearer.
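The cancellation is easy to demonstrate numerically: fit once with the "extra" square root and double the slope, and once on log(std) directly with no doubling, and the two results match.

```python
import numpy as np

lags = range(2, 100)
np.random.seed(3)
ts = np.cumsum(np.random.randn(100000))  # random walk, true H = 0.5

stds = [np.std(ts[lag:] - ts[:-lag]) for lag in lags]

# OP's version: log(sqrt(std)) halves every y value, halving the slope...
with_sqrt = np.polyfit(np.log(lags), np.log(np.sqrt(stds)), 1)[0] * 2.0
# ...so multiplying by 2 restores it, matching a fit of log(std) directly
without_sqrt = np.polyfit(np.log(lags), np.log(stds), 1)[0]

print(with_sqrt, without_sqrt)  # equal up to floating point
```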