I have two datasets. Both are xts objects.
> dput(head(all_data[,2:3]))
structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658,
0.00267952840300101, 0.00606980273141122, 0.00301659125188536,
0.00526315789473686, -0.00149588631264019, 0, -0.00299625468164799
), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
), .indexTZ = "UTC", tclass = c("POSIXct", "POSIXt"), tzone = "UTC", index = structure(c(1453716060,
1453716120, 1453716180, 1453716240, 1453716300, 1453716360), tzone = "UTC", tclass = c("POSIXct",
"POSIXt")), .Dim = c(6L, 2L), .Dimnames = list(NULL, c("ClosePrice_AGL.1",
"ClosePrice_AMC")))
> dput(head(all_data[,1]))
structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658,
0.00267952840300101), class = c("xts", "zoo"), .indexCLASS = c("POSIXct",
"POSIXt"), .indexTZ = "UTC", tclass = c("POSIXct", "POSIXt"), tzone = "UTC", index = structure(c(1453716060,
1453716120, 1453716180, 1453716240, 1453716300, 1453716360), tzone = "UTC", tclass = c("POSIXct",
"POSIXt")), .Dim = c(6L, 1L), .Dimnames = list(NULL, "ClosePrice_AGL"))
> dput(head(mydata_train[,1:3]))
structure(c(-0.00155763239875384, -0.0279251170046803, -0.00225324987963404,
-0.000479333950998528, 0.0042195179257094, -0.00163456299477571,
-0.00526315789473697, -0.0222222222222221, -0.00431818181818178,
-0.00218475886131686, 0.00217864923747269, -0.00217391304347825,
-0.00651612903225807, -0.0221442950840964, -0.00385177314384377,
0.00333333333333319, -0.00365448504983379, -0.0160053351117039
), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
), tclass = c("POSIXct", "POSIXt"), tzone = "", index = structure(c(1527255180,
1527256080, 1527256260, 1527256440, 1527256800, 1527256980), tclass = c("POSIXct",
"POSIXt")), .Dim = c(6L, 3L), .Dimnames = list(NULL, c("ACBFF.Close",
"APHQF.Close", "WDDMF.Close")))
> dput(head(mydata_train[,4]))
structure(c(0.00429610046265694, -0.00789733464955589, -0.00165837479270303,
-0.00299003322259139, 0.00333222259246901, -0.00199269345732311
), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
), tclass = c("POSIXct", "POSIXt"), tzone = "", index = structure(c(1527255180,
1527256080, 1527256260, 1527256440, 1527256800, 1527256980), tclass = c("POSIXct",
"POSIXt")), .Dim = c(6L, 1L), .Dimnames = list(NULL, "MJ.Close"))
and I am running spIndexTrack from the sparseIndexTracking package:
library(sparseIndexTracking)
test <- spIndexTrack(all_data[,2:3] , all_data[,1], lambda = 1e-7, u = 0.5, measure = 'ete')
test <- spIndexTrack(mydata_train[,1:3] , mydata_train[,4], lambda = 1e-7, u = 0.5, measure = 'ete')
The second call gives:
w
ACBFF.Close 0.47083543
APHQF.Close 0.42967200
WDDMF.Close 0.09949257
but the first fails:
Error in if (abs(a + 1) < 1e-06) { :
missing value where TRUE/FALSE needed
I have no NAs:
all_data <- all_data[complete.cases(all_data),]
any(is.na(all_data) == TRUE)
and all my data is numeric:
storage.mode(all_data) <- "numeric"
I can run a regression with no errors:
lm(all_data[,1] ~ all_data[,2:3])
It's not a result of having 0s in my data, since I replaced them:
all_data[all_data==0] <- 1e-9
I also tried coercing to a matrix:
as.matrix(all_data)
I don't know what has gone wrong.
If anyone wants to run the full working example with online Google/Yahoo data, you can with:
library(sparseIndexTracking)
library(xts)
library(gquote)
library(PerformanceAnalytics)
#######################################
############ SET PARAMETERS #########
#######################################
# Data
minute_interval <- 3
n_periods <- 10000
#######################################
############ GET DATA #########
#######################################
# pull Yahoo/Google data for the portfolio (5 tickers)
mydata <- merge(getIntradayPrice('ACBFF', period=n_periods, interval = minute_interval),
getIntradayPrice('APHQF', period=n_periods, interval = minute_interval),
getIntradayPrice('WDDMF', period=n_periods, interval = minute_interval),
getIntradayPrice('MJ', period=n_periods, interval = minute_interval),
getIntradayPrice('HMLSF', period=n_periods, interval = minute_interval)
)
# select just the closing prices
mydata <- mydata[,c(1,6, 11, 16)]
# remove NA values
mydata <- mydata[complete.cases(mydata),]
# replace prices with returns - can use 'log' or 'discrete'
mydata <- Return.calculate(mydata, method = 'discrete')
# remove NA values again
mydata <- mydata[complete.cases(mydata),]
## split set into first 50% training data second 50% test data
mydata_train <- mydata[1:floor(nrow(mydata) * 0.5),]
mydata_test <- mydata[floor(nrow(mydata) * 0.5 +1):nrow(mydata),]
# remove NA values again
mydata_train <- mydata_train[complete.cases(mydata_train),]
# Generate weights see : https://cran.r-project.org/web/packages/sparseIndexTracking/vignettes/SparseIndexTracking-vignette.pdf
w_ete <- spIndexTrack(mydata_train[,1:3] , mydata_train[,4], lambda = 1e-7, u = 1.5, measure = 'ete')
w_ete
I'm stuck. Not sure if anyone can help. Thanks in advance.
Data preparation
spIndexTrack first applies as.matrix to your input objects, as the function expects a matrix and a vector:
spIndexTrack(X, r, lambda, u = 1, measure = c("ete", "dr", "hete", "hdr"),
             hub = NULL, w0 = NULL, thres = 1e-09)
Arguments:
X: m-by-n matrix of net returns (m samples, n assets).
r: m dimensional vector of the net returns of the index.
So for simplicity you can just run dput(datamat <- as.matrix(all_data)), which is
datamat <-
structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658,
0.00267952840300101, 0.00108166576527857, 0.00324149108589955,
0, 0, 0.00484652665589658, 0.00267952840300101, 0.00606980273141122,
0.00301659125188536, 0.00526315789473686, -0.00149588631264019,
0, -0.00299625468164799), .Dim = c(6L, 3L), .Dimnames = list(
c("2016-01-25 10:01:00", "2016-01-25 10:02:00", "2016-01-25 10:03:00",
"2016-01-25 10:04:00", "2016-01-25 10:05:00", "2016-01-25 10:06:00"
), c("ClosePrice_AGL", "ClosePrice_AGL.1", "ClosePrice_AMC"
)))
Then set
X <- datamat[, 2:3]
r <- datamat[, 1]
Identify the problem
Note that X[, 1] and r are identical:
identical(X[, 1], r)
#[1] TRUE
You can run lm(r ~ 0 + X), but since r is perfectly explained by X[, 1], the coefficient for X[, 2] ends up being zero:
XClosePrice_AGL.1   XClosePrice_AMC
        1.000e+00        -6.346e-19
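For a reproducible check with the X and r defined above (the qr() rank test is my own addition, not something spIndexTrack does internally):
coef(lm(r ~ 0 + X))   # X[, 1] takes essentially all the weight; X[, 2] is numerically zero
qr(cbind(X, r))$rank  # 2 rather than 3: [X, r] is rank-deficient because r duplicates X[, 1]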
According to the SparseIndexTracking vignette (explanation of the algorithms), spIndexTrack does a constrained least squares with lasso-like regularization. For the above X and r, unconstrained ordinary least squares already gives you the best sparse portfolio, (1, 0). What else would you expect spIndexTrack to do?
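For reference, the optimization the vignette formulates for measure = 'ete' is roughly the following (my paraphrase; in practice the package replaces the l0 norm, written ||w||_0 below, with a smooth approximation):
$$
\min_{w}\ \frac{1}{T}\lVert r - Xw \rVert_2^2 + \lambda \lVert w \rVert_0
\quad \text{s.t.}\quad \mathbf{1}^\top w = 1,\ \ 0 \le w \le u\,\mathbf{1},
$$
so lambda controls how many assets enter the portfolio, while u caps each individual weight.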
## your call to `spIndexTrack`
spIndexTrack(X, r, lambda = 1e-7, u = 0.5, measure = 'ete')
In particular, you set u = 0.5, which requires every portfolio weight to be no larger than 0.5. Since the weights are constrained to sum to one, with only two assets that leaves (0.5, 0.5) as the only feasible portfolio, while the optimal weight vector is (1, 0), so the algorithm struggles badly. I think you'd better not change the default value u = 1: this default means the algorithm is free to shrink all but one weight to zero.
Now even the following fails:
spIndexTrack(X, r, lambda = 1e-7, u = 1, measure = 'ete')
So I tried setting the regularization to zero, i.e., lambda = 0, but it still fails. I had to set it to a tiny negative value to make it work:
spIndexTrack(X, r, lambda = -1e-16, u = 1, measure = 'ete')
# w
#ClosePrice_AGL.1 9.999996e-01
#ClosePrice_AMC 3.633128e-07
And as you can see, the result is close to (1, 0).
Lack of numerical stability of the package
I checked this package on CRAN. As of today (2018-07-30) it is still in its first release (version 0.1.0).
Glancing through the R code of spIndexTrack, I found that basically no test for numerical stability is made. The error you get is simply because a 0 / 0 produces an NaN for the variable a.
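A minimal illustration of how that NaN triggers the error (a here is just a stand-in for the package's internal variable):
a <- 0 / 0                 # NaN
if (abs(a + 1) < 1e-6) 0   # Error: missing value where TRUE/FALSE needed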
I am not interested in going through the mathematical algorithms to work out which prior numerical tests should be made. The following suggestion is the most robust, though probably too restrictive; note that the other, working example in your question (the one with mydata_train) satisfies both conditions (a sketch of such a pre-check is given after the list):
- X has full rank, that is, qr(X)$rank equals ncol(X);
- cbind(X, r) has full rank, that is, lm(r ~ 0 + X) does not end up with all-zero residuals.
It is the package author's responsibility to consider this. In the end, the function should first check for numerical failure and return an early, informative error.
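A hypothetical pre-flight check implementing these two conditions (my own sketch, not part of sparseIndexTracking; the function name and tolerance are made up):
check_tracking_inputs <- function(X, r, tol = 1e-12) {
  X <- as.matrix(X)
  r <- as.numeric(r)
  # condition 1: the asset return matrix must have full column rank
  if (qr(X)$rank < ncol(X))
    stop("X is rank-deficient: some asset columns are linearly dependent.")
  # condition 2: the index must not be perfectly explained by the assets
  if (all(abs(residuals(lm(r ~ 0 + X))) < tol))
    stop("r is (numerically) perfectly explained by X: the tracking problem is degenerate.")
  invisible(TRUE)
}
check_tracking_inputs(mydata_train[, 1:3], mydata_train[, 4])  # passes
check_tracking_inputs(all_data[, 2:3], all_data[, 1])          # stops: r duplicates X[, 1]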