Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sparseIndexTracking 0.1.0 failure in if(): missing value where TRUE/FALSE needed

Tags:

r

na

zoo

xts

I have two datasets. Both are xts objects.

 > dput(head(all_data[,2:3]))
    structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658, 
    0.00267952840300101, 0.00606980273141122, 0.00301659125188536, 
    0.00526315789473686, -0.00149588631264019, 0, -0.00299625468164799
    ), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
    ), .indexTZ = "UTC", tclass = c("POSIXct", "POSIXt"), tzone = "UTC", index = structure(c(1453716060, 
    1453716120, 1453716180, 1453716240, 1453716300, 1453716360), tzone = "UTC", tclass = c("POSIXct", 
    "POSIXt")), .Dim = c(6L, 2L), .Dimnames = list(NULL, c("ClosePrice_AGL.1", 
    "ClosePrice_AMC")))

> dput(head(all_data[,1]))
    structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658, 
    0.00267952840300101), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", 
    "POSIXt"), .indexTZ = "UTC", tclass = c("POSIXct", "POSIXt"), tzone = "UTC", index = structure(c(1453716060, 
    1453716120, 1453716180, 1453716240, 1453716300, 1453716360), tzone = "UTC", tclass = c("POSIXct", 
    "POSIXt")), .Dim = c(6L, 1L), .Dimnames = list(NULL, "ClosePrice_AGL"))

> dput(head(mydata_train[,1:3]))
    structure(c(-0.00155763239875384, -0.0279251170046803, -0.00225324987963404, 
    -0.000479333950998528, 0.0042195179257094, -0.00163456299477571, 
    -0.00526315789473697, -0.0222222222222221, -0.00431818181818178, 
    -0.00218475886131686, 0.00217864923747269, -0.00217391304347825, 
    -0.00651612903225807, -0.0221442950840964, -0.00385177314384377, 
    0.00333333333333319, -0.00365448504983379, -0.0160053351117039
    ), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
    ), tclass = c("POSIXct", "POSIXt"), tzone = "", index = structure(c(1527255180, 
    1527256080, 1527256260, 1527256440, 1527256800, 1527256980), tclass = c("POSIXct", 
    "POSIXt")), .Dim = c(6L, 3L), .Dimnames = list(NULL, c("ACBFF.Close", 
    "APHQF.Close", "WDDMF.Close")))

> dput(head(mydata_train[,4]))
    structure(c(0.00429610046265694, -0.00789733464955589, -0.00165837479270303, 
    -0.00299003322259139, 0.00333222259246901, -0.00199269345732311
    ), class = c("xts", "zoo"), .indexCLASS = c("POSIXct", "POSIXt"
    ), tclass = c("POSIXct", "POSIXt"), tzone = "", index = structure(c(1527255180, 
    1527256080, 1527256260, 1527256440, 1527256800, 1527256980), tclass = c("POSIXct", 
    "POSIXt")), .Dim = c(6L, 1L), .Dimnames = list(NULL, "MJ.Close"))

and I am running spIndexTrack from:

library(sparseIndexTracking)

  test <- spIndexTrack(all_data[,2:3] , all_data[,1], lambda = 1e-7, u = 0.5, measure = 'ete')

  test <- spIndexTrack(mydata_train[,1:3] , mydata_train[,4], lambda = 1e-7, u = 0.5, measure = 'ete')

The second function gives:

 w
ACBFF.Close 0.47083543
APHQF.Close 0.42967200
WDDMF.Close 0.09949257

but the first fails:

Error in if (abs(a + 1) < 1e-06) { : 
  missing value where TRUE/FALSE needed

I have no NAs

all_data <- all_data[complete.cases(all_data),]

any(is.na(all_data) == TRUE)

and all my data is numeric.

storage.mode(my_data) <- "numeric"

I can run a regression through with no errors:

lm(all_data[,1] ~ all_data[,2:3]) 

It's not a result of having 0s in my data frame

all_data[all_data==0] <- 1e-9

Tried wrapping as matrix:

as.matrix(all_data)

Don't know what has gone wrong.


If anyone wants to run the full working example with online google/yahoo data you can with:

library(sparseIndexTracking)
library(xts)
library(gquote)
library(PerformanceAnalytics)

#######################################
############  SET PARAMETERS  #########
#######################################

# Data

minute_interval <- 3
n_periods <- 10000




#######################################
############  GET DATA  #########
#######################################


# pull yahoo / google data for the portfolio (2 stocks)

mydata <- merge(getIntradayPrice('ACBFF', period=n_periods, interval = minute_interval),
                getIntradayPrice('APHQF', period=n_periods, interval = minute_interval),
                getIntradayPrice('WDDMF', period=n_periods, interval = minute_interval),
                getIntradayPrice('MJ', period=n_periods, interval = minute_interval),
                getIntradayPrice('HMLSF', period=n_periods, interval = minute_interval)
)

#select just closing prices
mydata <- mydata[,c(1,6, 11, 16)]


# remove NA values

mydata <- mydata[complete.cases(mydata),]

# replace all with returns of the two series - can use 'log' or 'discrete'
mydata <- Return.calculate(mydata, method = 'discrete')


# remove NA values again

mydata <- mydata[complete.cases(mydata),]

## split set into first 50% training data second 50% test data

mydata_train <- mydata[1:floor(nrow(mydata) * 0.5),]
mydata_test <-  mydata[floor(nrow(mydata) * 0.5 +1):nrow(mydata),]


# remove NA values again

mydata_train <- mydata_train[complete.cases(mydata_train),]

# Generate weights see : https://cran.r-project.org/web/packages/sparseIndexTracking/vignettes/SparseIndexTracking-vignette.pdf



w_ete <- spIndexTrack(mydata_train[,1:3] , mydata_train[,4], lambda = 1e-7, u = 1.5, measure = 'ete')
w_ete

I'm stuck. Not sure if anyone can help. Thanks in advance.

like image 561
Alex Bădoi Avatar asked Jul 26 '18 20:07

Alex Bădoi


1 Answers

Data preparation

spIndexTrack first applies as.matrix on your input object as the function expects a matrix and a vector

 spIndexTrack(X, r, lambda, u = 1, measure = c("ete", "dr", "hete", "hdr"),
   hub = NULL, w0 = NULL, thres = 1e-09)

Arguments:

   X: m-by-n matrix of net returns (m samples, n assets).

   r: m dimensional vector of the net returns of the index.

So for simplicity you can just dput(datamat <- as.matrix(all_data)), which is

datamat <- 
structure(c(0.00108166576527857, 0.00324149108589955, 0, 0, 0.00484652665589658, 
0.00267952840300101, 0.00108166576527857, 0.00324149108589955, 
0, 0, 0.00484652665589658, 0.00267952840300101, 0.00606980273141122, 
0.00301659125188536, 0.00526315789473686, -0.00149588631264019, 
0, -0.00299625468164799), .Dim = c(6L, 3L), .Dimnames = list(
c("2016-01-25 10:01:00", "2016-01-25 10:02:00", "2016-01-25 10:03:00", 
"2016-01-25 10:04:00", "2016-01-25 10:05:00", "2016-01-25 10:06:00"
), c("ClosePrice_AGL", "ClosePrice_AGL.1", "ClosePrice_AMC"
)))

Then set

X <- datamat[, 2:3]
r <- datamat[, 1]

Identify the problem

Note that X[, 1] and r are identical:

identical(X[, 1], r)
#[1] TRUE

You can run an lm(r ~ 0 + X), but as r can be perfectly explained by X[, 1], coefficient for X[, 2] ends up being zero:

XClosePrice_AGL.1    XClosePrice_AMC  
        1.000e+00         -6.346e-19  

According to SparseIndexTracking vignette: explanation of the algorithms, spIndexTrack is doing a contrained least squares with lasso-alike regularization.

  • Regression coefficients must be non-negative and sum up to 1, so that they give a valid portfolio;
  • Regularization imposes a sparse variable selection.

For the above X and r, the non-contrained ordinary least squares already gives you the best sparse portfolio: (1, 0). What else would you expect spIndexTrack to do?

## your call to `spIndexTrack`
spIndexTrack(X, r, lambda = 1e-7, u = 0.5, measure = 'ete')

In particular, you set u = 0.5, which requires any portfolio weight to be no larger than 0.5. Given that the optimal weight vector are (1, 0), the algorithm might have a big struggle.

I think you'd better not change the default value u = 1. This default means that the algorithm is able to shrink all but one feature to zero.

Now even the following would fail.

spIndexTrack(X, r, lambda = 1e-7, u = 1, measure = 'ete')

Therefore I plan to set the regularization to zero, i.e., lambda = 0, but it still fails. I have to set it to a tiny negative to make it work:

spIndexTrack(X, r, lambda = -1e-16, u = 1, measure = 'ete')
#                            w
#ClosePrice_AGL.1 9.999996e-01
#ClosePrice_AMC   3.633128e-07

And as you can see, the result is close to (1, 0).


Lack of numerical stability of the package

I checked this package on CRAN. As of today (2018-07-30) it is still in its first release (version 0.1.0).

By glancing the R code of spIndexTrack I found that basically no test for numerical stability is made. The error you get is just because you get an 0 / 0 hence an NaN for variable a.

I am not interested to go through the mathematical algorithms to think about what prior numerical tests should be made. The following suggestion is the most robust, but probably too restricted, but the other working example in your question (associated with mydata_train) satisfies this condition.

  • X has full rank, that is, qr(X)$rank equals to ncol(X);
  • cbind(X, r) has full rank, that is, lm(r ~ 0 + X) does not end up with all-zero residuals.

It is the package author's responsibility to consider this. But in the end, the function should first check for numerical failure and return an early informative error.

like image 140
Zheyuan Li Avatar answered Oct 15 '22 12:10

Zheyuan Li