I have some high-dimensional repeated-measures data, and I am interested in fitting a random forest model to investigate the suitability and predictive utility of such models. Specifically, I am trying to implement the methods in the LongituRF package. The methods behind this package are detailed here:
Capitaine, L., et al. Random forests for high-dimensional longitudinal data. Stat Methods Med Res (2020) doi:10.1177/0962280220946080.
Conveniently, the authors provide some useful data-generating functions for testing. So we have
install.packages("LongituRF")
library(LongituRF)
Let's generate some data with DataLongGenerator()
which takes as arguments n=sample size, p=number of predictors and G=number of predictors with temporal behavior.
my_data <- DataLongGenerator(n=50,p=6,G=6)
my_data
is a list of what you'd expect: Y (response vector), X (matrix of fixed-effects predictors), Z (matrix of random-effects predictors), id (vector of sample identifiers) and time (vector of time measurements). To fit the random forest model, simply run
model <- REEMforest(X=my_data$X,Y=my_data$Y,Z=my_data$Z,time=my_data$time,
id=my_data$id,sto="BM",mtry=2)
This takes about 50 seconds here, so bear with me.
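Out of curiosity, here is roughly how I poke at the fitted object (element names follow the LongituRF documentation and may differ by version; the $forest element should be a plain randomForest object, so variable importance can be pulled the usual way):
str(model, max.level = 1)       # list the components of the fitted (S)REEMforest
head(model$forest$importance)   # variable importance from the underlying random forest
head(model$random_effects)      # estimated random effects per individual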
So far so good. Now I'm clear about all the parameters here except for Z. What is Z when I go to fit this model on my actual data?
Looking at my_data$Z:
dim(my_data$Z)
[1] 471 2
head(my_data$Z)
[,1] [,2]
[1,] 1 1.1128914
[2,] 1 1.0349287
[3,] 1 0.7308948
[4,] 1 1.0976203
[5,] 1 1.3739856
[6,] 1 0.6840415
Each row looks like an intercept term (i.e. 1) followed by a value drawn from a uniform distribution via runif().
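For what it's worth, the dimensions line up with one row per observation:
nrow(my_data$Z) == length(my_data$Y)   # TRUE: N = 471 observations here
ncol(my_data$Z)                        # q = 2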
The documentation of REEMforest()
indicates that "Z [matrix]: A Nxq matrix containing the q predictor of the random effects." How is this matrix to be specified when using actual data?
My understanding is that traditionally Z is simply a one-hot (binary) encoding of the group variables (e.g. as described here), so Z from the DataLongGenerator() should be an n x G (471 x 6) sparse matrix, no?
Clarity on how to specify the Z parameter with actual data would be appreciated.
EDIT
My specific example is as follows: I have a response variable (Y). Samples (identified with id) were randomly assigned to an intervention (I, intervention or no intervention), and I have a high-dimensional set of features (X). Features and response were measured at two timepoints (Time, baseline and endpoint). I am interested in predicting Y using X and I. I am also interested in extracting which features were most important to predicting Y (the same way Capitaine et al. did with HIV in their paper).
I will call REEMforest() as follows:
REEMforest(X=cbind(X,I), Y=Y, time=Time, id=id)
What should I use for Z?
When the function DataLongGenerator() creates Z, it's random uniform data in a matrix. The actual coding is
Z <- as.matrix(cbind(rep(1, length(f)), 2 * runif(length(f))))
where length(f) is the total number of observations (rows) being simulated. In your example, you used n = 50 participants with 6 predictors (all 6 with temporal behavior), which gave 471 rows.
From what I can gather, since this function is designed to simulate longitudinal data, this is a simulation of random effects on that data. If you were working with real data, I think it would be a lot easier to understand.
While this example doesn't use RE-EM forests, I thought it was pretty clear because it uses tangible elements as an example. You can read about random effects here: https://ademos.people.uic.edu/Chapter17.html#32_fixed_effects
Look at section 3.2, Fixed v. Random Effects, to see examples of random effects that you could intentionally model if you were working with real data.
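To make that concrete with the variables from your EDIT (this is just my sketch, not something from the paper or the package docs): if the only effects you want to vary by individual are an intercept and a slope on time, then Z is simply those columns stacked per observation, which satisfies the N x q definition:
Z <- cbind(rep(1, length(Y)),   # column 1: random intercept
           Time)                # column 2: random slope on the time variable
# or, for a random intercept only, an N x 1 matrix of ones:
Z <- matrix(1, nrow = length(Y), ncol = 1)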
Another example: you're running a cancer drug trial. You've collected patient measurements on a weekly basis (weight, temperature, and a CBC panel) and you have different drug-administration groups: 1 unit per day, 2 units per day, and 3 units per day.
In traditional regression, you'd model these variables to determine how accurately the model identifies the outcome. The fixed effects are the explained variance, or R². So if you have .86, or 86%, then 14% is unexplained. It could be an interaction causing the noise: the unexplained variance between a perfect fit and what the model determined was the outcome.
Let's say the patients who had really low white blood cell counts and were overweight responded far better to the treatment. Or perhaps the patients with red hair responded better; that's not in your data. In terms of longitudinal data, let's say that the relationship (the interaction) only appears after some measure of time passes.
You can try to model different relationships to evaluate the random interactions in the data. I think you'd be better off with one of the many ways to evaluate interactions systematically than with a random attempt to identify random effects, though.
EDITED: I started to write this in the comments with @JustGettinStarted, but it was too much.
Without the background, the easiest way to achieve this would be to run something like REEMtree::REEMtree(), setting the random effects argument to random = ~1 | time/id. After it runs, extract the random effects it has calculated. You can do it like this:
library(dplyr)   # pipes, mutate, arrange, left_join
library(tidyr)   # separate

# fit is the REEMtree::REEMtree() model described above
data2 <- data %>%
  mutate(oOrder = row_number()) %>%   # identify original order of the data
  arrange(time, id) %>%
  mutate(zOrder = row_number())       # the random effects will be in order by time then id

extRE <- data.frame(time = attributes(fit$RandomEffects[2][["id"]])[["row.names"]]) %>%
  separate(col = time,
           into = c("time", "id"),
           sep = "\\/") %>%
  mutate(Z = fit$RandomEffects[[2]] %>% unlist(),
         id = as.integer(id),
         time = time)   # adjust time's type here if needed so the join keys match your data

data2 <- data2 %>% left_join(extRE) %>% arrange(oOrder)   # return to original order

Z <- cbind(rep(1, times = nrow(data2)), data2$Z)
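With that Z in hand, the REEMforest() call (using the variables from the question's EDIT; sto and mtry are just placeholders) would look something like:
model <- REEMforest(X = cbind(X, I), Y = Y, Z = Z, time = Time, id = id, sto = "BM", mtry = 2)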
Alternatively, I suggest that you start with random generation of the random effects. The values you start with are just a jumping-off point; the random effects at the end will be different.
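If you go that route, a starting Z can be built the same way DataLongGenerator() builds it, i.e. an intercept column plus a uniform draw (data here is the same data frame used above):
Z <- as.matrix(cbind(rep(1, nrow(data)), 2 * runif(nrow(data))))   # random starting values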
No matter how many ways I tried to use LongituRF::REEMforest()
with real data, I ran into errors. I had an uninvertible matrix failure every time.
I noticed that the data generated by DataLongGenerator()
comes in order by id, then time. I tried to order the data (and Z) that way, but it didn't help. When I extracted all the functionality out of the package LongituRF
, I used the MERF (mixed-effects random forest) function with no problems. Even in the research paper, that method was solid. Just thought it was worth mentioning.
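For what it's worth, the MERF() call in LongituRF takes (as far as I can tell) the same arguments as REEMforest(), so swapping the two is mostly a one-line change:
model_merf <- MERF(X = my_data$X, Y = my_data$Y, Z = my_data$Z, time = my_data$time, id = my_data$id, sto = "BM", mtry = 2)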