I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.
Data -
# A tibble: 81,334 x 4
  latitude longitude close_date          close_price
     <dbl>     <dbl> <dttm>                    <dbl>
1     36.4     -98.7 2014-08-05 06:34:00     147504.
2     36.6     -97.9 2014-08-12 23:48:00     137401.
3     36.6     -97.9 2014-08-09 04:00:40     239105.
Model -
library(caret)
library(dplyr)  # provides %>%

# 80/20 random split, stratified on the outcome
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

# KNN regression tuned with 10-fold cross-validation
model <- train(
  close_price ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)
My problem is time leakage. I am making predictions for a house using other houses that closed afterwards, and in the real world I wouldn't have access to that information.
I want to apply a rule to the model that says, for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.
Is it possible to prevent this time leakage, either in caret or in other libraries for KNN (like class and kknn)?
In caret, createTimeSlices implements a variation of cross-validation adapted to time series (avoiding time leakage by rolling the forecasting origin).
It is documented in the caret help page for the function (?createTimeSlices).
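To see what the rolling origin looks like, here is a minimal sketch on a toy vector of six observations, assuming caret is installed. The toy values and the name slices are made up for illustration. Each training slice only contains rows that come before its test row.

```r
library(caret)

# Six observations, assumed to be already sorted by time (toy values)
y <- c(10, 20, 30, 40, 50, 60)

# Growing training window (fixedWindow = FALSE), one-step-ahead test set
slices <- createTimeSlices(y, initialWindow = 3, horizon = 1, fixedWindow = FALSE)

# Row indices used for training and testing in each resample
str(slices$train)
str(slices$test)
```

With these settings the first resample trains on rows 1 to 3 and tests on row 4, the second trains on rows 1 to 4 and tests on row 5, and so on.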
In your case, depending on your precise needs, you could use something like this for a proper cross-validation:
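library(caret)
library(dplyr)

# Sort chronologically so every training slice precedes its test slice
your_data <- your_data %>% arrange(close_date)

# Rolling-origin slices: lists of row indices in $train and $test
tr_ctrl <- createTimeSlices(
  your_data$close_price,
  initialWindow = 10,
  horizon = 1,
  fixedWindow = FALSE)

model <- train(
  close_price ~ ., data = your_data, method = "knn",
  # train() expects a trainControl object, so pass the slices as resampling indices
  trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
  preProcess = c("center", "scale"),
  tuneLength = 10
)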
EDIT:
If you have ties in the dates and want to avoid having deals closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:
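library(lubridate)  # for as_date()

# Keep only training rows that closed on an earlier calendar day than every test row
filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr])
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)
  i_tr[tr_is_ok]
}
# SIMPLIFY = FALSE keeps the result a list of index vectors
tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test, SIMPLIFY = FALSE)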
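The time slices above control how the model is evaluated. If you want the strict rule from the question applied at prediction time (for each house, only use houses that closed before it), one option is to write the neighbour search by hand. The following is a rough base R sketch, not a caret feature. The function name knn_past_only, the choice of latitude and longitude as the only distance features, and the toy data are all assumptions for illustration. It loops over every row, so it will be slow on 81,334 rows.

```r
# Hypothetical helper: predict each house's price as the mean price of the
# k nearest houses (by scaled latitude/longitude) that closed strictly before it.
knn_past_only <- function(df, k = 5) {
  df <- df[order(df$close_date), ]
  # Scale coordinates so both contribute comparably to the distance.
  # Assumption: scaling uses all rows, which is itself a mild form of leakage.
  xy <- scale(as.matrix(df[, c("latitude", "longitude")]))
  pred <- rep(NA_real_, nrow(df))
  for (i in seq_len(nrow(df))) {
    past <- which(df$close_date < df$close_date[i])
    if (length(past) < k) next  # not enough history yet, leave NA
    d <- sqrt((xy[past, 1] - xy[i, 1])^2 + (xy[past, 2] - xy[i, 2])^2)
    nearest <- past[order(d)[seq_len(k)]]
    pred[i] <- mean(df$close_price[nearest])
  }
  df$pred <- pred
  df
}

# Toy data (made-up prices) to show the behaviour
toy <- data.frame(
  latitude    = c(36.4, 36.6, 36.6, 36.5),
  longitude   = c(-98.7, -97.9, -97.9, -98.0),
  close_date  = as.POSIXct(c("2014-08-05", "2014-08-12",
                             "2014-08-09", "2014-08-15"), tz = "UTC"),
  close_price = c(100, 200, 300, 400)
)
res <- knn_past_only(toy, k = 2)
res
```

The earliest rows get NA because fewer than k houses closed before them. Ties in close_date are excluded automatically because the comparison is strict.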