I have a data frame which I would like to split into a train and test set by group ID. The following code samples random rows and puts them into a train and test df:
samp <- sample(nrow(df), 0.7 * nrow(df))
train <- df[samp, ]
test <- df[-samp, ]
However, I would like to keep my IDs grouped together.
Example input df:
my_dat <- data.frame(ID=as.factor(rep(1:3, each = 3)), Var=sample(1:100, 9))
ID Var
1 17
1 26
1 100
2 9
2 41
2 49
3 36
3 18
3 5
And desired output:
Train:
ID Var
1 17
1 26
1 100
3 36
3 18
3 5
Test:
ID Var
2 9
2 41
2 49
Here's one way to do this using dplyr:
library(tidyverse)

# Create more data to better demonstrate grouping effect
my_dat <-
  data.frame(ID = as.factor(rep(1:3, each = 9)), Var = sample(1:100, 27))

# Randomly assign each distinct ID to "train" or "test"
groups <-
  my_dat %>%
  distinct(ID) %>%
  rowwise() %>%
  mutate(group = sample(
    c("train", "test"),
    1,
    prob = c(0.5, 0.5) # Set weights for each group here
  )) %>%
  ungroup()

groups

# Join group assignments to my_dat (every row of an ID gets the same group)
my_dat <- my_dat %>%
  left_join(groups, by = "ID")

my_dat
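This approach keeps all of your rows in one data frame and adds a new column defining the group (train vs test) for each row. Because each ID is assigned independently, the split only approximates the weights, and with few IDs they can all land in the same group. If you want to get a dataframe with only training data, you can filter it like this: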
filter(my_dat, group == "train")
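If you need a fixed share of IDs in the training set, closer to the 0.7 split in the question, one option is to sample the unique IDs directly instead of sampling rows. The following is a minimal base R sketch, not part of the answer above; the seed value and the names `ids` and `train_ids` are assumptions chosen for illustration.

```r
set.seed(42)  # assumed seed, only to make the example reproducible

my_dat <- data.frame(ID = as.factor(rep(1:3, each = 3)), Var = sample(1:100, 9))

# Sample whole IDs rather than rows; 0.7 mirrors the ratio used in the question
ids <- unique(my_dat$ID)
train_ids <- sample(ids, size = floor(0.7 * length(ids)))

# Every row of a sampled ID goes to train, all remaining rows go to test
train <- my_dat[my_dat$ID %in% train_ids, ]
test <- my_dat[!my_dat$ID %in% train_ids, ]
```

With three IDs, `floor(0.7 * 3)` selects two IDs for training and leaves one for testing, so no ID appears in both sets. Note that the ratio applies to the number of IDs, not the number of rows, so unevenly sized groups will shift the row-level split.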