Good morning,
I have 2 dataframes : (25000,66) and a thershold table (10,2) containing 10 groups and the last id of each group.
In the big dataset I have one variable called id. It's just id = row_number()
id
1
2
3
4
5
...
25000
EDIT : Lots of answers, thank you for all your ideas. Reading I realized that I forgot an important step in my data description and I apologized.
I'm using synthetic sampling on the original bigdataset to genereate new points. So after sampling the id column looks like this :
id
1
2
2.1
3
3.8
4.74
5.12
6
...
25000
This is why I used the between clause with the last_id to reassign the id to their group.
Threshold table :
last_id group_name
50 grp1
1500 grp2
8900 grp3
...
25000 grp10
I would like to add new column to the big dataset in order to have the id and the group name, based only on the condition that the id falls in the group specified range by the threshold table.
For now I wrote this :
df <- df %>%
dplyr::mutate(group_name = case_when(id < last_id[1,1] ~ last_id[1,2],
between(id, last_id[1,1], last_id[2,1]) ~ last_id[2,2],
between(id, last_id[2,1], last_id[3,1]) ~ last_id[3,2],
between(id, last_id[3,1], last_id[4,1]) ~ last_id[4,2],
between(id, last_id[4,1], last_id[5,1]) ~ last_id[5,2],
between(id, last_id[5,1], last_id[6,1]) ~ last_id[6,2],
between(id, last_id[6,1], last_id[7,1]) ~ last_id[7,2],
between(id, last_id[7,1], last_id[8,1]) ~ last_id[8,2],
between(id, last_id[8,1], last_id[9,1]) ~ last_id[9,2],
id > last_id[9,1] ~ last_id[10,2]))
)
But it doesn't work, I get this error :
Error in FUN(left, right) : comparaison (5) ony possible for types list and atomic
Moreover this code looks terrible, there must be another way using apply or another dplyr function?
Thank you for reading.
Two options with the data.table-package:
1) use the rolling join functionality
dt <- dt2[dt1, on = .(last_id = id), roll = -Inf]
which gives:
> dt last_id group_name 1: 1 grp1 2: 2 grp1 3: 3 grp1 4: 4 grp1 5: 5 grp1 --- 8896: 8896 grp3 8897: 8897 grp3 8898: 8898 grp3 8899: 8899 grp3 8900: 8900 grp3
2) use the non-equi join functionality
# create a 'first_id'
dt2[, first_id := shift(last_id, fill = 0)]
# perform the non-equi join
dt1[dt2, on = .(id > first_id, id <= last_id), group := group_name]
This method will update dt1 instead of creating a new data.table and is therefor more memory efficient:
> dt1 id group 1: 1 grp1 2: 2 grp1 3: 3 grp1 4: 4 grp1 5: 5 grp1 --- 8896: 8896 grp3 8897: 8897 grp3 8898: 8898 grp3 8899: 8899 grp3 8900: 8900 grp3
Options using base R:
Pure base R with findInterval (which is comparable to the cut-method from @ Otto Kässi):
df1$group_name <- df2$group_name[findInterval(df1$id, c(0, df2$last_id), left.open = TRUE)]
Or with base R's merge and zoo::na.locf:
df <- merge(df1, df2, by.x = "id", by.y = "last_id", all.x = TRUE)
df$group_name <- zoo::na.locf(df$group_name, fromLast = TRUE)
Used data:
df1 <- data.frame(id = 1:8900)
df2 <- read.table(text="last_id group_name
50 grp1
1500 grp2
8900 grp3
", header=TRUE, stringsAsFactors=FALSE)
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With