Using case_when and between with a correspondance threshold table

Question

Good morning,

I have 2 dataframes : (25000,66) and a thershold table (10,2) containing 10 groups and the last id of each group.

In the big dataset I have one variable called id. It's just id = row_number()

id
1
2
3
4
5
...
25000

EDIT : Lots of answers, thank you for all your ideas. Reading I realized that I forgot an important step in my data description and I apologized.

I'm using synthetic sampling on the original bigdataset to genereate new points. So after sampling the id column looks like this :

id
1
2
2.1
3
3.8
4.74
5.12
6
...
25000

This is why I used the between clause with the last_id to reassign the id to their group.

Threshold table :

last_id   group_name
50        grp1
1500      grp2
8900      grp3
...
25000     grp10

I would like to add new column to the big dataset in order to have the id and the group name, based only on the condition that the id falls in the group specified range by the threshold table.

For now I wrote this :

df <- df %>%
    dplyr::mutate(group_name = case_when(id < last_id[1,1] ~ last_id[1,2],
                                                between(id, last_id[1,1], last_id[2,1]) ~ last_id[2,2],
                                                between(id, last_id[2,1], last_id[3,1]) ~ last_id[3,2],
                                                between(id, last_id[3,1], last_id[4,1]) ~ last_id[4,2],
                                                between(id, last_id[4,1], last_id[5,1]) ~ last_id[5,2],
                                                between(id, last_id[5,1], last_id[6,1]) ~ last_id[6,2],
                                                between(id, last_id[6,1], last_id[7,1]) ~ last_id[7,2],
                                                between(id, last_id[7,1], last_id[8,1]) ~ last_id[8,2],
                                                between(id, last_id[8,1], last_id[9,1]) ~ last_id[9,2],
                                                id > last_id[9,1] ~ last_id[10,2]))
    )

But it doesn't work, I get this error :

Error in FUN(left, right) : comparaison (5) ony possible for types list and atomic

Moreover this code looks terrible, there must be another way using apply or another dplyr function?

Thank you for reading.

Jaap · Accepted Answer

Two options with the data.table-package:

1) use the rolling join functionality

dt <- dt2[dt1, on = .(last_id = id), roll = -Inf]

which gives:

> dt
      last_id group_name
   1:       1       grp1
   2:       2       grp1
   3:       3       grp1
   4:       4       grp1
   5:       5       grp1
  ---                   
8896:    8896       grp3
8897:    8897       grp3
8898:    8898       grp3
8899:    8899       grp3
8900:    8900       grp3

2) use the non-equi join functionality

# create a 'first_id'
dt2[, first_id := shift(last_id, fill = 0)]
# perform the non-equi join
dt1[dt2, on = .(id > first_id, id <= last_id), group := group_name]

This method will update dt1 instead of creating a new data.table and is therefor more memory efficient:

> dt1
        id group
   1:    1  grp1
   2:    2  grp1
   3:    3  grp1
   4:    4  grp1
   5:    5  grp1
  ---           
8896: 8896  grp3
8897: 8897  grp3
8898: 8898  grp3
8899: 8899  grp3
8900: 8900  grp3

Options using base R:

Pure base R with findInterval (which is comparable to the cut-method from @ Otto Kässi):

df1$group_name <- df2$group_name[findInterval(df1$id, c(0, df2$last_id), left.open = TRUE)]

Or with base R's merge and zoo::na.locf:

df <- merge(df1, df2, by.x = "id", by.y = "last_id", all.x = TRUE)
df$group_name <- zoo::na.locf(df$group_name, fromLast = TRUE)

Used data:

df1 <- data.frame(id = 1:8900)
df2 <- read.table(text="last_id   group_name
50        grp1
1500      grp2
8900      grp3
", header=TRUE, stringsAsFactors=FALSE)

library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

Using case_when and between with a correspondance threshold table

Tags:

r

dplyr

Benson_YoureFired

1 Answers

Jaap

Recent Activity

Donate For Us

Using case_when and between with a correspondance threshold table

Tags:

r

dplyr

Benson_YoureFired

1 Answers

Jaap

Related questions

Recent Activity

Donate For Us