I have a data frame like this:
df <- data.frame(id = c(1,1,1,2,2,3,3,3,3),
vars = c(1,2,5, 1,3, 0,2,4,-1))
> df
id vars
1 1 1
2 1 2
3 1 5
4 2 1
5 2 3
6 3 0
7 3 2
8 3 4
9 3 -1
In this data frame each id can have several observations. I now want to select for each id the pair (2 observations) that have the least absolute difference for vars.
In the above case that would be
id vars
1 1 1
2 1 2
3 2 1
4 2 3
5 3 0
6 3 -1
For id 1, values 1 and 2 have the lowest absolute difference. id 2 only has 2 observations, so both are automatically selected. For id 3, the selected vars would be 0 and -1, because their absolute difference is 1, lower than that of all other combinations.
You don't need to do all the comparisons (or, you can let arrange do your comparisons for you), because once you've sorted the values, each value is already beside the value for which the difference is minimized. The closest pair is therefore always two neighbouring rows in the sorted data.
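library(dplyr)

df %>%
  group_by(id) %>%
  arrange(vars) %>%                     # sort by vars, so rows within each id are sorted too
  slice(which.min(diff(vars)) + 0:1)    # per id: the row before the smallest gap, plus the next row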
# # A tibble: 6 x 2
# # Groups: id [3]
# id vars
# <dbl> <dbl>
# 1 1 1
# 2 1 2
# 3 2 1
# 4 2 3
# 5 3 -1
# 6 3 0
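The same sort-then-diff idea can also be sketched in base R, without dplyr. This is only an illustration, not part of the answer above: the helper name closest_pair is hypothetical, and the sketch assumes every id has at least two observations.

```r
# Hypothetical helper: return the two values of x that are closest together.
# Assumes length(x) >= 2.
closest_pair <- function(x) {
  s <- sort(x)               # after sorting, the closest pair is adjacent
  i <- which.min(diff(s))    # position of the smallest gap (first one on ties)
  s[i + 0:1]                 # the values at positions i and i + 1
}

df <- data.frame(id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 vars = c(1, 2, 5, 1, 3, 0, 2, 4, -1))

# Apply the helper to the vars of each id
by_id <- lapply(split(df$vars, df$id), closest_pair)

# Reassemble into a data frame with one row per selected value
result <- data.frame(id = rep(as.numeric(names(by_id)), lengths(by_id)),
                     vars = unlist(by_id, use.names = FALSE))
result
```

Unlike the slice version, this returns the selected values rather than the original rows, so any other columns in the data frame would be dropped.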
data.table version
library(data.table)
setDT(df)
df[df[order(vars), .I[which.min(diff(vars)) + 0:1], id]$V1]
# id vars
# 1: 3 -1
# 2: 3 0
# 3: 1 1
# 4: 1 2
# 5: 2 1
# 6: 2 3
Not the most concise, but it works. Somebody can probably improve on the idea.
df %>% group_by(id) %>%
  mutate(vars2 = which.min(abs(diff(combn(as.numeric(vars), 2))))) %>%
  mutate(vars1 = ifelse(vars %in% combn(as.numeric(vars), 2)[, vars2], vars, NA)) %>%
  select(id, vars1) %>% .[complete.cases(.), ]
# A tibble: 6 x 2
# Groups: id [3]
id vars1
<dbl> <dbl>
1 1 1
2 1 2
3 2 1
4 2 3
5 3 0
6 3 -1
The main idea is to take the difference for every possible pair of values within each group. vars2 stores the index of the pair (a column of the combn matrix) with the lowest absolute difference. If a value is one of the two in that column, it is kept; otherwise it is set to NA. Then only complete cases are returned.
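To make the steps concrete, here is a small walk-through for id 3 alone. It is a sketch for illustration only; the variable names combos, gaps and best are made up here and do not appear in the answer.

```r
# Walk through the combination approach for the values of id 3 only
vars <- c(0, 2, 4, -1)

combos <- combn(vars, 2)      # 2 x 6 matrix, one column per possible pair
gaps   <- abs(diff(combos))   # 1 x 6 matrix of absolute differences
best   <- which.min(gaps)     # index of the column with the smallest gap

combos[, best]                # the selected pair: 0 and -1
vars %in% combos[, best]      # TRUE for the observations that are kept
```

A group of n values produces n(n-1)/2 columns here, which is why sorting first scales better for large groups.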