Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

select repeated row observations with the least absolute difference

Tags:

dataframe

r

dplyr

I have a data frame like this:

df <- data.frame(id = c(1,1,1,2,2,3,3,3,3),
                  vars = c(1,2,5, 1,3, 0,2,4,-1))

> df
  id vars
1  1    1
2  1    2
3  1    5
4  2    1
5  2    3
6  3    0
7  3    2
8  3    4
9  3   -1

In this data frame each id can have several observations. I now want to select for each id the pair (2 observations) that have the least absolute difference for vars.

In the above case that would be

  id vars
1  1    1
2  1    2
3  2    1
4  2    3
5  3    0
6  3   -1

for id 1, values 1 and 2 have the lowest absolute difference, id 2 only has 2 observations so both are automatically selected. for the id 3 the selected vars would be 0 and -1 because the absolute difference is 1, lower than all other combinations.

like image 720
spore234 Avatar asked Jan 01 '23 23:01

spore234


2 Answers

You don't need to do all the comparisons (or, you can let arrange do your comparisons for you), because once you've sorted the values each value is already beside the value for which the difference is minimized.

df %>% 
  group_by(id) %>% 
  arrange(vars) %>% 
  slice(which.min(diff(vars)) + 0:1)

# # A tibble: 6 x 2
# # Groups:   id [3]
#      id  vars
#   <dbl> <dbl>
# 1     1     1
# 2     1     2
# 3     2     1
# 4     2     3
# 5     3    -1
# 6     3     0

data.table version

library(data.table)
setDT(df)

df[df[order(vars), .I[which.min(diff(vars)) + 0:1], id]$V1]

#    id vars
# 1:  3   -1
# 2:  3    0
# 3:  1    1
# 4:  1    2
# 5:  2    1
# 6:  2    3
like image 194
IceCreamToucan Avatar answered Jan 08 '23 02:01

IceCreamToucan


Not the most concise but works. Probably somebody can improve the idea.

df1%>%group_by(id)%>%mutate(vars2=which.min(abs(diff(combn(num(vars),2)))))%>%
  mutate(vars1=ifelse(vars%in%combn(num(vars),2)[,vars2],vars,NA))%>%select(id,vars1)%>%.[complete.cases(.),]

# A tibble: 6 x 2
# Groups:   id [3]
     id vars1
  <dbl> <dbl>
1     1     1
2     1     2
3     2     1
4     2     3
5     3     0
6     3    -1

The main idea is to do the difference on all the possible combinations of the values of each group. vars2 keeps the column with the lowest difference. If the value is one of the two present in the vars2 column, it is kept. Else, it is set as NA. Then, only complete cases are returned.

like image 40
boski Avatar answered Jan 08 '23 02:01

boski