Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R how do I extract rows based on another data frame where the values don't need to be exact? They just need to be close enough?

Tags:

merge

r

I have two data frames with mostly the same values

df1:

v1 v2 v3 v4 v5 v6 v7 ......

500 40 5.2 z1 .....

500 40 7.2 z2 .....

500 40 9.0 z3 .....

500 40 3.5 z4 .....

500 40 4.2 z5 .....

df2:

v1 v2 v3 v4 v5 v6 v7 .....

500 40 5.1 m1 .....

500 40 7.9 m2 .....

500 20 8.6 m3 .....

500 40 3.7 m4 .....

500 40 4.0 m5 .....

I would like to merge (or any function like it) so that my new df1 file has exact matching v1 and v2, but v3 does not need to be strictly exact. Is there a way I can match v3 to within +/- 0.2?

I would like the final df1 to look like:

v1 v2 v3 v4 v5 v6 v7 .....

500 40 5.2 z1 .....

500 40 3.5 z4 .....

500 40 4.2 z5 .....

I get as far as below, but I'm not sure how to account for the variability of column v3.

hed <- c("v1", "v2", "v3") #original data didn't have header
df1_final <- merge(df1, df2[hed],by=hed)

If there is a better language to deal with this I'd accept that too, but this is only one part to an entire R script i'm working on.

like image 976
Mason J. Avatar asked Mar 06 '23 14:03

Mason J.


1 Answers

Using the tidyverse we could first join, then filter with near (and a tolerance):

library(tidyverse)

df1 <- data_frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 40, 40, 40),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))

df2 <- data_frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

df1 %>%
  full_join(df2, by = c("v1", "v2")) %>%    # join on v1 and v2
  filter(near(v3.x, v3.y, tol = 0.21)) %>%  # filter with a tolerance
  rename(v3 = v3.x, v4 = v4.x) %>%          # rename the columns
  select(v1:v4)                             # select em

This yields

# A tibble: 3 x 4
     v1    v2    v3 v4   
  <dbl> <dbl> <dbl> <chr>
1  500.   40.  5.20 z1   
2  500.   40.  3.50 z4   
3  500.   40.  4.20 z5 
like image 120
Jan Avatar answered Apr 06 '23 01:04

Jan