Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Filter two data.frames for duplicated values in two columns and by groups

I have a data.frame dat which stores my normal data and groups are defined by ID.

data <- structure(list(NAME = structure(c(1L, 1L, 2L), .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 23L, 57L), REF_YEAR = c(1920L, 1938L, 1869L), SURV_YEAR = c(1938L, 1962L, 1872L), VALUE = c(20L, 40L, 34L)), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR","VALUE"), class = "data.frame", row.names = c(NA, -3L))

  NAME  ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23     1920      1938    20
2 NAME1 23     1938      1962    40
3 NAME2 57     1869      1872    34

And I have a second data.frame, dat_q which I would like to compare to dat

dat_q <- structure(list(NAME = structure(1:2, .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 57L), REF_YEAR = c(1934L, 1866L), SURV_YEAR = c(1938L, 1868L), VALUE = structure(1:2, .Label = c("A", "B"), class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -2L))

  NAME  ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23     1934      1938     A
2 NAME2 57     1866      1868     B

My question: How could I delete all rows in dat_q that contain an equal value in the columns REF_YEAR or SURV_YEAR than in the same columns of dat (in the sample data 1938)? This should be applied by group (as defined by ID) and not over the whole data.frame

In the end, using my sample data this would be the result coming from filtering dat_q

  NAME  ID REF_YEAR SURV_YEAR VALUE
2 NAME2 57     1866      1868     B

EDIT

Here is some other sample data with which the code provided by @thelatemail won't work. And I can't figure out why, dat_q should be filtered out, because it contains an exact same value than dat.

data <- structure(list(NAME = structure(c(1L, 1L, 1L), .Label = "NAME1", class = "factor"), ID = c(226L, 226L, 226L), SURV_YEAR = c(2009L, 2010L, 2012L), REF_YEAR = c(2008L, 2009L, 2011L), VALUE = c(-7L, -37L,  -51L)), .Names = c("NAME", "ID", "SURV_YEAR", "REF_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L))

   NAME  ID SURV_YEAR REF_YEAR VALUE
1 NAME1 226      2009     2008    -7
2 NAME1 226      2010     2009   -37
3 NAME1 226      2012     2011   -51

dat_q <- structure(list(NAME = structure(1L, .Label = "NAME1", class = "factor"), ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L, VALUE = structure(1L, .Label = "-X", class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -1L))

  NAME   ID REF_YEAR SURV_YEAR VALUE
1 NAME1 226     2010      2011    -X

like image 910
kurdtc Avatar asked Sep 30 '14 22:09

kurdtc


1 Answers

I like by in base R for figuring out the logic of this sort of problem. This works, but may be a bit slow:

do.call(rbind,by(
  dat_q,
  dat_q$ID,
  function(x) {
    subdata <- data[data$ID==x$ID,]
    x[!(x$REF_YEAR %in% subdata$REF_YEAR | x$SURV_YEAR %in% subdata$SURV_YEAR),]
  }
))

#    NAME ID REF_YEAR SURV_YEAR VALUE
#57 NAME2 57     1866      1868     B

A data.table solution following the same logic may be quicker:

library(data.table)
setDT(dat_q)
setDT(data)
dat_q[
     ,
     .SD[!(REF_YEAR   %in% data$REF_YEAR[data[,ID==.BY]] | 
           SURV_YEAR  %in% data$SURV_YEAR[data[,ID==.BY]])],
     by=ID
]

#   ID  NAME REF_YEAR SURV_YEAR VALUE
#1: 57 NAME2     1866      1868     B

With data.table, I think you can also do it this way. After converting to data.tables,

# using 1.9.3+, just remove `by=.EACHI` if you're using <= 1.9.2
setkey(data, ID)
setkey(dat_q, ID)

idx = data[dat_q, any(c(i.REF_YEAR, i.SURV_YEAR) %in% c(REF_YEAR, SURV_YEAR)), by=.EACHI]$V1
dat_q[!idx]
#     NAME ID REF_YEAR SURV_YEAR VALUE
# 1: NAME2 57     1866      1868     B

We perform a join, and on each matching rows of data corresponding to dat_q, on the key columns, we compute the expression in j. That gives us the logical value we need to index/subset dat_q later.

like image 141
thelatemail Avatar answered Nov 15 '22 09:11

thelatemail