I have a data.frame dat which stores my normal data and groups are defined by ID.
data <- structure(list(NAME = structure(c(1L, 1L, 2L), .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 23L, 57L), REF_YEAR = c(1920L, 1938L, 1869L), SURV_YEAR = c(1938L, 1962L, 1872L), VALUE = c(20L, 40L, 34L)), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR","VALUE"), class = "data.frame", row.names = c(NA, -3L))
NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23 1920 1938 20
2 NAME1 23 1938 1962 40
3 NAME2 57 1869 1872 34
And I have a second data.frame, dat_q which I would like to compare to dat
dat_q <- structure(list(NAME = structure(1:2, .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 57L), REF_YEAR = c(1934L, 1866L), SURV_YEAR = c(1938L, 1868L), VALUE = structure(1:2, .Label = c("A", "B"), class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -2L))
NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23 1934 1938 A
2 NAME2 57 1866 1868 B
My question: How could I delete all rows in dat_q that contain an equal value in the columns REF_YEAR or SURV_YEAR than in the same columns of dat (in the sample data 1938)? This should be applied by group (as defined by ID) and not over the whole data.frame
In the end, using my sample data this would be the result coming from filtering dat_q
NAME ID REF_YEAR SURV_YEAR VALUE
2 NAME2 57 1866 1868 B
EDIT
Here is some other sample data with which the code provided by @thelatemail won't work. And I can't figure out why, dat_q should be filtered out, because it contains an exact same value than dat.
data <- structure(list(NAME = structure(c(1L, 1L, 1L), .Label = "NAME1", class = "factor"), ID = c(226L, 226L, 226L), SURV_YEAR = c(2009L, 2010L, 2012L), REF_YEAR = c(2008L, 2009L, 2011L), VALUE = c(-7L, -37L, -51L)), .Names = c("NAME", "ID", "SURV_YEAR", "REF_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L))
NAME ID SURV_YEAR REF_YEAR VALUE
1 NAME1 226 2009 2008 -7
2 NAME1 226 2010 2009 -37
3 NAME1 226 2012 2011 -51
dat_q <- structure(list(NAME = structure(1L, .Label = "NAME1", class = "factor"), ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L, VALUE = structure(1L, .Label = "-X", class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -1L))
NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 226 2010 2011 -X
I like by in base R for figuring out the logic of this sort of problem. This works, but may be a bit slow:
do.call(rbind,by(
dat_q,
dat_q$ID,
function(x) {
subdata <- data[data$ID==x$ID,]
x[!(x$REF_YEAR %in% subdata$REF_YEAR | x$SURV_YEAR %in% subdata$SURV_YEAR),]
}
))
# NAME ID REF_YEAR SURV_YEAR VALUE
#57 NAME2 57 1866 1868 B
A data.table solution following the same logic may be quicker:
library(data.table)
setDT(dat_q)
setDT(data)
dat_q[
,
.SD[!(REF_YEAR %in% data$REF_YEAR[data[,ID==.BY]] |
SURV_YEAR %in% data$SURV_YEAR[data[,ID==.BY]])],
by=ID
]
# ID NAME REF_YEAR SURV_YEAR VALUE
#1: 57 NAME2 1866 1868 B
With data.table, I think you can also do it this way. After converting to data.tables,
# using 1.9.3+, just remove `by=.EACHI` if you're using <= 1.9.2
setkey(data, ID)
setkey(dat_q, ID)
idx = data[dat_q, any(c(i.REF_YEAR, i.SURV_YEAR) %in% c(REF_YEAR, SURV_YEAR)), by=.EACHI]$V1
dat_q[!idx]
# NAME ID REF_YEAR SURV_YEAR VALUE
# 1: NAME2 57 1866 1868 B
We perform a join, and on each matching rows of data corresponding to dat_q, on the key columns, we compute the expression in j. That gives us the logical value we need to index/subset dat_q later.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With