I'm assigning rows of data to several different groups. The main issue is there are many groups, but not every group is using the same set of fields. I would like to set up a reference table that I could loop over or shove through a function but I don't know how to remove the fields from the filter where they are unneeded.
Below is sample code, I've included a version of my current solution as well as an example table.
library(data.table)
set.seed(1)
n <- 1000
#Sample Data
ExampleData <- data.table(sample(1:3,n,replace = TRUE),
sample(10:12,n,replace = TRUE),
sample(letters[1:3],n,replace = TRUE),
sample(LETTERS[1:3],n,replace = TRUE))
#Current solution
ExampleData[V1 == 1 & V2 == 11 & V4 == "C", Group := "Group1"]
ExampleData[V1 == 2, Group := "Group2"]
ExampleData[V1 == 3 & V3 == "a" & V4 == "B", Group := "Group3"]
#Example reference table
ExampleRefTable <- data.table(Group = c("Group1","Group2","Group3"),
V1 = c(1,2,3),
V2 = c(11,NA,NA),
V3 = c(NA,NA,"a"),
V4 = c("C",NA,"B"))
(Thanks to @eddi:) You could iterate over rows/groups in the ref table with by=
:
ExampleRefTable[,
ExampleData[copy(.SD), on = names(.SD)[!is.na(.SD)], grp := .BY$Group]
, by = Group]
For each Group, we are using .SD
(the rest of the S
ubset of the ref table D
ata) for an update join, ignoring columns of .SD
that are NA. .BY
contains the per-group values of by=
.
(My original answer:) You could split up the ref table into subsets with non-NA values:
ExampleRefTable[, gNA := .GRP, by=ExampleRefTable[, !"Group"]]
RefTabs = lapply(
split(ExampleRefTable, by="gNA", keep.by = FALSE),
FUN = Filter, f = function(x) !anyNA(x)
)
which looks like
$`1`
Group V1 V2 V4
1: Group1 1 11 C
$`2`
Group V1
1: Group2 2
$`3`
Group V1 V3 V4
1: Group3 3 a B
Then iterate over these tables with update joins:
ExampleData[, Group := NA_character_]
for (i in seq_along(RefTabs)){
RTi = RefTabs[[i]]
nmi = setdiff(names(RTi), "Group")
ExampleData[is.na(Group), Group :=
RTi[copy(.SD), on=names(.SD), x.Group]
, .SDcols=nmi][]
}
rm(RTi, nmi)
By filtering on is.na(Group)
, I'm assuming that the rules in the ref table are mutually exclusive.
The copy
on .SD
is needed due to an open issue.
This might be more efficient than @eddi's way (at the top of this answer) if there are many groups sharing the same missing/nonmissing columns.
If you are manually writing your ref table, I would suggest...
rbindlist(idcol = "Group", fill = TRUE, list(
NULL = list(V1 = numeric(), V2 = numeric(), V3 = character(), V4 = character()),
Group1 = list(V1 = 1, V2 = 11, V4 = "C"),
Group2 = list(V1 = 2),
Group3 = list(V1 = 3, V3 = "a", V4 = "B")
))
Group V1 V2 V3 V4
1: Group1 1 11 <NA> C
2: Group2 2 NA <NA> <NA>
3: Group3 3 NA a B
for easier reading and editing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With