Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove filter based on table with NA

Tags:

r

data.table

I'm assigning rows of data to several different groups. The main issue is there are many groups, but not every group is using the same set of fields. I would like to set up a reference table that I could loop over or shove through a function but I don't know how to remove the fields from the filter where they are unneeded.

Below is sample code, I've included a version of my current solution as well as an example table.

library(data.table)

set.seed(1)

n <- 1000

#Sample Data
ExampleData <- data.table(sample(1:3,n,replace = TRUE),
                          sample(10:12,n,replace = TRUE),
                          sample(letters[1:3],n,replace = TRUE),
                          sample(LETTERS[1:3],n,replace = TRUE))

#Current solution
ExampleData[V1 == 1 & V2 == 11 & V4 == "C", Group := "Group1"]
ExampleData[V1 == 2, Group := "Group2"]
ExampleData[V1 == 3 & V3 == "a" & V4 == "B", Group := "Group3"]


#Example reference table
ExampleRefTable <- data.table(Group = c("Group1","Group2","Group3"),
                              V1 = c(1,2,3),
                              V2 = c(11,NA,NA),
                              V3 = c(NA,NA,"a"),
                              V4 = c("C",NA,"B"))
like image 299
araskazes Avatar asked Mar 12 '19 14:03

araskazes


1 Answers

(Thanks to @eddi:) You could iterate over rows/groups in the ref table with by=:

ExampleRefTable[, 
  ExampleData[copy(.SD), on = names(.SD)[!is.na(.SD)], grp := .BY$Group]
, by = Group] 

For each Group, we are using .SD (the rest of the Subset of the ref table Data) for an update join, ignoring columns of .SD that are NA. .BY contains the per-group values of by=.


(My original answer:) You could split up the ref table into subsets with non-NA values:

ExampleRefTable[, gNA := .GRP, by=ExampleRefTable[, !"Group"]]

RefTabs = lapply(
  split(ExampleRefTable, by="gNA", keep.by = FALSE), 
  FUN = Filter, f = function(x) !anyNA(x)
)

which looks like

$`1`
    Group V1 V2 V4
1: Group1  1 11  C

$`2`
    Group V1
1: Group2  2

$`3`
    Group V1 V3 V4
1: Group3  3  a  B

Then iterate over these tables with update joins:

ExampleData[, Group := NA_character_]
for (i in seq_along(RefTabs)){
  RTi = RefTabs[[i]]
  nmi = setdiff(names(RTi), "Group")

  ExampleData[is.na(Group), Group := 
    RTi[copy(.SD), on=names(.SD), x.Group]
  , .SDcols=nmi][]
} 

rm(RTi, nmi)

By filtering on is.na(Group), I'm assuming that the rules in the ref table are mutually exclusive.

The copy on .SD is needed due to an open issue.

This might be more efficient than @eddi's way (at the top of this answer) if there are many groups sharing the same missing/nonmissing columns.


If you are manually writing your ref table, I would suggest...

rbindlist(idcol = "Group", fill = TRUE, list(
  NULL = list(V1 = numeric(), V2 = numeric(), V3 = character(), V4 = character()),
  Group1 = list(V1 = 1, V2 = 11, V4 = "C"),
  Group2 = list(V1 = 2),
  Group3 = list(V1 = 3, V3 = "a", V4 = "B")
))


    Group V1 V2   V3   V4
1: Group1  1 11 <NA>    C
2: Group2  2 NA <NA> <NA>
3: Group3  3 NA    a    B

for easier reading and editing.

like image 123
Frank Avatar answered Sep 20 '22 16:09

Frank