
Nested ifelse with varying columns in data.table

I need to compute a "best value" for each row across selected columns of a data.table. The best value for a row is the value of the first non-NA column, taken in the given order of the selected columns.

As a requirement, the columns to include may vary in order or number. In addition, the name of the column that supplied the best value should be stored for each row.

Example data

With

library(data.table)
library(magrittr)
n <- 7
set.seed(1234)
dt <- sample.int(100, n*5, replace = TRUE) %>% 
  ifelse(. < 35, NA, .) %>% 
  matrix(nrow = n) %>%
  as.data.table()

the sample data.table is

   V1 V2 V3 V4 V5
1: NA NA NA NA 84
2: 63 67 84 NA NA
3: 61 52 NA NA 46
4: 63 70 NA NA NA
5: 87 55 NA 82 NA
6: 65 NA NA 53 51
7: NA 93 NA 92 NA
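
R 3.6.0 changed the default sampling algorithm, so on current R versions the seed above may produce different numbers. As a minimal sketch, the same table can be built by hand, with the values copied from the printout above:

```r
library(data.table)

# Values copied from the printed sample table; NA marks the missing entries.
dt <- data.table(
  V1 = c(NA, 63L, 61L, 63L, 87L, 65L, NA),
  V2 = c(NA, 67L, 52L, 70L, 55L, NA, 93L),
  V3 = c(NA, 84L, NA, NA, NA, NA, NA),
  V4 = c(NA, NA, NA, NA, 82L, 53L, 92L),
  V5 = c(84L, NA, 46L, NA, NA, 51L, NA)
)
```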

The columns to be included in the given order are

selected_cols <- c("V3", "V4", "V1")

Expected result with hard-coded nested ifelse

The hardcoded version

dt[, best_value := ifelse(!is.na(V3), V3, ifelse(!is.na(V4), V4, V1))]

will give the expected result for the best value

   V1 V2 V3 V4 V5 best_value
1: NA NA NA NA 84         NA
2: 63 67 84 NA NA         84
3: 61 52 NA NA 46         61
4: 63 70 NA NA NA         63
5: 87 55 NA 82 NA         82
6: 65 NA NA 53 51         53
7: NA 93 NA 92 NA         92

but it still doesn't show from which of the columns the best value was taken.

In row 2, column V3 already has a non-NA value. For rows 5, 6, and 7, the values from column V4 are taken. Finally, column V1 gives the values for rows 3 and 4, where both V3 and V4 are NA. Row 1 contains NA because all columns under consideration are NA.

Flexible approach with for loop

Using a for loop over the selected columns and some data.table features

dt[, best_value := NA_integer_]
dt[, best_col := NA_character_]
for (x in selected_cols) {
  dt[is.na(best_value), best_col := ifelse(!is.na(.SD), names(.SD), NA), .SDcols = x]
  dt[is.na(best_value), best_value := .SD, .SDcols = x]
}

we get the full expected result

   V1 V2 V3 V4 V5 best_value best_col
1: NA NA NA NA 84         NA       NA
2: 63 67 84 NA NA         84       V3
3: 61 52 NA NA 46         61       V1
4: 63 70 NA NA NA         63       V1
5: 87 55 NA 82 NA         82       V4
6: 65 NA NA 53 51         53       V4
7: NA 93 NA 92 NA         92       V4

In addition, the vector of columns to be included can be changed easily.

Question

However, the approach with a for loop with two statements looks rather clumsy to me and not very data.table-like.

Is there a better way to achieve this result with data.table, dplyr, or even base R?

Uwe asked Jun 12 '16 17:06


2 Answers

Working on your 'for' loop and taking advantage of the fact that a data.table is a list of columns:

ans_col = rep_len(NA_character_, nrow(dt))
ans_val = rep_len(NA_real_, nrow(dt))
for(col in selected_cols) {
    i = is.na(ans_col) & (!is.na(dt[[col]]))
    ans_col[i] = col
    ans_val[i] = dt[[col]][i]   
}
data.frame(ans_val, ans_col)
#  ans_val ans_col
#1      NA    <NA>
#2      84      V3
#3      61      V1
#4      63      V1
#5      82      V4
#6      53      V4
#7      92      V4
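
Since the question asks for columns that vary in order or number, the loop above can be wrapped in a function. This is only a sketch: the helper name first_non_na and its signature are illustrative, and the three-row table is a cut-down copy of rows 1, 2, and 5 of the sample data.

```r
library(data.table)

# Hypothetical helper wrapping the loop above; name and signature are illustrative.
first_non_na <- function(dt, cols) {
  ans_col <- rep_len(NA_character_, nrow(dt))
  ans_val <- rep_len(NA_real_, nrow(dt))
  for (col in cols) {
    # Rows still unresolved that have a value in the current column
    i <- is.na(ans_col) & !is.na(dt[[col]])
    ans_col[i] <- col
    ans_val[i] <- dt[[col]][i]
  }
  list(ans_val, ans_col)
}

# Rows 1, 2, and 5 of the sample data, restricted to the selected columns
dt <- data.table(V1 = c(NA, 63, 87), V3 = c(NA, 84, NA), V4 = c(NA, NA, 82))

# Add both result columns by reference
dt[, c("best_value", "best_col") := first_non_na(dt, c("V3", "V4", "V1"))]

# A different selection only changes the cols argument
res2 <- first_non_na(dt, c("V1", "V4"))
```

Returning a two-element list lets the result be assigned to both new columns in a single := call.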
alexis_laz answered Nov 03 '22 00:11


We specify 'selected_cols' in .SDcols and group by row number. Within each group, we unlist the Subset of Data.table (unlist(.SD)), get the index of the first non-NA value ('j1'), and use it to extract both the value from 'v1' and the corresponding column name. The two results are assigned (:=) to create two new columns.

dt[, c("best_val", "best_col") := {
       v1 <- unlist(.SD)
       j1 <- which(!is.na(v1))[1]
       list(v1[j1], names(.SD)[j1])
     },
   .SDcols = selected_cols, by = 1:nrow(dt)]
dt
#   V1 V2 V3 V4 V5 best_val best_col
#1: NA NA NA NA 84       NA       NA
#2: 63 67 84 NA NA       84       V3
#3: 61 52 NA NA 46       61       V1
#4: 63 70 NA NA NA       63       V1
#5: 87 55 NA 82 NA       82       V4
#6: 65 NA NA 53 51       53       V4
#7: NA 93 NA 92 NA       92       V4

In base R, row/column indexing can be combined with max.col:

setDF(dt)
j1 <-  max.col(!is.na(dt[selected_cols]), "first")
best_value <- dt[selected_cols][cbind(1:nrow(dt),j1)]
best_value
#[1] NA 84 61 63 82 53 92
j2 <- j1*NA^(!rowSums(!is.na(dt[selected_cols])))

best_col <- selected_cols[j2]
best_col
#[1] NA   "V3" "V1" "V1" "V4" "V4" "V4"
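
The 'j2' step is needed because max.col with ties "first" returns 1 for a row in which every selected column is NA, which would wrongly label that row with the first selected column. The NA^(...) expression masks such rows. A small sketch of the idiom on a toy vector (the counts are made up for illustration):

```r
# Toy vector: number of non-NA values found in each row (hypothetical values)
n_non_na <- c(0, 1, 3)

# !n_non_na is TRUE only where the count is zero;
# NA^TRUE is NA, while NA^FALSE is 1
mask <- NA^(!n_non_na)
mask
# [1] NA  1  1

# Multiplying a column index by the mask blanks out the all-NA rows
c(1, 2, 3) * mask
# [1] NA  2  3
```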
akrun answered Nov 03 '22 01:11
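
A further option that is not part of the answers above: newer data.table releases (1.12.4 or later, to the best of my knowledge) provide fcoalesce and fifelse, which allow a loop-free sketch. The sample table is rebuilt by hand so the block is self-contained, and the intermediate names val_list, lab_list, val, and lab are illustrative.

```r
library(data.table)

# Sample data copied from the question's printout
dt <- data.table(
  V1 = c(NA, 63L, 61L, 63L, 87L, 65L, NA),
  V2 = c(NA, 67L, 52L, 70L, 55L, NA, 93L),
  V3 = c(NA, 84L, NA, NA, NA, NA, NA),
  V4 = c(NA, NA, NA, NA, 82L, 53L, 92L),
  V5 = c(84L, NA, 46L, NA, NA, 51L, NA)
)
selected_cols <- c("V3", "V4", "V1")

# Values: first non-NA entry across the selected columns, in the given order
val_list <- lapply(selected_cols, function(col) dt[[col]])
val <- do.call(fcoalesce, val_list)

# Names: turn each non-NA entry into its column name, then coalesce again
lab_list <- lapply(selected_cols, function(col) {
  fifelse(is.na(dt[[col]]), NA_character_, col)
})
lab <- do.call(fcoalesce, lab_list)

# Add both result columns by reference
dt[, c("best_value", "best_col") := list(val, lab)]
```

Changing the order or number of entries in selected_cols is enough to change which columns are considered.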