Nested ifelse with varying columns in data.table

Question

I need to compute a "best value" for each row across selected columns of a data.table. The best value for a row is the first non-NA value found when the selected columns are scanned in the given order.

As a requirement, the columns to include may vary in order and number. In addition, the name of the column that supplied the best value should be stored for each row.

Example data

With

library(data.table)
library(magrittr)
n <- 7
set.seed(1234)
dt <- sample.int(100, n*5, replace = TRUE) %>% 
  ifelse(. < 35, NA, .) %>% 
  matrix(, nrow = n) %>% 
  as.data.table()

the sample data.table is

   V1 V2 V3 V4 V5
1: NA NA NA NA 84
2: 63 67 84 NA NA
3: 61 52 NA NA 46
4: 63 70 NA NA NA
5: 87 55 NA 82 NA
6: 65 NA NA 53 51
7: NA 93 NA 92 NA

The columns to be included in the given order are

selected_cols <- c("V3", "V4", "V1")

Expected result with hard-coded nested `ifelse`

The hard-coded version

dt[, best_value := ifelse(!is.na(V3), V3, ifelse(!is.na(V4), V4, V1))]

will give the expected result for the best value

   V1 V2 V3 V4 V5 best_value
1: NA NA NA NA 84         NA
2: 63 67 84 NA NA         84
3: 61 52 NA NA 46         61
4: 63 70 NA NA NA         63
5: 87 55 NA 82 NA         82
6: 65 NA NA 53 51         53
7: NA 93 NA 92 NA         92

but it still doesn't show from which of the columns the best value was taken.

In row 2, column V3 already has a non-NA value. For rows 5, 6, and 7, the values are taken from column V4. Finally, column V1 gives the values for rows 3 and 4, where both V3 and V4 are NA. Row 1 contains an NA because all columns under consideration are NA.

Flexible approach with `for` loop

Using a for loop over the selected columns and some data.table features

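# start with empty result columns
dt[, best_value := NA_integer_]
dt[, best_col := NA_character_]
for (x in selected_cols) {
  # for rows not yet filled, record the column name where column x is non-NA
  dt[is.na(best_value), best_col := ifelse(!is.na(.SD), names(.SD), NA), .SDcols = x]
  # then copy the values of column x into those rows
  dt[is.na(best_value), best_value := .SD, .SDcols = x]
}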

we get the full expected result

   V1 V2 V3 V4 V5 best_value best_col
1: NA NA NA NA 84         NA       NA
2: 63 67 84 NA NA         84       V3
3: 61 52 NA NA 46         61       V1
4: 63 70 NA NA NA         63       V1
5: 87 55 NA 82 NA         82       V4
6: 65 NA NA 53 51         53       V4
7: NA 93 NA 92 NA         92       V4

In addition, the vector of columns to be included can be changed easily.

Question

However, the approach of a for loop with two statements looks rather clumsy to me and not very data.table-like.

Is there a better way to achieve this result with data.table, dplyr, or even base R?

alexis_laz · Accepted Answer

Building on your 'for' loop and taking advantage of the fact that a data.table is a list of columns:

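# result vectors, initially all NA
ans_col = rep_len(NA_character_, nrow(dt))
ans_val = rep_len(NA_real_, nrow(dt))
for(col in selected_cols) {
    # rows that are still unfilled and have a non-NA value in this column
    i = is.na(ans_col) & (!is.na(dt[[col]]))
    ans_col[i] = col
    ans_val[i] = dt[[col]][i]
}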
data.frame(ans_val, ans_col)
#  ans_val ans_col
#1      NA    <NA>
#2      84      V3
#3      61      V1
#4      63      V1
#5      82      V4
#6      53      V4
#7      92      V4
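
To write the result back into dt, the loop could be wrapped in a small helper. This is a minimal sketch and not part of the original answer: the function name `first_non_na` is hypothetical, and the sample table is typed in by hand from the question so that the example does not depend on the random number generator.

```r
library(data.table)

# Sample data typed in from the question (hand-copied, not re-sampled)
dt <- data.table(
  V1 = c(NA, 63L, 61L, 63L, 87L, 65L, NA),
  V2 = c(NA, 67L, 52L, 70L, 55L, NA, 93L),
  V3 = c(NA, 84L, NA, NA, NA, NA, NA),
  V4 = c(NA, NA, NA, NA, 82L, 53L, 92L),
  V5 = c(84L, NA, 46L, NA, NA, 51L, NA)
)
selected_cols <- c("V3", "V4", "V1")

# Hypothetical helper: first non-NA value per row and the column it came from
first_non_na <- function(x, cols) {
  ans_col <- rep_len(NA_character_, nrow(x))
  ans_val <- rep_len(NA_real_, nrow(x))
  for (col in cols) {
    # rows that are still unfilled and have a non-NA value in this column
    i <- is.na(ans_col) & !is.na(x[[col]])
    ans_col[i] <- col
    ans_val[i] <- x[[col]][i]
  }
  list(ans_val, ans_col)
}

# Add both result columns by reference
dt[, c("best_value", "best_col") := first_non_na(dt, selected_cols)]
```

Changing the order or number of columns then only means passing a different vector, e.g. `first_non_na(dt, c("V1", "V4"))`.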

akrun · Answer

We specify 'selected_cols' in .SDcols and group by row number (by = 1:nrow(dt)), so that .SD holds a single row of the selected columns. For each row, we unlist the Subset of Data.table (unlist(.SD)) into a vector ('v1'), get the index of the first non-NA value ('j1'), use that index to extract both the value from 'v1' and the matching column name from names(.SD), and assign (:=) the two to new columns.

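dt[, c("best_val", "best_col") := {
  v1 <- unlist(.SD)             # values of the selected columns for this row
  j1 <- which(!is.na(v1))[1]    # index of the first non-NA value (NA if there is none)
  list(v1[j1], names(.SD)[j1])
}, .SDcols = selected_cols, by = 1:nrow(dt)]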
dt
#   V1 V2 V3 V4 V5 best_val best_col
#1: NA NA NA NA 84       NA       NA
#2: 63 67 84 NA NA       84       V3
#3: 61 52 NA NA 46       61       V1
#4: 63 70 NA NA NA       63       V1
#5: 87 55 NA 82 NA       82       V4
#6: 65 NA NA 53 51       53       V4
#7: NA 93 NA 92 NA       92       V4

In base R, row/column indexing can be combined with max.col, which here returns, for each row, the position of the first non-NA value among the selected columns:

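setDF(dt)  # convert dt to a data.frame by reference
# position (within selected_cols) of the first non-NA value in each row
j1 <- max.col(!is.na(dt[selected_cols]), "first")
# pick one cell per row via a (row, column) index matrix
best_value <- dt[selected_cols][cbind(1:nrow(dt), j1)]
best_value
#[1] NA 84 61 63 82 53 92
# rows where every selected column is NA still get position 1 from max.col;
# set those to NA (NA^1 is NA, NA^0 is 1)
j2 <- j1 * NA^(!rowSums(!is.na(dt[selected_cols])))
best_col <- selected_cols[j2]
best_col
#[1] NA   "V3" "V1" "V1" "V4" "V4" "V4"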
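
The `NA^(...)` step can look cryptic, so here is a small illustration. It uses a made-up logical matrix `m` (an assumption for this sketch, not the question's data) to show why the step is needed: with ties resolved as "first", max.col returns 1 for a row that contains no TRUE at all, so such rows have to be masked afterwards.

```r
# Made-up indicator matrix: TRUE marks a non-NA cell
m <- rbind(
  c(FALSE, FALSE, FALSE),  # no non-NA value in this row
  c(TRUE,  FALSE, TRUE),   # first TRUE in column 1
  c(FALSE, TRUE,  TRUE)    # first TRUE in column 2
)

# Position of the first maximum per row; an all-FALSE row also yields 1
j1 <- max.col(m, ties.method = "first")
j1
#[1] 1 1 2

# NA^0 is 1 and NA^1 is NA, so the logical exponent acts as a switch
mask <- NA^(!rowSums(m))
mask
#[1] NA  1  1

# Multiplying keeps valid positions and turns the all-FALSE row into NA
j1 * mask
#[1] NA  1  2
```

An equivalent and arguably more readable way to get the same masking would be `ifelse(rowSums(m) == 0, NA, j1)`.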

Tags: dataframe, r, data.table, dplyr

Asked by Uwe · 2 Answers (alexis_laz, akrun)
