r - copy missing values from other variables

Question

Simple question, but I can't figure out how to do the following. This is my data:

ID    Time1    Time2    Time3    Time4
01    23       23       NA       NA  
02    21       21       21       NA
03    22       22       25       NA
04    29       29       20       NA
05    NA       NA       15       22
06    NA       NA       11       NA

Now, I want to replace missing values (NA) with the data that is available in other variables. Importantly, I need r to take the value that is 'closest' to the missing data point. E.g., for ID 5, Time1 and Time2 should be "15" (not "22").

Like this:

ID    Time1    Time2    Time3    Time4
01    23       23       23       23  
02    21       21       21       21
03    22       22       25       25
04    29       29       20       20
05    15       15       15       22
06    11       11       11       11

I've tried ifelse statements, but this did not work out.

Thanks!

Frank · Accepted Answer

With data.table's rolling joins and set:

library(data.table)
good = as.data.table( which(!is.na(df[-1]), arr.ind = TRUE) )
all = CJ(row = seq(nrow(df)), col = seq(2L, ncol(df)))
good$col = good$col + 1L
good$col_src = good$col

changes = good[all, on = c("row", "col"), roll="nearest"][ col != col_src ]

changes[, {
  set(df, i = row, j = col, value = df[[ col_src ]][row])
  NULL
}, by=.(col,col_src)]

# based on input from bgoldst's answer
   ID  1  2  3  4
1: 01 23 23 23 23
2: 02 21 21 21 21
3: 03 22 22 25 25
4: 04 NA NA NA NA
5: 05 29 29 20 20
6: 06 15 15 15 22
7: 07 11 11 11 11
8: 08  1  1  2  2

We find all entries to switch and then modify by reference with set. I'm not sure how roll="nearest" handles ties, but I'm sure that can be tweaked.

bgoldst · Answer

This is much more difficult that it looks. I built a solution that works on one column at a time, taking the pmin() of the absolute distance between all time column indexes and the current column index, stripping NAs with the na.rm=T argument. The result can then be used to index the original time columns using an index matrix, which can then be assigned to the current column index in the target data.frame.

An advantage of this design is that it's fully vectorized over the rows. In other words, it doesn't iterate over one row at a time. This could be an advantage for extremely row-heavy inputs. On the other hand, the solution does involve building matrices that parallel all time columns (timemat, nacols, and off), which could be expensive for large inputs. It's basically trading away memory to save CPU.

I added a couple of rows to test additional cases not covered by the OP's sample data.frame; specifically (1) an all-NA row, and (2) a row with candidate non-NA values on either side of NA values.

Input:

df <- data.frame(ID=c('01','02','03','04','05','06','07','08'),Time1=c(23L,21L,22L,NA,29L,NA,NA,1L),Time2=c(23L,21L,22L,NA,29L,NA,NA,NA),Time3=c(NA,21L,25L,NA,20L,15L,11L,NA),Time4=c(NA,NA,NA,NA,NA,22L,NA,2L),stringsAsFactors=F);
df;
##   ID Time1 Time2 Time3 Time4
## 1 01    23    23    NA    NA
## 2 02    21    21    21    NA
## 3 03    22    22    25    NA
## 4 04    NA    NA    NA    NA
## 5 05    29    29    20    NA
## 6 06    NA    NA    15    22
## 7 07    NA    NA    11    NA
## 8 08     1    NA    NA     2

Solution:

ris <- seq_len(nrow(df));
cis <- grep('^Time',names(df));
timemat <- as.matrix(df[cis]);
nacols <- as.data.frame(ifelse(is.na(timemat),NA,col(timemat)));
nacols;
##   Time1 Time2 Time3 Time4
## 1     1     2    NA    NA
## 2     1     2     3    NA
## 3     1     2     3    NA
## 4    NA    NA    NA    NA
## 5     1     2     3    NA
## 6    NA    NA     3     4
## 7    NA    NA     3    NA
## 8     1    NA    NA     4
for (ci in seq_len(ncol(timemat))) {
    off <- abs(nacols-ci);
    best <- which(off==do.call(pmin,c(off,na.rm=T)),arr.ind=T);
    df[cis[ci]] <- timemat[matrix(c(ris,best[match(ris,best[,'row']),'col']),nrow(df))];
};
df;
##   ID Time1 Time2 Time3 Time4
## 1 01    23    23    23    23
## 2 02    21    21    21    21
## 3 03    22    22    25    25
## 4 04    NA    NA    NA    NA
## 5 05    29    29    20    20
## 6 06    15    15    15    22
## 7 07    11    11    11    11
## 8 08     1     1     2     2

Rcpp solution:

library(Rcpp);
cppFunction('
    IntegerMatrix fillDFNAsWithNearestInRow(DataFrame df, IntegerVector cis ) {
        IntegerMatrix res(df.nrows(),cis.size());
        if (df.nrows()==0 || cis.size()==0) return res;
        IntegerVector cis0 = clone(cis); for (int cisi = 0; cisi < cis0.size(); ++cisi) --cis0[cisi]; // correct from R 1-based to Rcpp 0-based
        for (int cisi = 0; cisi < cis0.size(); ++cisi) {
            IntegerVector colCur = df[cis0[cisi]];
            for (int ri = 0; ri < colCur.size(); ++ri) {
                if (!IntegerVector::is_na(colCur[ri])) {
                    res(ri,cisi) = colCur[ri];
                    continue;
                }
                int leftOk;
                int rightOk;
                IntegerVector colLeft;
                IntegerVector colRight;
                bool set = false; // assumption
                for (int off = 1; (leftOk = cisi-off>=0, rightOk = cisi+off<cis0.size(), leftOk ) || rightOk; ++off) {
                    if (leftOk && (colLeft = df[cis0[cisi-off]], !IntegerVector::is_na(colLeft[ri]))) {
                        res(ri,cisi) = colLeft[ri];
                        set = true;
                        break;
                    } else if (rightOk && (colRight = df[cis0[cisi+off]], !IntegerVector::is_na(colRight[ri]))) {
                        res(ri,cisi) = colRight[ri];
                        set = true;
                        break;
                    }
                }
                if (!set) res(ri,cisi) = NA_INTEGER;
            }
        }
        return res;
    }
');

df <- data.frame(ID=c('01','02','03','04','05','06','07','08'),Time1=c(23L,21L,22L,NA,29L,NA,NA,1L),Time2=c(23L,21L,22L,NA,29L,NA,NA,NA),Time3=c(NA,21L,25L,NA,20L,15L,11L,NA),Time4=c(NA,NA,NA,NA,NA,22L,NA,2L),stringsAsFactors=F);
cis <- grep('^Time',names(df));
df[cis] <- fillDFNAsWithNearestInRow(df,cis);
df;
##   ID Time1 Time2 Time3 Time4
## 1 01    23    23    23    23
## 2 02    21    21    21    21
## 3 03    22    22    25    25
## 4 04    NA    NA    NA    NA
## 5 05    29    29    20    20
## 6 06    15    15    15    22
## 7 07    11    11    11    11
## 8 08     1     1     2     2

r - copy missing values from other variables

Tags:

r

missing-data

HJ_r

2 Answers

Frank

bgoldst

Recent Activity

Donate For Us

r - copy missing values from other variables

Tags:

r

missing-data

HJ_r

2 Answers

Frank

bgoldst

Related questions

Recent Activity

Donate For Us