
Creating a function to replace NAs from one data.frame with values from another

Tags: r, na

I regularly need to replace missing values in a data.frame with values from another data.frame that is at a different level of aggregation. For example, if I have a data.frame full of county data, I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times, I decided to break down and write a function to do this.
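
For a single column, the repeated merge-then-ifelse pattern described above might look roughly like the following. This is only a sketch: the county and state data frames, their column names, and the values are all hypothetical and chosen for illustration.

```r
# Hypothetical county- and state-level data; every name and value is illustrative
countyDf <- data.frame(state  = c("NE", "NE", "IA"),
                       county = c("Adams", "Boone", "Polk"),
                       yield  = c(150, NA, NA))
stateDf <- data.frame(state = c("NE", "IA"), yield = c(160, 170))

# Merge on the shared key; the clashing yield columns become yield.x and yield.y
merged <- merge(countyDf, stateDf, by = "state")

# Keep the county value where present, otherwise fall back to the state value
merged$yield <- ifelse(is.na(merged$yield.x), merged$yield.y, merged$yield.x)

# Drop the suffixed helper columns
merged$yield.x <- NULL
merged$yield.y <- NULL
merged
```

Repeating those last four statements for every column that needs filling is what motivates wrapping the pattern in a function.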

Here's what I cooked up, along with an example of how I use it:

fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
  # merge() suffixes the clashing fill columns with .x (naDf) and .y (fillDf)
  mergedDf <- merge(naDf, fillDf, by=mergeCols)
  for (col in fillCols){
    colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
    colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
    # overwrite only the NA positions with the fill values
    k <- which( is.na( colWithNas ) )
    colWithNas[k] <- colWithOutNas[k]
    mergedDf[col] <- colWithNas
    # drop the suffixed helper columns
    mergedDf[[paste(col, "x", sep=".")]] <- NULL
    mergedDf[[paste(col, "y", sep=".")]] <- NULL
  }
  return(mergedDf)
}

## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4), f = c(100,200,300,400), g = c(11,12,13,14))
## g has length 200, so data.frame() recycles the length-100 columns to give 200 rows
naDf <- data.frame(a = sample(c(1,2), 100, replace=TRUE), b = sample(c(3,4), 100, replace=TRUE), f = sample(c(0,NA), 100, replace=TRUE), g = sample(c(0,NA), 200, replace=TRUE))
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g"))

After I got this running, I had the odd feeling that someone has probably solved this problem before me, and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.

Any guidance or refactoring would be helpful.

EDIT on Dec 2: I realized I had logic flaws in my example, which I have fixed.

JD Long asked Dec 01 '11 23:12




1 Answer

What a great question.

Here's a data.table solution:

# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))


# Merge data.tables, based on their keys (columns a & b).
# Current data.table releases prefix the clashing columns from fillDT with "i."
# (i.f, i.g); older releases named them f.1 and g.1.
outDT <- naDT[fillDT]
#      a b  f  g i.f i.g
# [1,] 1 3 NA  0 100  11
# [2,] 1 3 NA NA 100  11
# [3,] 1 3 NA  0 100  11
# [4,] 1 3  0  0 100  11
# [5,] 1 3  0 NA 100  11
# First 5 rows of 200 printed.

# In outDT[i, j], on the following two lines
#   -- i is a Boolean vector indicating which rows will be operated on
#   -- j is an expression saying "(sub)assign from right column (e.g. i.f) to
#        left column (e.g. f)"
outDT[is.na(f), f := i.f]
outDT[is.na(g), g := i.g]

# Just keep the four columns ultimately needed   
outDT <- outDT[,list(a,b,g,f)]
#       a b  g   f
#  [1,] 1 3  0   0
#  [2,] 1 3 11   0
#  [3,] 1 3  0   0
#  [4,] 1 3 11   0
#  [5,] 1 3 11   0
# First 5 rows of 200 printed.
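
For comparison, here is a minimal base R sketch that addresses the "can the loop go away" part of the question without a merge. It is not part of the data.table approach above; the helper name fillNaByMatch and the small smallNaDf example are hypothetical. It assumes each key combination appears at most once in fillDf, and it leaves a value as NA when no matching fill row exists.

```r
# Hypothetical helper: fill NAs in naDf[fillCols] from the matching rows of fillDf
fillNaByMatch <- function(naDf, fillDf, mergeCols, fillCols) {
  # Collapse the merge columns into one lookup key per row
  makeKey <- function(df) {
    do.call(paste, c(unname(as.list(df[mergeCols])), sep = "\r"))
  }
  # Row of fillDf that matches each row of naDf (NA where there is no match)
  idx <- match(makeKey(naDf), makeKey(fillDf))
  fillVals <- fillDf[idx, fillCols, drop = FALSE]
  # Map() walks the paired columns, so no explicit for loop is needed
  naDf[fillCols] <- Map(function(x, y) ifelse(is.na(x), y, x),
                        naDf[fillCols], fillVals)
  naDf
}

# Small deterministic example (illustrative data only)
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400), g = c(11, 12, 13, 14))
smallNaDf <- data.frame(a = c(1, 2, 1), b = c(3, 3, 4),
                        f = c(NA, 0, NA), g = c(0, NA, NA))
filled <- fillNaByMatch(smallNaDf, fillDf,
                        mergeCols = c("a", "b"), fillCols = c("f", "g"))
filled
```

Because the lookup uses match() rather than merge(), the rows of naDf keep their original order and rows without a counterpart in fillDf are retained rather than dropped.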
Josh O'Brien answered Sep 30 '22 16:09
