Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
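For context, the one-off version of that pattern looks roughly like the sketch below. The `county` and `state` data frames, their column names, and the numbers are made up purely for illustration.

```r
# Hypothetical county-level data with gaps, and state-level values to fall back on
county <- data.frame(state  = c("KS", "KS", "MO"),
                     county = c("Riley", "Geary", "Boone"),
                     income = c(50, NA, NA))
state  <- data.frame(state  = c("KS", "MO"),
                     income = c(48, 45))

# Merge on the shared key; the state-level column gets the ".state" suffix
merged <- merge(county, state, by = "state", suffixes = c("", ".state"))

# Keep the county value where present, otherwise fall back to the state value
merged$income <- ifelse(is.na(merged$income), merged$income.state, merged$income)
merged$income.state <- NULL
```

Repeating those last three lines for every column that needs filling is what gets tedious.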

Here's what I cooked up, along with an example of how I use it:

fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
 # shared fill columns come back from merge() suffixed .x (naDf) and .y (fillDf)
 mergedDf <- merge(naDf, fillDf, by=mergeCols)
 for (col in fillCols){
   colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
   colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
   # overwrite only the NA positions with the fill values
   k <- which( is.na( colWithNas ) )
   colWithNas[k] <- colWithOutNas[k]
   mergedDf[col] <- colWithNas
   # drop the suffixed helper columns
   mergedDf[[paste(col, "x", sep=".")]] <- NULL
   mergedDf[[paste(col, "y", sep=".")]] <- NULL
 }
 return(mergedDf)
}

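## test case
## (g has length 200, so data.frame() recycles the length-100 columns and naDf ends up with 200 rows)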
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )

So after I got this running I had this odd feeling that someone has probably solved this problem before me, and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
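One possible loop-free direction in base R is sketched below. It looks up each row's match with `match()` on a pasted key and fills every NA in one assignment using a logical matrix. The name `fillNaByMatch` is hypothetical, and the sketch assumes that the key values never contain "|" and that the fill columns all share one type (numeric here).

```r
fillNaByMatch <- function(naDf, fillDf, mergeCols, fillCols) {
  # One composite key per row (assumes "|" never appears in the key values)
  naKey   <- do.call(paste, c(naDf[mergeCols],   sep = "|"))
  fillKey <- do.call(paste, c(fillDf[mergeCols], sep = "|"))

  # For each row of naDf, the matching row of fillDf (NA if there is none)
  idx <- match(naKey, fillKey)

  target <- naDf[fillCols]
  source <- fillDf[idx, fillCols, drop = FALSE]

  # Logical matrix marking every NA cell across all fill columns at once
  holes <- is.na(target)
  target[holes] <- source[holes]

  naDf[fillCols] <- target
  naDf
}

## small deterministic check
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4),
                     f = c(100,200,300,400), g = c(11,12,13,14))
smallNaDf <- data.frame(a = c(1,2,1), b = c(3,3,4),
                        f = c(NA,0,NA), g = c(0,NA,NA))
filled <- fillNaByMatch(smallNaDf, fillDf, mergeCols = c("a","b"), fillCols = c("f","g"))
```

Unlike the `merge()` version, this keeps the rows of `naDf` in their original order, and rows with no match in `fillDf` are kept with their NAs rather than dropped.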

Any guidance or refactoring would be helpful.

EDIT on Dec 2: I realized I had logic flaws in my example, which I have fixed.

asked Dec 01 '11 23:12

JD Long

1 Answer

What a great question.

Here's a data.table solution:

# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))


# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]    
#      a b  f  g f.1 g.1
# [1,] 1 3 NA  0 100  11
# [2,] 1 3 NA NA 100  11
# [3,] 1 3 NA  0 100  11
# [4,] 1 3  0  0 100  11
# [5,] 1 3  0 NA 100  11
# First 5 rows of 200 printed.

# In outDT[i, j], on the following two lines
#   -- i is a Boolean vector indicating which rows will be operated on
#   -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
#        left column (e.g. f)"
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]

# Just keep the four columns ultimately needed   
outDT <- outDT[,list(a,b,g,f)]
#       a b  g   f
#  [1,] 1 3  0   0
#  [2,] 1 3 11   0
#  [3,] 1 3  0   0
#  [4,] 1 3 11   0
#  [5,] 1 3 11   0
# First 5 rows of 200 printed.
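
A note for anyone running this on a current data.table release: as far as I know, the clashing columns from fillDT now come out of naDT[fillDT] with an `i.` prefix (`i.f`, `i.g`) rather than as `f.1` and `g.1`, so the two `:=` lines above would need adjusting. A minimal sketch of the same idea as an update join, assuming a data.table version that provides `fcoalesce()` (1.12.4 or later); the small `naDT` here is made up so the result is easy to check:

```r
library(data.table)

fillDT <- data.table(a = c(1,2,1,2), b = c(3,3,4,4),
                     f = c(100,200,300,400), g = c(11,12,13,14))
naDT   <- data.table(a = c(1,2,1), b = c(3,3,4),
                     f = c(NA,0,NA), g = c(0,NA,NA))

# Update join: for rows of naDT that match fillDT on a and b, keep the
# existing value where present and otherwise take fillDT's value (i.f, i.g).
# naDT is modified by reference; no intermediate merged table is created.
naDT[fillDT, on = c("a", "b"),
     `:=`(f = fcoalesce(f, i.f),
          g = fcoalesce(g, i.g))]
```

Because the assignment happens by reference, there are no extra columns to drop afterwards and naDT keeps its original row order.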

answered Sep 30 '22 16:09

Josh O'Brien
