R data.table replace NA with mean for numeric columns and most frequent value for nominal values

Tags: r, lapply, missing-data, data.table

I have the following data.table:

library(data.table)
x <- data.table(id1 = c("a", "a", "a", "b", "b", NA), id2 = c(2, 3, NA, 3, 4, 5))

I'm trying to replace the NAs in each column using a different strategy per column type. For numeric columns I want to replace them with the column mean, and for factor or character columns I want to replace them with the most frequent value. I tried the following, but it does nothing.

for (j in 1:ncol(x)){
  if(is.numeric(unlist(x[,j,with=FALSE]))){
     m = mean(unlist(x[,j,with=FALSE]))
     set(x,which(is.na(x[[j]])),j,m)
   }else{
     m = sort(table(x),decreasing=TRUE)[[1]]
     set(x,which(is.na(x[[j]])),j,m)
}

asked Apr 06 '15 02:04

broccoli

1 Answer

Using base R, you can write a function that handles a single column:

myFun <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
    x
  }
}

... and apply it to every column with the following, which returns a new data.table and leaves x unchanged:

x[, lapply(.SD, myFun)]
#    id1 id2
# 1:   a 2.0
# 2:   a 3.0
# 3:   a 3.4
# 4:   b 3.0
# 5:   b 4.0
# 6:   a 5.0

Note that which.max returns the first maximum, so in case of ties the value that comes first in the table (which is sorted by value) is used.
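To make the tie behaviour concrete, here is a small sketch. The vector `v` is made up for illustration and is not part of the question's data; it only shows what `table()` and `which.max()` do when two values are equally frequent.

```r
# Hypothetical vector with a tie: "a" and "b" each appear twice.
v <- c("b", "a", "b", "a", NA)

# table() drops NA by default and orders its entries by value: a, b
counts <- table(v)

# which.max() returns the position of the first maximum, so "a" wins the tie
most_frequent <- names(which.max(counts))
most_frequent
# [1] "a"

# To inspect every tied value instead of only the first one:
tied <- names(counts)[as.vector(counts) == max(counts)]
tied
# [1] "a" "b"
```

If the first value in sort order is not what you want for ties, `tied` gives you the candidates to choose from with whatever rule fits your data.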

I guess the same logic could alternatively be written as a function that updates the data.table by reference with set(), something like:

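myFun <- function(inDT) {
  for (i in seq_along(inDT)) {
    temp <- inDT[[i]]  # column i as a plain vector
    # Replacement value: mean for numeric columns, most frequent value otherwise
    value <- if (is.numeric(temp)) {
      mean(temp, na.rm = TRUE)
    } else {
      names(which.max(table(temp)))
    }
    # Overwrite only the NA rows of column i, by reference
    set(inDT, which(is.na(temp)), i, value)
  }
  inDT
}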

y <- copy(x)

myFun(y)
#    id1 id2
# 1:   a 2.0
# 2:   a 3.0
# 3:   a 3.4
# 4:   b 3.0
# 5:   b 4.0
# 6:   a 5.0
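A third variant, sketched here as one possible option rather than as part of the original answer, keeps the per-column function and assigns the result back with `:=`. The names `impute` and `cols` are made up for this example. Since `:=` without a row selector replaces whole columns, the numeric branch should also be safe for integer columns, which would simply come back as double.

```r
library(data.table)

# Same data as in the question
x <- data.table(id1 = c("a", "a", "a", "b", "b", NA),
                id2 = c(2, 3, NA, 3, 4, 5))

# Hypothetical helper: impute one column (mean for numeric, mode otherwise)
impute <- function(v) {
  if (is.numeric(v)) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
  } else {
    v[is.na(v)] <- names(which.max(table(v)))
  }
  v
}

# Replace every column of x in place with its imputed version
cols <- names(x)
x[, (cols) := lapply(.SD, impute), .SDcols = cols]

anyNA(x)
# [1] FALSE
```

To impute only some columns, set `cols` to that subset; the other columns are left untouched.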

answered Oct 12 '22 13:10

A5C1D2H2I1M1N2O1R2T1
