 

replace NA in a dplyr chain

Tags:

r

dplyr

Question has been edited from the original.

After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:

    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        NA

The following does not work as I expected

    library(dplyr)
    library(Lahman)

    df <- Batting[ c("yearID", "teamID", "G_batting") ]
    df <- group_by(df, teamID )
    df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)

    Source: local data frame [20 x 3]
    Groups: yearID, teamID

       yearID teamID G_batting
    1    2004    SFN  11.00000
    2    2006    CHN  43.00000
    3    2007    CHA   2.00000
    4    2008    BOS   5.00000
    5    2009    SEA   3.00000
    6    2010    SEA   4.00000
    7    2012    NYA  **49.07894**

    > mean(Batting$G_batting, na.rm = TRUE)
    [1] **49.07894**

In fact, it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work: it too imputes the overall mean rather than the group mean, and it converts the data to a regular data frame. Is there a better way to do this?

    df %.%
      group_by( yearID ) %.%
      transform(G_batting = ifelse(is.na(G_batting),
        mean(G_batting, na.rm = TRUE),
        G_batting)
      )

Edit: Replacing transform with mutate gives the following error

    Error in mutate_impl(.data, named_dots(...), environment()) :
      INTEGER() can only be applied to a 'integer', not a 'double'

Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.

    df %.%
      group_by( teamID ) %.%
      mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))

    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        47

    > mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
    > as.integer(mean_NYA)
    [1] 47

Edit: Following up on @Romain's comment, I installed dplyr from GitHub:

    > head(df,10)
       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        NA
    8    1954    ML1       122
    9    1955    ML1       153
    10   1956    ML1       153

    > df %.%
    +   group_by(teamID)  %.%
    +   mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID  G_batting
    1    2004    SFN          0
    2    2006    CHN          0
    3    2007    CHA          0
    4    2008    BOS          0
    5    2009    SEA          0
    6    2010    SEA 1074266112
    7    2012    NYA   90693125
    8    1954    ML1        122
    9    1955    ML1        153
    10   1956    ML1        153
    ..    ...    ...        ...

So I didn't get the error (good) but I got a (seemingly) strange result.

Vincent asked Feb 11 '14 22:02




1 Answer

The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work (at the cost of truncating the mean to a whole number), or you'd need to convert the entire column to numeric.
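To illustrate the second option, here is a minimal sketch that converts the column to numeric before imputing, so the group mean is kept without truncation. The toy data frame is a hypothetical stand-in for the Lahman columns, and the code assumes a current dplyr release with the `%>%` pipe rather than the older `%.%` operator used in the question.

```r
library(dplyr)

# Hypothetical toy data standing in for the Lahman batting columns
df <- data.frame(
  teamID    = c("SEA", "SEA", "SEA", "NYA", "NYA"),
  G_batting = c(3L, 4L, NA, 10L, NA)
)

# The column is integer, but mean() returns a double
col_type  <- typeof(df$G_batting)
mean_type <- typeof(mean(df$G_batting, na.rm = TRUE))

filled <- df %>%
  # Convert first so the column can hold a non-integer group mean
  mutate(G_batting = as.numeric(G_batting)) %>%
  group_by(teamID) %>%
  # Inside a grouped mutate(), mean() is computed per team
  mutate(G_batting = ifelse(is.na(G_batting),
                            mean(G_batting, na.rm = TRUE),
                            G_batting)) %>%
  ungroup()
```

With this toy data the missing SEA value becomes 3.5 and the missing NYA value becomes 10, whereas `as.integer` would have stored 3 for SEA.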

That said, here are a couple of data.table alternatives - I didn't check which one is faster.

    library(data.table)

    # using ifelse
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b := ifelse(is.na(b), mean(b, na.rm = T), b), by = a]

    # using a temporary column
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b.mean := mean(b, na.rm = T), by = a][is.na(b), b := b.mean][, b.mean := NULL]

And this is what I'd want to do ideally (there is a feature request about this):

    # again, atm this is pure fantasy and will not work
    dt[, b[is.na(b)] := mean(b, na.rm = T), by = a]

The dplyr version of the ifelse is (as in OP):

    dt %>%
      group_by(a) %>%
      mutate(b = ifelse(is.na(b), mean(b, na.rm = T), b))

I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
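One possible way to mirror the temporary-column idea in dplyr is sketched below. It assumes a dplyr version that provides `coalesce()`, and it uses a plain data frame with the same values as the data.table example; this is an illustration and not something taken from the original answer.

```r
library(dplyr)

# Same values as the data.table example, built as a plain data frame
dt <- data.frame(a = rep(1:2, 5), b = c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8))

res <- dt %>%
  group_by(a) %>%
  # Temporary column holding each group's mean
  mutate(b.mean = mean(b, na.rm = TRUE)) %>%
  # Fill missing b from the temporary column
  mutate(b = coalesce(b, b.mean)) %>%
  ungroup() %>%
  # Drop the temporary column again
  select(-b.mean)
```

The two `mutate()` steps could also be merged into `mutate(b = coalesce(b, mean(b, na.rm = TRUE)))`. In current dplyr versions a grouped `mutate()` keeps the rows in their original order, so no index column is needed here.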

eddi answered Oct 14 '22 06:10