Question has been edited from the original.
After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:
Source: local data frame [96,600 x 3]
Groups: teamID

  yearID teamID G_batting
1   2004    SFN        11
2   2006    CHN        43
3   2007    CHA         2
4   2008    BOS         5
5   2009    SEA         3
6   2010    SEA         4
7   2012    NYA        NA
The following does not work as I expected
library(dplyr)
library(Lahman)

df <- Batting[ c("yearID", "teamID", "G_batting") ]
df <- group_by(df, teamID )
df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)
Source: local data frame [20 x 3]
Groups: yearID, teamID

  yearID teamID G_batting
1   2004    SFN  11.00000
2   2006    CHN  43.00000
3   2007    CHA   2.00000
4   2008    BOS   5.00000
5   2009    SEA   3.00000
6   2010    SEA   4.00000
7   2012    NYA  **49.07894**

> mean(Batting$G_batting, na.rm = TRUE)
[1] **49.07894**
In fact it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R also does not work, as it imputes the overall mean rather than the group mean. This approach also converts the data to a regular data frame. Is there a better way to do this?
df %.%
  group_by( yearID ) %.%
  transform(G_batting = ifelse(is.na(G_batting),
                               mean(G_batting, na.rm = TRUE),
                               G_batting))
Edit: Replacing transform with mutate gives the following error:
Error in mutate_impl(.data, named_dots(...), environment()) : INTEGER() can only be applied to a 'integer', not a 'double'
Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.
df %.%
  group_by( teamID ) %.%
  mutate(G_batting = ifelse(is.na(G_batting),
                            as.integer(mean(G_batting, na.rm = TRUE)),
                            G_batting))

Source: local data frame [96,600 x 3]
Groups: teamID

  yearID teamID G_batting
1   2004    SFN        11
2   2006    CHN        43
3   2007    CHA         2
4   2008    BOS         5
5   2009    SEA         3
6   2010    SEA         4
7   2012    NYA        47

> mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
> as.integer(mean_NYA)
[1] 47
Edit: Following up on @Romain's comment I installed dplyr from github:
> head(df,10)
   yearID teamID G_batting
1    2004    SFN        11
2    2006    CHN        43
3    2007    CHA         2
4    2008    BOS         5
5    2009    SEA         3
6    2010    SEA         4
7    2012    NYA        NA
8    1954    ML1       122
9    1955    ML1       153
10   1956    ML1       153

> df %.%
+   group_by(teamID) %.%
+   mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
Source: local data frame [96,600 x 3]
Groups: teamID

   yearID teamID  G_batting
1    2004    SFN          0
2    2006    CHN          0
3    2007    CHA          0
4    2008    BOS          0
5    2009    SEA          0
6    2010    SEA 1074266112
7    2012    NYA   90693125
8    1954    ML1        122
9    1955    ML1        153
10   1956    ML1        153
..    ...    ...        ...
So I didn't get the error (good) but I got a (seemingly) strange result.
You can replace NA values with zero (0) in the numeric columns of an R data frame by using is.na(), replace(), imputeTS::na_replace(), dplyr::coalesce(), dplyr::mutate_at(), dplyr::mutate_if(), and tidyr::replace_na().
The classic way to replace NAs in R is with the is.na() function. is.na() takes a vector or data frame as input and returns a logical object that indicates whether each value is missing (TRUE or FALSE). You can then use this logical object to select the missing values and assign zero to them.
To replace NA with 0 in an R data frame, use is.na() to select all the NA values and assign 0 to them. Here myDataframe is the data frame in which you would like to replace all NAs with 0.
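As a minimal sketch of that pattern in base R (the contents of myDataframe below are made up purely for illustration, and the approach assumes all columns are numeric):

```r
# Hypothetical example data; the columns x and y are invented for illustration.
myDataframe <- data.frame(x = c(1, NA, 3), y = c(NA, 5, 6))

# is.na() returns a logical matrix marking the missing cells;
# indexing with it selects exactly those cells for assignment.
myDataframe[is.na(myDataframe)] <- 0

print(myDataframe)
```

Note that this fills every NA with the same constant, which is a different goal from the per-group mean imputation asked about in the question.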
The main issue you're having is that mean returns a double while the G_batting column is an integer. So wrapping the mean in as.integer would work, or you'd need to convert the entire column to numeric, I guess.
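A rough sketch of the second option, converting the column to numeric before imputing, might look like the following. It uses a small made-up data frame in place of the Batting subset, and the current %>% pipe rather than the older %.% operator from the question; treat it as an illustration under those assumptions rather than a tested fix for the dplyr version discussed here.

```r
library(dplyr)

# Hypothetical toy data standing in for the Batting subset;
# G_batting is deliberately stored as integer, as in the question.
df <- data.frame(
  teamID    = c("SEA", "SEA", "SEA", "NYA", "NYA"),
  G_batting = c(3L, 4L, NA, 10L, NA)
)

result <- df %>%
  mutate(G_batting = as.numeric(G_batting)) %>%  # make the column double first
  group_by(teamID) %>%
  mutate(G_batting = ifelse(is.na(G_batting),
                            mean(G_batting, na.rm = TRUE),  # per-group mean
                            G_batting)) %>%
  ungroup()

print(result)
```

Because the column is already a double when the group mean is assigned, the imputed values keep their fractional part instead of being truncated by as.integer.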
That said, here are a couple of data.table alternatives - I didn't check which one is faster.
library(data.table)

# using ifelse
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b := ifelse(is.na(b), mean(b, na.rm = TRUE), b), by = a]

# using a temporary column
dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
dt[, b.mean := mean(b, na.rm = TRUE), by = a][is.na(b), b := b.mean][, b.mean := NULL]
And this is what I'd want to do ideally (there is an FR about this):
# again, atm this is pure fantasy and will not work
dt[, b[is.na(b)] := mean(b, na.rm = TRUE), by = a]
The dplyr version of the ifelse is (as in OP):
dt %>%
  group_by(a) %>%
  mutate(b = ifelse(is.na(b), mean(b, na.rm = TRUE), b))
I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
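For what it's worth, the temporary-column idea can be approximated in a dplyr chain, though not as a one-liner. The sketch below is only one possible translation: it assumes a dplyr release that provides coalesce() and the %>% pipe, uses a plain data frame with the same toy values as the data.table example, and reuses the b.mean name from that example.

```r
library(dplyr)

# Same toy values as the data.table example, as a plain data frame.
dt <- data.frame(a = rep(1:2, 5), b = c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8))

res <- dt %>%
  group_by(a) %>%
  mutate(b.mean = mean(b, na.rm = TRUE)) %>%  # temporary per-group mean
  mutate(b = coalesce(b, b.mean)) %>%         # fill NAs from the temporary column
  ungroup() %>%
  select(-b.mean)                             # drop the temporary column

print(res)
```

coalesce() takes the first non-missing value at each position, so it plays the role of the is.na(b) filter in the data.table version.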