replace NA in a dplyr chain

Tags:

r

dplyr

Question has been edited from the original.

After reading this interesting discussion I was wondering how to replace NAs in a column using dplyr in, for example, the Lahman batting data:

    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        NA

The following does not work as I expected:

    library(dplyr)
    library(Lahman)

    df <- Batting[ c("yearID", "teamID", "G_batting") ]
    df <- group_by(df, teamID )
    df$G_batting[is.na(df$G_batting)] <- mean(df$G_batting, na.rm = TRUE)

    Source: local data frame [20 x 3]
    Groups: yearID, teamID

       yearID teamID G_batting
    1    2004    SFN  11.00000
    2    2006    CHN  43.00000
    3    2007    CHA   2.00000
    4    2008    BOS   5.00000
    5    2009    SEA   3.00000
    6    2010    SEA   4.00000
    7    2012    NYA  **49.07894**

    > mean(Batting$G_batting, na.rm = TRUE)
    [1] **49.07894**

In fact, it imputed the overall mean and not the group mean. How would you do this in a dplyr chain? Using transform from base R does not work either: it also imputes the overall mean rather than the group mean, and it converts the data to a regular data frame. Is there a better way to do this?

    df %.%
      group_by( yearID ) %.%
      transform(G_batting = ifelse(is.na(G_batting),
        mean(G_batting, na.rm = TRUE),
        G_batting)
      )
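For comparison, here is a minimal base R sketch of what goes wrong and one way around it, using a small made-up data frame (`toy`, not the Lahman data). Assigning through `$<-` sees the whole column, so any grouping is ignored, whereas `ave()` applies a function within each level of a grouping variable. The helper name `impute_mean` is hypothetical.

```r
# Toy data standing in for the batting table (values are made up).
toy <- data.frame(
  teamID    = c("SEA", "SEA", "NYA", "NYA", "NYA"),
  G_batting = c(3, 5, 40, NA, 60)
)

# Hypothetical helper: replace NAs in a vector with that vector's mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Whole-column assignment: the NA gets the overall mean, (3 + 5 + 40 + 60) / 4.
overall <- toy
overall$G_batting[is.na(overall$G_batting)] <- mean(overall$G_batting, na.rm = TRUE)

# ave() calls impute_mean() separately for each teamID, so the NA
# gets the NYA mean, (40 + 60) / 2, instead.
by_group <- toy
by_group$G_batting <- ave(toy$G_batting, toy$teamID, FUN = impute_mean)
```

The only difference between the two results is the value that lands in the missing NYA row.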

Edit: Replacing transform with mutate gives the following error:

    Error in mutate_impl(.data, named_dots(...), environment()) :
      INTEGER() can only be applied to a 'integer', not a 'double'

Edit: Adding as.integer seems to resolve the error and does produce the expected result. See also @eddi's answer.

    df %.%
      group_by( teamID ) %.%
      mutate(G_batting = ifelse(is.na(G_batting), as.integer(mean(G_batting, na.rm = TRUE)), G_batting))

    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        47

    > mean_NYA <- mean(filter(df, teamID == "NYA")$G_batting, na.rm = TRUE)
    > as.integer(mean_NYA)
    [1] 47

Edit: Following up on @Romain's comment I installed dplyr from github:

    > head(df,10)
       yearID teamID G_batting
    1    2004    SFN        11
    2    2006    CHN        43
    3    2007    CHA         2
    4    2008    BOS         5
    5    2009    SEA         3
    6    2010    SEA         4
    7    2012    NYA        NA
    8    1954    ML1       122
    9    1955    ML1       153
    10   1956    ML1       153

    > df %.%
    +   group_by(teamID)  %.%
    +   mutate(G_batting = ifelse(is.na(G_batting), mean(G_batting, na.rm = TRUE), G_batting))
    Source: local data frame [96,600 x 3]
    Groups: teamID

       yearID teamID  G_batting
    1    2004    SFN          0
    2    2006    CHN          0
    3    2007    CHA          0
    4    2008    BOS          0
    5    2009    SEA          0
    6    2010    SEA 1074266112
    7    2012    NYA   90693125
    8    1954    ML1        122
    9    1955    ML1        153
    10   1956    ML1        153
    ..    ...    ...        ...

So I didn't get the error (good) but I got a (seemingly) strange result.


asked Feb 11 '14 22:02

Vincent

1 Answer

The main issue you're having is that mean returns a double while the G_batting column is an integer. Wrapping the mean in as.integer works (at the cost of truncating the mean to a whole number); the alternative is to convert the entire column to numeric first.
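The type mismatch is easy to see in plain R. The sketch below uses a made-up integer vector `g` in place of the `G_batting` column. It shows that `mean()` returns a double, that `as.integer()` truncates the mean toward zero, and that converting the column to numeric first keeps the fractional part.

```r
# Made-up integer column with one missing value.
g <- c(11L, 43L, 2L, NA, 5L)

typeof(g)                      # "integer"
m <- mean(g, na.rm = TRUE)     # 61 / 4 = 15.25
typeof(m)                      # "double"

# Option 1: keep the column integer by truncating the mean.
g_int <- ifelse(is.na(g), as.integer(m), g)

# Option 2: convert the column to numeric so the exact mean fits.
g_num <- as.numeric(g)
g_num[is.na(g_num)] <- mean(g_num, na.rm = TRUE)
```

Which option is preferable depends on whether a fractional games count is acceptable in the imputed rows.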

That said, here are a couple of data.table alternatives - I didn't check which one is faster.

    library(data.table)

    # using ifelse
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b := ifelse(is.na(b), mean(b, na.rm = TRUE), b), by = a]

    # using a temporary column
    dt = data.table(a = 1:2, b = c(1,2,NA,NA,3,4,5,6,7,8))
    dt[, b.mean := mean(b, na.rm = TRUE), by = a][is.na(b), b := b.mean][, b.mean := NULL]

And this is what I'd want to do ideally (there is a feature request about this):

    # again, atm this is pure fantasy and will not work
    dt[, b[is.na(b)] := mean(b, na.rm = TRUE), by = a]

The dplyr version of the ifelse is (as in OP):

    dt %>%
      group_by(a) %>%
      mutate(b = ifelse(is.na(b), mean(b, na.rm = TRUE), b))
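When the column is an integer, as `G_batting` is in the question, the same chain can be sketched as below with a current dplyr release. The data frame `toy` and its values are made up, and converting the column with `as.numeric()` before imputing is just one way to sidestep the integer/double mismatch, not the only option.

```r
library(dplyr)

# Made-up data: an integer column with one NA per group.
toy <- data.frame(
  a = c(1L, 1L, 1L, 2L, 2L, 2L),
  b = c(10L, NA, 21L, 1L, 2L, NA)
)

imputed <- toy %>%
  group_by(a) %>%
  mutate(
    b = as.numeric(b),                              # integer -> double
    b = ifelse(is.na(b), mean(b, na.rm = TRUE), b)  # group mean within each `a`
  ) %>%
  ungroup()
```

The final `ungroup()` is there so that later steps are not silently computed per group.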

I'm not sure how to implement the second data.table idea in a single line in dplyr. I'm also not sure how you can stop dplyr from scrambling/ordering the data (aside from creating an index column).
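One possible dplyr rendering of the temporary-column idea is sketched below, assuming a recent dplyr and using a plain data frame with the same toy values as the `data.table` example. The column names `b_mean` and `row_id` are made up for illustration. `row_id` plays the role of the explicit index column: it restores the original order in case an intermediate step reorders the rows.

```r
library(dplyr)

# Same toy values as the data.table example, as a plain data frame.
dt <- data.frame(a = rep(1:2, 5), b = c(1, 2, NA, NA, 3, 4, 5, 6, 7, 8))

res <- dt %>%
  mutate(row_id = row_number()) %>%           # remember the original order
  group_by(a) %>%
  mutate(b_mean = mean(b, na.rm = TRUE)) %>%  # temporary per-group mean
  ungroup() %>%
  mutate(b = ifelse(is.na(b), b_mean, b)) %>% # fill NAs from the helper column
  arrange(row_id) %>%                         # restore the original order
  select(-b_mean, -row_id)                    # drop the helper columns
```

This is a chain rather than a single line, so it does not fully answer the "single line" wish, but it keeps the imputation step free of any grouped computation.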

answered Oct 14 '22 06:10

eddi
