
R: summarise multiple columns (numeric, character) and remove NAs

Tags: r, dplyr, na.rm

I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.

I need to:

  • remove NAs from numeric columns
  • calculate the mean of each of the numeric columns
  • extract the first element of the character columns

Let's say, we're using modified iris data as below:

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs

I have ~50 columns in total, numeric and character mixed together. I've been trying something like:

library(dplyr)
giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)

What I get is the means of all columns in each group, followed by columns with the first value in each group. The NAs are not handled either, so this is not what I am after.

Help anyone?

rpl asked Jan 18 '16 14:01



1 Answer

You could try this with an if/else in the funs of summarise_each:

library(dplyr)
iris %>% 
  group_by(Species, year) %>% 
  # numeric columns: mean ignoring NAs; all other columns: first value
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Since you also have some NAs in the grouping columns, you could add a filter step first:

iris %>% 
  filter(!is.na(Species) & !is.na(year)) %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
#     Species  year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#      (fctr) (dbl)        (dbl)       (dbl)        (dbl)       (dbl) (chr)
#1     setosa  2000        5.025    3.479167       1.4625       0.250   red
#2     setosa  3000        4.984    3.376000       1.4640       0.244   red
#3 versicolor  2000        6.012    2.776000       4.3120       1.344 green
#4 versicolor  3000        5.860    2.764000       4.2080       1.308 green
#5  virginica  2000        6.576    2.928000       5.6400       2.044  blue
#6  virginica  3000        6.600    3.020000       5.4640       2.008  blue

To avoid picking up an NA as the first value in the color column (or any other non-numeric column), you could change first(.) to first(na.omit(.)).
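
In later dplyr releases summarise_each() and funs() are deprecated, so the same idea could be written with across() instead. The following is only a minimal sketch, assuming dplyr 1.0.0 or newer; the name res is just for illustration, and the data setup repeats the question's example so the snippet runs on its own:

```r
library(dplyr)

# Rebuild the example data from the question
data(iris)
iris$year  <- rep(c(2000, 3000), each = 25)              # grouping column
iris$color <- rep(c("red", "green", "blue"), each = 50)  # character column
iris[1, ]  <- NA                                         # introduce NAs

res <- iris %>%
  # drop rows where a grouping column is NA
  filter(!is.na(Species), !is.na(year)) %>%
  group_by(Species, year) %>%
  summarise(
    # numeric columns: mean, ignoring NAs
    across(where(is.numeric), ~ mean(.x, na.rm = TRUE)),
    # character columns: first non-NA value
    across(where(is.character), ~ first(na.omit(.x))),
    .groups = "drop"
  )

res
```

Note that this puts all numeric summaries before the character ones in the result; here color is the last column anyway, so the column order matches the output above.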


You could also try data.table:

library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
     if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}), 
     by = list(Species, year)]
#      Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1:     setosa 2000        5.025    3.479167       1.4625       0.250   red
#2:     setosa 3000        4.984    3.376000       1.4640       0.244   red
#3: versicolor 2000        6.012    2.776000       4.3120       1.344 green
#4: versicolor 3000        5.860    2.764000       4.2080       1.308 green
#5:  virginica 2000        6.576    2.928000       5.6400       2.044  blue
#6:  virginica 3000        6.600    3.020000       5.4640       2.008  blue
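
One thing to be aware of: setDT() converts iris to a data.table by reference, so the original data.frame is changed in place. If you would rather leave it untouched, a variant along these lines might work. This is a sketch only; dt, res_dt and summarise_col are illustrative names, not part of data.table:

```r
library(data.table)

# Rebuild the example data from the question
data(iris)
iris$year  <- rep(c(2000, 3000), each = 25)
iris$color <- rep(c("red", "green", "blue"), each = 50)
iris[1, ]  <- NA

# as.data.table() makes a copy, so `iris` itself stays a plain data.frame
dt <- as.data.table(iris)

# Hypothetical helper: mean for numeric columns, first non-NA value otherwise
summarise_col <- function(x) {
  if (is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]
}

res_dt <- dt[!is.na(Species) & !is.na(year),
             lapply(.SD, summarise_col),
             by = .(Species, year)]

res_dt
```
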

talat answered Sep 29 '22 14:09
