
R group by, counting non-NA values

Tags: r, na

I have a data frame with NA values scattered through it:

toy_df
# Y  X1 X2 Label
# 5  3  3  A
# 3  NA 2  B
# 3  NA NA C
# 2  NA 6  B
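
For anyone who wants to run the snippets on this page, the toy data can be rebuilt roughly as follows. This is a sketch: the column types (numeric values, a character Label) are an assumption, since only the printed table is given.

```r
# Rebuild the toy data frame shown above.
# Assumption: Y, X1, X2 are numeric and Label is a character column.
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

print(toy_df)
```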

I want to group this by the Label column and count how many non-NA values each variable has for each label.

Desired output:
# Label Y  X1 X2
# A     1  1  1
# B     2  0  2
# C     1  0  0

I've done this using loops at the moment, but it's slow and untidy and I'm sure there's a better way.

aggregate seems to get halfway there, but it includes NAs in the count:

aggregate(toy_df, list(toy_df$Label), FUN=length)
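
The over-counting can be reproduced in a few lines. The sketch below rebuilds the toy data (column types are assumed) and restricts the call to the three value columns; length() counts every element of a group, NA or not, so X1 for label B comes out as 2 even though both of its values are missing.

```r
# Toy data as in the question (column types are an assumption).
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# length() returns the group size, so NA entries are included in the count.
res_length <- aggregate(toy_df[1:3], list(Label = toy_df$Label), FUN = length)
print(res_length)

# On a single vector the difference is easy to see:
x <- c(NA, NA)
length(x)        # counts both elements
sum(!is.na(x))   # counts only the non-missing ones
```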

Any ideas appreciated...

asked Dec 14 '16 19:12 by tea_pea



2 Answers

We can use data.table. Convert the data.frame to a data.table with setDT(toy_df), group by Label, then loop over the columns of the subset of data (.SD) and sum the logical vector !is.na(x) to count the non-NA values in each column.

library(data.table)
setDT(toy_df)[, lapply(.SD, function(x) sum(!is.na(x))), by = Label]
#   Label Y X1 X2
#1:     A 1  1  1
#2:     B 2  0  2
#3:     C 1  0  0

Or with dplyr using the same methodology

library(dplyr)
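# summarise_each() and funs() are deprecated; across() (dplyr >= 1.0.0) is the current equivalent
toy_df %>%
      group_by(Label) %>%
      summarise(across(everything(), ~ sum(!is.na(.x))))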

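Or a base R option with by and colSums: build the logical matrix !is.na(toy_df[-4]) and sum its columns within each group of the 4th column (Label). This assumes toy_df is still a plain data.frame; after setDT(), toy_df[-4] would drop the 4th row rather than the 4th column.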

by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)
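
The by() call returns one named vector per group rather than a single table. If a matrix is more convenient, the pieces can be bound together with do.call(rbind, ...). This is a sketch on a rebuilt copy of the toy data (column types assumed); counts_by and counts_mat are names chosen here for illustration.

```r
# Toy data as in the question (column types are an assumption).
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# One vector of non-NA counts per Label group.
counts_by <- by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)

# Stack the per-group vectors into a single matrix, one row per group.
counts_mat <- do.call(rbind, counts_by)
print(counts_mat)
```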

Or use rowsum on the same logical matrix. The unary + converts the matrix from logical to integer, because rowsum needs numeric input.

rowsum(+(!is.na(toy_df[-4])), group=toy_df[,4])
#  Y X1 X2
#A 1  1  1
#B 2  0  2
#C 1  0  0
answered Oct 21 '22 09:10 by akrun


Or in base R, using aggregate with a function that counts the non-NA values in each group:

aggregate(toy_df[,1:3], by=list(toy_df$Label), FUN=function(x) { sum(!is.na(x))})
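
The same idea can also be written with the formula interface of aggregate, which names the grouping column Label instead of Group.1. One caveat: the formula method drops rows containing NA by default (na.action = na.omit), so na.action = na.pass has to be set or the counts will be wrong. A sketch on a rebuilt copy of the toy data (column types assumed; res_formula is a name chosen here):

```r
# Toy data as in the question (column types are an assumption).
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# na.pass keeps rows with NA so that the function can count them itself.
res_formula <- aggregate(cbind(Y, X1, X2) ~ Label, data = toy_df,
                         FUN = function(x) sum(!is.na(x)),
                         na.action = na.pass)
print(res_formula)
```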
answered Oct 21 '22 09:10 by G5W