I have a dataframe that has a scattering of NA's
toy_df
# Y X1 X2 Label
# 5 3 3 A
# 3 NA 2 B
# 3 NA NA C
# 2 NA 6 B
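For reproducibility, here is a minimal sketch that builds this data frame. The values are the ones shown above; the column types (numeric columns, character Label) are an assumption.

```r
# Rebuild the toy data shown above.
# Column types are assumed: numeric for Y, X1, X2 and character for Label.
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)
```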
I want to group this by the Label field and count how many non-NA values each variable has for each label.
desired output:
# Label Y X1 X2
# A 1 1 1
# B 2 0 2
# C 1 0 0
I've done this using loops at the moment, but it's slow and untidy and I'm sure there's a better way.
aggregate() seems to get halfway there, but it includes NAs in the count:
aggregate(toy_df, list(toy_df$Label), FUN=length)
Any ideas appreciated...
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(toy_df)), group by 'Label', loop through the Subset of Data.table (.SD), and get the sum of non-NA values (!is.na(x)):
library(data.table)
setDT(toy_df)[, lapply(.SD, function(x) sum(!is.na(x))), by = Label]
# Label Y X1 X2
#1: A 1 1 1
#2: B 2 0 2
#3: C 1 0 0
Or with dplyr, using the same methodology:
library(dplyr)
toy_df %>%
  group_by(Label) %>%
  summarise_each(funs(sum(!is.na(.))))
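In recent dplyr releases, summarise_each() and funs() are deprecated. A minimal sketch of the same idea using across(), assuming dplyr 1.0.0 or later and a toy_df rebuilt from the values in the question (the column types are an assumption):

```r
library(dplyr)

# Toy data rebuilt from the question; column types are assumed.
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# For each Label, count the non-NA values in every non-grouping column.
res <- toy_df %>%
  group_by(Label) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
```

Inside a grouped summarise(), everything() leaves out the grouping column, so only Y, X1 and X2 are counted.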
Or a base R option with by and colSums, grouped by the 4th column, on the logical matrix (!is.na(toy_df[-4])):
by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)
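Note that by() returns an object of class "by" (one element per group), not a matrix. If a single table is wanted, one possible sketch is to bind the per-group results with do.call(rbind, ...). The toy_df below is rebuilt from the question, with assumed column types, and counts_mat is a hypothetical name.

```r
# Toy data rebuilt from the question; column types are assumed.
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

# Per-group column sums of the logical "is not NA" matrix.
counts_by <- by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)

# Stack the per-group vectors into one matrix (one row per label).
counts_mat <- do.call(rbind, counts_by)
```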
Or with rowsum, using an approach similar to the one with by:
rowsum(+(!is.na(toy_df[-4])), group=toy_df[,4])
# Y X1 X2
#A 1 1 1
#B 2 0 2
#C 1 0 0
Or in base R
aggregate(toy_df[,1:3], by=list(toy_df$Label), FUN=function(x) { sum(!is.na(x))})