R group by, counting non-NA values

Tags:

r

na

I have a data frame with NAs scattered through it:

toy_df
# Y  X1 X2 Label
# 5  3  3  A
# 3  NA 2  B
# 3  NA NA C
# 2  NA 6  B
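
To make the example reproducible, the data frame shown above could be built like this. This is only a sketch: the column types (numeric values, character labels) are an assumption, since just the printed values are given.

```r
# Assumed construction of toy_df; only the printed values come from the question
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE  # keep Label as character (assumption)
)
```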

I want to group this by the Label field and count how many non-NA values each variable has for each label.

desired output:
# Label Y  X1 X2
# A     1  1  1
# B     2  0  2
# C     1  0  0

I've done this using loops at the moment, but it's slow and untidy and I'm sure there's a better way.

aggregate() seems to get halfway there, but it includes NAs in the count.

aggregate(toy_df, list(toy_df$Label), FUN=length)

Any ideas appreciated...

249

asked Dec 14 '16 19:12

tea_pea

2 Answers

We can use data.table. Convert the data.frame to a data.table by reference with setDT(toy_df), group by Label, loop over the columns of the Subset of Data (.SD) with lapply, and sum the logical vector !is.na(x) to count the non-NA values in each column.

library(data.table)
setDT(toy_df)[, lapply(.SD, function(x) sum(!is.na(x))), by = Label]
#   Label Y X1 X2
#1:     A 1  1  1
#2:     B 2  0  2
#3:     C 1  0  0
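
The counting idiom itself can be checked on a plain vector. A minimal sketch (the vector x is made up for illustration) of why FUN = length includes NAs while sum(!is.na(x)) does not:

```r
# Hypothetical vector with two missing values
x <- c(3, NA, 2, NA)

n_total  <- length(x)       # counts every element, NA or not
n_non_na <- sum(!is.na(x))  # !is.na(x) is logical; TRUE sums as 1, FALSE as 0
```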

Or with dplyr, using the same approach:

library(dplyr)
toy_df %>% 
      group_by(Label) %>%
      summarise_each(funs(sum(!is.na(.))))
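
In newer dplyr releases, summarise_each() and funs() are deprecated. A sketch of the same idea with across(), which needs dplyr 1.0.0 or later; the toy_df construction here is an assumed reconstruction of the question's data:

```r
library(dplyr)

# Assumed reconstruction of the question's data
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B")
)

# across(everything()) covers all non-grouping columns
counts <- toy_df %>%
  group_by(Label) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
```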

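Or a base R option: apply colSums with by to the logical matrix !is.na(toy_df[-4]), grouped by the 4th column (Label). Run this on a plain data.frame; once toy_df has been converted with setDT, toy_df[-4] drops the 4th row rather than the 4th column.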

by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)
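
by() returns a list-like object of class "by" rather than a table. If a single matrix is wanted, one possible follow-up is to bind the per-group vectors together with do.call(rbind, ...). A sketch, using an assumed reconstruction of the question's data:

```r
# Assumed reconstruction of the question's data (plain data.frame, not a data.table)
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B")
)

# One named vector of non-NA counts per label
counts_by <- by(!is.na(toy_df[-4]), toy_df[4], FUN = colSums)

# Stack the per-group vectors into a matrix, one row per label
counts_mat <- do.call(rbind, counts_by)
```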

Or use rowsum on the same logical matrix. The unary + converts the matrix from logical to integer, which is needed because rowsum only accepts numeric input.

rowsum(+(!is.na(toy_df[-4])), group=toy_df[,4])
#  Y X1 X2
#A 1  1  1
#B 2  0  2
#C 1  0  0

149

answered Oct 21 '22 09:10

akrun

Or in base R

aggregate(toy_df[,1:3], by=list(toy_df$Label), FUN=function(x) { sum(!is.na(x))})
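
With an unnamed by list, the grouping column in the result is called Group.1. Two possible variations that keep the name Label are sketched below; the helper name count_non_na and the toy_df construction are assumptions for illustration. In the formula version, na.action = na.pass matters: the default na.omit would drop every row containing an NA before counting.

```r
# Assumed reconstruction of the question's data
toy_df <- data.frame(
  Y     = c(5, 3, 3, 2),
  X1    = c(3, NA, NA, NA),
  X2    = c(3, 2, NA, 6),
  Label = c("A", "B", "C", "B")
)

# Hypothetical helper: number of non-NA entries in a vector
count_non_na <- function(x) sum(!is.na(x))

# Variation 1: name the grouping list so the output column is "Label"
res_named <- aggregate(toy_df[1:3], by = list(Label = toy_df$Label),
                       FUN = count_non_na)

# Variation 2: formula interface; na.pass keeps rows that contain NAs
res_formula <- aggregate(. ~ Label, data = toy_df, FUN = count_non_na,
                         na.action = na.pass)
```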

answered Oct 21 '22 09:10

G5W
