R: Sum Complete.cases in one column grouped by (or sorted by) a value in another column

I'm using the airquality data set available in R, and attempting to count the number of rows within the data that do not contain any NAs, while aggregating by Month.

The data looks like this:

head(airquality)
#   Ozone Solar.R Wind Temp Month Day
# 1    41     190  7.4   67     5   1
# 2    36     118  8.0   72     5   2
# 3    12     149 12.6   74     5   3
# 4    18     313 11.5   62     5   4
# 5    NA      NA 14.3   56     5   5
# 6    28      NA 14.9   66     5   6

As you can see, I have NAs in columns Ozone and Solar.R. I used the function complete.cases as follows:

x  <- airquality[,1] # for the Ozone
y  <- airquality[,2] # for the Solar.R
ok <- complete.cases(x,y)

And then to check:

nrow(airquality)
# [1] 153
sum(!ok)
# [1] 42
sum(ok)
# [1] 111

which is great.

But now I'd like to break that count down by Month (column 5), and this is where I'm running into problems: trying to aggregate or sort by the value in column 5 (Month).

I was able to get this to run, although it doesn't group by Month yet (I just wanted to make sure I could get the function to run):

aggregate(x = sum(complete.cases(airquality)), by= list(nrow(airquality)), FUN = sum)
#   Group.1   x
# 1     153 111

OK... so to sort it out, I am trying to use the by argument of the aggregate function to group the count. I tried several ways of referring to column 5 of airquality:

- airquality[,5]
- airquality[,"Month"]

I get these errors:

aggregate(x = sum(complete.cases(airquality)), by= list(airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) : 
#   arguments must have same length

aggregate(x = sum(complete.cases(airquality)), by= 
      list(sum(complete.cases(airquality)),airquality[,5]), FUN = sum)
# Error in aggregate.data.frame(as.data.frame(x), ...) : 
#   arguments must have same length

I looked further into ?aggregate, namely the by argument:

by - a list of grouping elements, each as long as the variables in the data frame x. The elements are coerced to factors before use.

I looked up ?factor, but can't see how to apply it (or whether it is even necessary in this case). I also tried putting break = into it, but that didn't work.

None of the "Questions that may already have your answer" seem to apply, many of which give solutions in C# and SQL.

Edit: Expected outcome

Count  Month
  24       5
   9       6
  26       7
  23       8
  29       9
asked Dec 15 '22 22:12 by Paul


1 Answer

As an addition to the other answers, you could do it with dplyr.

library(dplyr)

# Count incomplete and complete rows (Ozone, Solar.R) within each Month
airquality %>%
  group_by(Month) %>%
  summarize(incomplete = sum(!complete.cases(Ozone, Solar.R)),
            complete = sum(complete.cases(Ozone, Solar.R)))

#  Month incomplete complete
#1     5          7       24
#2     6         21        9
#3     7          5       26
#4     8          8       23
#5     9          1       29
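
For comparison, here is a base R sketch of the same grouping with aggregate(). The "arguments must have same length" error in the question appears to come from a length mismatch: sum(complete.cases(airquality)) has already collapsed to a single number, while the Month column has 153 elements, and aggregate() needs x and every element of by to have one entry per row. Passing the per-row logical vector instead, and letting FUN = sum count the TRUEs within each group, avoids that. The names ok, counts and Count are chosen here for illustration only.

```r
# One TRUE/FALSE per row: TRUE where neither Ozone nor Solar.R is NA
ok <- complete.cases(airquality[, c("Ozone", "Solar.R")])

# The attempt in the question collapsed this to one number before grouping
length(sum(ok))          # 1
length(airquality$Month) # 153

# x and by both have 153 elements, so aggregate() can split ok by Month
# and sum the TRUEs within each group
counts <- aggregate(x = data.frame(Count = ok),
                    by = list(Month = airquality$Month),
                    FUN = sum)
counts
#   Month Count
# 1     5    24
# 2     6     9
# 3     7    26
# 4     8    23
# 5     9    29
```

Naming the elements of the by list (Month = ...) gives the grouping column a readable name in place of the default Group.1.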
answered May 24 '23 07:05 by talat