Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get all the sum in aggregate function?

Tags:

r

aggregate

Here's some sample data:

dat="x1 x2 x3 x4 x5
1   C  1 16 NA 16
2   A  1 16 16 NA
3   A  1 16 16 NA
4   A  4 64 64 NA
5   C  4 64 NA 64
6   A  1 16 16 NA
7   A  1 16 16 NA
8   A  1 16 16 NA
9   B  4 64 32 32
10  A  3 48 48 NA
11  B  4 64 32 32
12  B  3 48 32 16"

data<-read.table(text=dat,header=TRUE)   
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)   
 x1 x2  x3 x4 x5   
1  B 11 176 96 8   

How do I get the sum of A and C as well in x1?

 aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)  
   x1 x2  x3 x4 x5
 1  B 11 176 96 80 

When I use sqldf:

library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
  sum(x2) sum(x3) sum(x4) sum(x5)
1      12     192     192    <NA>
2      11     176      96      80
3       5      80      NA      80

Why do I get <NA> in the first line, but NA in the third line ? What is the differences between them? Why do I get the <NA>? there is no <NA> in data!

str(data)
'data.frame':   12 obs. of  5 variables:
 $ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
 $ x2: int  1 1 1 4 4 1 1 1 4 3 ...
 $ x3: int  16 16 16 64 64 16 16 16 64 48 ...
 $ x4: int  NA 16 16 64 NA 16 16 16 32 48 ...
 $ x5: int  16 NA NA NA 64 NA NA NA 32 NA ...

The sqldf problem remains here, why sum(x4) gets NA, on the contrary sum(x5) gets <NA>?

I can prove that all NA both in x4 and x5 is the same this way:

data[is.na(data)] <- 0     

> data
   x1 x2 x3 x4 x5
1   C  1 16  0 16
2   A  1 16 16  0
3   A  1 16 16  0
4   A  4 64 64  0
5   C  4 64  0 64
6   A  1 16 16  0
7   A  1 16 16  0
8   A  1 16 16  0
9   B  4 64 32 32
10  A  3 48 48  0
11  B  4 64 32 32
12  B  3 48 32 16

So the fact that sqldf treats sum(x4) and sum(x5) differently is so strange that I think there is a logical mess in sqldf. It can be reproduced in other pc. Please do first and then have the discussion go on.

like image 596
showkey Avatar asked Dec 30 '13 11:12

showkey


People also ask

How do I get the sum of all values in a column in SQL?

If you need to add a group of numbers in your table you can use the SUM function in SQL. This is the basic syntax: SELECT SUM(column_name) FROM table_name; The SELECT statement in SQL tells the computer to get data from the table.

Which aggregate function is used to find the sum of columns?

SUM. The aggregate function SUM() is used to calculate the sum of all the values of the select column. It returns the sum of values in a set.

What is sum aggregate function?

The Oracle SUM() function is an aggregate function that returns the sum of all or distinct values in a set of values. The Oracle SUM() function accepts a clause which can be either DISTINCT or ALL . The DISTINCT clause forces the SUM() function to calculate the sum of unique values.


2 Answers

Here's the data.table way in case you're interested:

require(data.table)
dt <- data.table(data)
dt[, lapply(.SD, sum, na.rm=TRUE), by=x1]
#    x1 x2  x3  x4 x5
# 1:  C  5  80   0 80
# 2:  A 12 192 192  0
# 3:  B 11 176  96 80

If you want sum to return NA instead of the sum after removing NA's, just remove the na.rm=TRUE argument.

.SD here is an internal data.table variable that constructs, by default, all the columns not in by - here all except x1. You can check the contents of .SD by doing:

dt[, print(.SD), by=x1]

to get an idea of what's .SD. If you're interested check ?data.table for other internal (and very useful) special variables like .I, .N, .GRP etc..

like image 166
Arun Avatar answered Oct 25 '22 20:10

Arun


Because of how the formula method for aggregate handles NA values by default, you need to override that before using the na.rm argument from sum. You can do this by setting na.action to NULL or na.pass:

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = NULL)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = na.pass)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80

Regarding sqldf, it seems like the columns are being cast to different types depending on whether the item in the first row of the first grouping variable is an NA or not. If it is an NA, that column gets cast as character.

Compare:

df1 <- data.frame(id = c(1, 1, 2, 2, 2),
                 A = c(1, 1, NA, NA, NA),
                 B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df1 group by id")
#   sum(A) sum(B)
# 1      2   <NA>
# 2     NA    3.0

df2 <- data.frame(id = c(2, 2, 1, 1, 1),
                  A = c(1, 1, NA, NA, NA),
                  B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df2 group by id")
#   sum(A) sum(B)
# 1   <NA>      3
# 2    2.0     NA

However, there is an easy workaround: reassign the original name to the new columns being created. Perhaps that let's SQLite inherit some of the information from the previous database? (I don't really use SQL.)

Example (with the same "df2" created earlier):

sqldf("select sum(A) `A`, sum(B) `B` from df2 group by id")
#    A  B
# 1 NA  3
# 2  2 NA

You can easily use paste to create your select statement:

Aggs <- paste("sum(", names(data)[-1], ") `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192 NA
# 2 11 176  96 80
# 3  5  80  NA 80
str(.Last.value)
# 'data.frame':  3 obs. of  4 variables:
#  $ x2: int  12 11 5
#  $ x3: int  192 176 80
#  $ x4: int  192 96 NA
#  $ x5: int  NA 80 80

A similar approach can be taken if you want NA to be replaced with 0:

Aggs <- paste("sum(ifnull(", names(data)[-1], ", 0)) `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192  0
# 2 11 176  96 80
# 3  5  80   0 80
like image 5
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 25 '22 20:10

A5C1D2H2I1M1N2O1R2T1