Here's some sample data:
dat="x1 x2 x3 x4 x5
1 C 1 16 NA 16
2 A 1 16 16 NA
3 A 1 16 16 NA
4 A 4 64 64 NA
5 C 4 64 NA 64
6 A 1 16 16 NA
7 A 1 16 16 NA
8 A 1 16 16 NA
9 B 4 64 32 32
10 A 3 48 48 NA
11 B 4 64 32 32
12 B 3 48 32 16"
data<-read.table(text=dat,header=TRUE)
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)
x1 x2 x3 x4 x5
1 B 11 176 96 8
How do I get the sum of A
and C
as well in x1
?
aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)
x1 x2 x3 x4 x5
1 B 11 176 96 80
When I use sqldf
:
library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
sum(x2) sum(x3) sum(x4) sum(x5)
1 12 192 192 <NA>
2 11 176 96 80
3 5 80 NA 80
Why do I get <NA>
in the first line, but NA
in the third line ?
What is the differences between them? Why do I get the <NA>
? there is no <NA>
in data!
str(data)
'data.frame': 12 obs. of 5 variables:
$ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
$ x2: int 1 1 1 4 4 1 1 1 4 3 ...
$ x3: int 16 16 16 64 64 16 16 16 64 48 ...
$ x4: int NA 16 16 64 NA 16 16 16 32 48 ...
$ x5: int 16 NA NA NA 64 NA NA NA 32 NA ...
The sqldf problem remains here, why sum(x4)
gets NA
, on the contrary sum(x5)
gets <NA>
?
I can prove that all NA
both in x4 and x5 is the same this way:
data[is.na(data)] <- 0
> data
x1 x2 x3 x4 x5
1 C 1 16 0 16
2 A 1 16 16 0
3 A 1 16 16 0
4 A 4 64 64 0
5 C 4 64 0 64
6 A 1 16 16 0
7 A 1 16 16 0
8 A 1 16 16 0
9 B 4 64 32 32
10 A 3 48 48 0
11 B 4 64 32 32
12 B 3 48 32 16
So the fact that sqldf treats sum(x4)
and sum(x5)
differently is so strange that I think there is a logical mess in sqldf. It can be reproduced in other pc. Please do first and then have the discussion go on.
If you need to add a group of numbers in your table you can use the SUM function in SQL. This is the basic syntax: SELECT SUM(column_name) FROM table_name; The SELECT statement in SQL tells the computer to get data from the table.
SUM. The aggregate function SUM() is used to calculate the sum of all the values of the select column. It returns the sum of values in a set.
The Oracle SUM() function is an aggregate function that returns the sum of all or distinct values in a set of values. The Oracle SUM() function accepts a clause which can be either DISTINCT or ALL . The DISTINCT clause forces the SUM() function to calculate the sum of unique values.
Here's the data.table
way in case you're interested:
require(data.table)
dt <- data.table(data)
dt[, lapply(.SD, sum, na.rm=TRUE), by=x1]
# x1 x2 x3 x4 x5
# 1: C 5 80 0 80
# 2: A 12 192 192 0
# 3: B 11 176 96 80
If you want sum
to return NA
instead of the sum after removing NA's, just remove the na.rm=TRUE
argument.
.SD
here is an internal data.table
variable that constructs, by default, all the columns not in by
- here all except x1
. You can check the contents of .SD
by doing:
dt[, print(.SD), by=x1]
to get an idea of what's .SD
. If you're interested check ?data.table
for other internal (and very useful) special variables like .I
, .N
, .GRP
etc..
Because of how the formula method for aggregate
handles NA
values by default, you need to override that before using the na.rm
argument from sum
. You can do this by setting na.action
to NULL
or na.pass
:
aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data,
na.rm = TRUE, na.action = NULL)
# x1 x2 x3 x4 x5
# 1 A 12 192 192 0
# 2 B 11 176 96 80
# 3 C 5 80 0 80
aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data,
na.rm = TRUE, na.action = na.pass)
# x1 x2 x3 x4 x5
# 1 A 12 192 192 0
# 2 B 11 176 96 80
# 3 C 5 80 0 80
Regarding sqldf
, it seems like the columns are being cast to different types depending on whether the item in the first row of the first grouping variable is an NA
or not. If it is an NA
, that column gets cast as character
.
Compare:
df1 <- data.frame(id = c(1, 1, 2, 2, 2),
A = c(1, 1, NA, NA, NA),
B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df1 group by id")
# sum(A) sum(B)
# 1 2 <NA>
# 2 NA 3.0
df2 <- data.frame(id = c(2, 2, 1, 1, 1),
A = c(1, 1, NA, NA, NA),
B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df2 group by id")
# sum(A) sum(B)
# 1 <NA> 3
# 2 2.0 NA
However, there is an easy workaround: reassign the original name to the new columns being created. Perhaps that let's SQLite inherit some of the information from the previous database? (I don't really use SQL.)
Example (with the same "df2" created earlier):
sqldf("select sum(A) `A`, sum(B) `B` from df2 group by id")
# A B
# 1 NA 3
# 2 2 NA
You can easily use paste
to create your select
statement:
Aggs <- paste("sum(", names(data)[-1], ") `",
names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
# x2 x3 x4 x5
# 1 12 192 192 NA
# 2 11 176 96 80
# 3 5 80 NA 80
str(.Last.value)
# 'data.frame': 3 obs. of 4 variables:
# $ x2: int 12 11 5
# $ x3: int 192 176 80
# $ x4: int 192 96 NA
# $ x5: int NA 80 80
A similar approach can be taken if you want NA
to be replaced with 0
:
Aggs <- paste("sum(ifnull(", names(data)[-1], ", 0)) `",
names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
# x2 x3 x4 x5
# 1 12 192 192 0
# 2 11 176 96 80
# 3 5 80 0 80
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With