How to get the sums for all groups with the aggregate function?

Tags: r, aggregate

Here's some sample data:

dat="x1 x2 x3 x4 x5
1   C  1 16 NA 16
2   A  1 16 16 NA
3   A  1 16 16 NA
4   A  4 64 64 NA
5   C  4 64 NA 64
6   A  1 16 16 NA
7   A  1 16 16 NA
8   A  1 16 16 NA
9   B  4 64 32 32
10  A  3 48 48 NA
11  B  4 64 32 32
12  B  3 48 32 16"

data <- read.table(text=dat, header=TRUE)
aggregate(cbind(x2,x3,x4,x5)~x1, FUN=sum, data=data)
  x1 x2  x3 x4 x5
1  B 11 176 96 80

Only group B is returned. How do I get the sums for groups A and C of x1 as well? Setting na.action explicitly gives the same result:

 aggregate(.~x1, FUN=sum, data=data, na.action = na.omit)  
   x1 x2  x3 x4 x5
 1  B 11 176 96 80

When I use sqldf:

library("sqldf")
sqldf("select sum(x2),sum(x3),sum(x4),sum(x5) from data group by x1")
  sum(x2) sum(x3) sum(x4) sum(x5)
1      12     192     192    <NA>
2      11     176      96      80
3       5      80      NA      80

Why do I get <NA> in the first row but NA in the third row? What is the difference between them? And where does the <NA> come from? There is no <NA> in the data!

str(data)
'data.frame':   12 obs. of  5 variables:
 $ x1: Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 1 2 1 ...
 $ x2: int  1 1 1 4 4 1 1 1 4 3 ...
 $ x3: int  16 16 16 64 64 16 16 16 64 48 ...
 $ x4: int  NA 16 16 64 NA 16 16 16 32 48 ...
 $ x5: int  16 NA NA NA 64 NA NA NA 32 NA ...

So the sqldf problem remains: why does sum(x4) give NA while sum(x5) gives <NA>?

I can show that the NA values in x4 and x5 are the same kind of NA by replacing all of them with 0:

data[is.na(data)] <- 0     

> data
   x1 x2 x3 x4 x5
1   C  1 16  0 16
2   A  1 16 16  0
3   A  1 16 16  0
4   A  4 64 64  0
5   C  4 64  0 64
6   A  1 16 16  0
7   A  1 16 16  0
8   A  1 16 16  0
9   B  4 64 32 32
10  A  3 48 48  0
11  B  4 64 32 32
12  B  3 48 32 16

So the fact that sqldf treats sum(x4) and sum(x5) differently is strange enough that I suspect a bug in sqldf. The behaviour can be reproduced on other machines. Please reproduce it first before continuing the discussion.

asked Dec 30 '13 11:12

showkey

2 Answers

Here's the data.table way in case you're interested:

require(data.table)
dt <- data.table(data)
dt[, lapply(.SD, sum, na.rm=TRUE), by=x1]
#    x1 x2  x3  x4 x5
# 1:  C  5  80   0 80
# 2:  A 12 192 192  0
# 3:  B 11 176  96 80

If you want sum to return NA for any group that contains an NA, rather than the sum of the non-missing values, just remove the na.rm=TRUE argument.
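To make the effect of na.rm concrete, here is a small base R sketch. The vectors are made up for illustration and are not taken from the question's data; the point is that a group consisting only of NA values sums to 0 once na.rm=TRUE is set, which is why x4 for group C and x5 for group A show 0 in the output above.

```r
# Illustrative vectors (hypothetical, not from the question's data).
partial_na <- c(16L, NA)                  # one missing value
all_na     <- c(NA_integer_, NA_integer_) # nothing but missing values

with_na    <- sum(partial_na)                # NA propagates through sum()
without_na <- sum(partial_na, na.rm = TRUE)  # NA dropped first, so 16
empty_sum  <- sum(all_na, na.rm = TRUE)      # nothing left to add, so 0

print(with_na)
print(without_na)
print(empty_sum)
```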

.SD here is a special data.table symbol that holds the Subset of Data for each group. By default it contains all the columns not in by; here, all except x1. You can check the contents of .SD by doing:

dt[, print(.SD), by=x1]

to get an idea of what .SD is. If you're interested, check ?data.table for other (very useful) special symbols such as .I, .N and .GRP.

answered Oct 25 '22 20:10

Arun

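The formula method for aggregate uses na.action = na.omit by default, which drops every row containing an NA in any of the variables before FUN is applied. Every row of groups A and C has an NA in x4 or x5, so those groups disappear entirely. You need to override that default before the na.rm argument of sum can have any effect. You can do this by setting na.action to NULL or na.pass: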

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = NULL)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80

aggregate(cbind(x2,x3,x4,x5) ~ x1, FUN = sum, data = data, 
          na.rm = TRUE, na.action = na.pass)
#   x1 x2  x3  x4 x5
# 1  A 12 192 192  0
# 2  B 11 176  96 80
# 3  C  5  80   0 80
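As a quick check of why only B shows up with the default settings, the sketch below rebuilds the question's data and looks at which rows survive na.omit. The names kept and kept_groups are just illustrative.

```r
# Rebuild the data from the question.
dat <- "x1 x2 x3 x4 x5
1   C  1 16 NA 16
2   A  1 16 16 NA
3   A  1 16 16 NA
4   A  4 64 64 NA
5   C  4 64 NA 64
6   A  1 16 16 NA
7   A  1 16 16 NA
8   A  1 16 16 NA
9   B  4 64 32 32
10  A  3 48 48 NA
11  B  4 64 32 32
12  B  3 48 32 16"
data <- read.table(text = dat, header = TRUE)

# na.omit keeps only rows without any NA; this is what the formula
# method of aggregate() does by default before calling FUN.
kept <- na.omit(data)
kept_groups <- unique(as.character(kept$x1))

print(kept)         # only the three B rows remain
print(kept_groups)
```

Since no row of A or C is complete, there is nothing left for sum to work on in those groups, regardless of na.rm.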

Regarding sqldf, it seems like the result columns are being cast to different types depending on whether the value for the first group (the first row of the result) is an NA or not. If it is an NA, that column gets cast as character, and a character NA is what prints as <NA>.

Compare:

df1 <- data.frame(id = c(1, 1, 2, 2, 2),
                 A = c(1, 1, NA, NA, NA),
                 B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df1 group by id")
#   sum(A) sum(B)
# 1      2   <NA>
# 2     NA    3.0

df2 <- data.frame(id = c(2, 2, 1, 1, 1),
                  A = c(1, 1, NA, NA, NA),
                  B = c(NA, NA, 1, 1, 1))
sqldf("select sum(A), sum(B) from df2 group by id")
#   sum(A) sum(B)
# 1   <NA>      3
# 2    2.0     NA
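The difference between NA and <NA> in these printouts is only a matter of column type, which can be seen without sqldf. The following base R sketch uses a made-up data frame (df_demo is a hypothetical name) with one numeric and one character column:

```r
# One numeric column and one character column, each with a missing value.
df_demo <- data.frame(num = c(2, NA),
                      chr = c(NA, "3.0"),
                      stringsAsFactors = FALSE)

print(df_demo)               # the NA in chr is displayed as <NA>
col_classes <- sapply(df_demo, class)
print(col_classes)           # num is "numeric", chr is "character"

# Both are ordinary missing values as far as is.na() is concerned.
print(is.na(df_demo$num))
print(is.na(df_demo$chr))
```

So a <NA> in a data frame printout is a hint that the column is character (or factor), which also fits the "3.0" and "2.0" shown in the sqldf results above.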

However, there is an easy workaround: alias each new column with the name of the original column. Perhaps that lets sqldf carry over some of the type information from the original data frame? (I don't really use SQL.)

Example (with the same "df2" created earlier):

sqldf("select sum(A) `A`, sum(B) `B` from df2 group by id")
#    A  B
# 1 NA  3
# 2  2 NA

You can easily use paste to create your select statement:

Aggs <- paste("sum(", names(data)[-1], ") `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192 NA
# 2 11 176  96 80
# 3  5  80  NA 80
str(.Last.value)
# 'data.frame':  3 obs. of  4 variables:
#  $ x2: int  12 11 5
#  $ x3: int  192 176 80
#  $ x4: int  192 96 NA
#  $ x5: int  NA 80 80
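To see what statement is actually sent to sqldf, it can help to build the string separately and print it. This is a sketch of an equivalent construction using sprintf instead of the nested paste; the column names are written out by hand here (an assumption that matches names(data)[-1] for the question's data).

```r
# Column names to aggregate; for the question's data this is names(data)[-1].
cols <- c("x2", "x3", "x4", "x5")

# One "sum(col) `col`" fragment per column, joined with commas.
aggs  <- paste(sprintf("sum(%s) `%s`", cols, cols), collapse = ", ")
query <- paste("select", aggs, "from data group by x1")

cat(query, "\n")
```

The resulting string can then be passed to sqldf(query) in the same way as above.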

A similar approach can be taken if you want NA to be replaced with 0:

Aggs <- paste("sum(ifnull(", names(data)[-1], ", 0)) `", 
              names(data)[-1], "`", sep = "", collapse = ", ")
sqldf(paste("select", Aggs, "from data group by x1"))
#   x2  x3  x4 x5
# 1 12 192 192  0
# 2 11 176  96 80
# 3  5  80   0 80

answered Oct 25 '22 20:10

A5C1D2H2I1M1N2O1R2T1
