Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum object in a column between an interval defined by another column

Tags:

r

I have a data frame like this:

  df
       A B
 [1,]  1 4
 [2,]  2 3
 [3,]  3 5
 [4,]  4 7
 [5,]  5 5
 [6,]  6 7
 [7,]  7 4
 [8,]  8 7
 [9,]  9 3
[10,] 10 7

I need to sum the numbers in column B that fall between a certain interval defined by column A. For example sum the value in B between A≥1 and A<3.

I'm trying to use this:

> sum(which(df$B[df$A>=1] & df$B[df$A<3]))

but it does not give me what I want.

like image 295
matteo Avatar asked May 05 '11 11:05

matteo


2 Answers

You're almost there.

with(my.df, sum(my.df[A >= 1 & A < 3, "B"]))

EDIT

Chase challenged me to explain away the code at hand. When reading R code, it's best to read from inwards out. Here we go.

my.df is a data.frame (think Excel sheet) with columns A and B. Square brackets [] are used to subset anything from this object like so: [rows, columns]. For example, [1, ] would return the entire first row, and if you add a column number (or column name), you get value in first row in that column (e.g. [1, 2], where you would get value in the first row of the second column). We will now subset rows in my.df with A >= 1 & A < 3. What we're saying here is we want to see those rows that have values in A bigger or equal to 1 and smaller than 3. This will give us all rows that satisfy this criterion. If we add , "B" it means we want to select column B. Since we already subsetted the result to contain only rows that fit the above criterion, by entering column name B we get values only in the column. Once you have those values from column B you sum them using sum(). Use of with function is there to make our life easier. If we hadn't used that, we would be forced to call columns by their full name my.df$A and my.df$B.

like image 111
Roman Luštrik Avatar answered Nov 15 '22 10:11

Roman Luštrik


Another option using the select parameter from subset achieves identical results. : sum(subset(dat, A < 3 & A >= 1, select = "B"))

Roman's answer is faster, but arguably not as easy to deparse to the non R-expert. Depending on your audience (i.e. sharing script with colleagues) and performance constraints, one may be preferred to the other.

> system.time(replicate(100,with(dat, sum(dat[A >= 1 & A < 3, "B"]))))
   user  system elapsed 
  0.008   0.000   0.008 
> system.time(replicate(100, sum(subset(dat, A < 3 & A >= 1, select = "B"))))
   user  system elapsed 
  0.041   0.000   0.041 
like image 36
Chase Avatar answered Nov 15 '22 10:11

Chase