I'm trying to create a conditional sum, in order to calculate an average. The idea is that an function (or an apply statement) checks if an certain value is true (for example x > 0), then sums all the values of x that where higher than zero. The last step would be to divide this sum by the number of instances which are greater than zero. Searching for conditonal sum(ming) didn't gave me usable information.
This is a part of the data:
> tmpData
Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT -206
16 KFT 107
Of the functions that I've tried, the following holds the most promise:
avgProfit <- function(x) {
ifelse(x > 0,
sum(x) / length(which(x > 0)),
return(0))
}
However, the output of this function is 0:
> with(tmpData, tapply(TradeResult.Currency., Instrument, avgProfit))
JPM KFT
0 0
> avgProfit(tmpData$TradeResult.Currency.)
[1] 0
> x
[1] 1 1 2 1 2 3 3 3 4 4
(The values should be 225 for JPM (total of 900 divided by 4 instances which where greater than zero) and 116 for KFT)
Even though I calculate the sum of x (which, if I understand correctly, should be the sum of the individual values in the data.frame) in the function, the output of the variable 'x' puzzles me. I can't find where these 1,2,3 and fours are coming from.
How can I calculate an conditional sum? Besides, do I need to use an function or am I making it too complicated (perhaps there is an build-in R function for this which I overlooked?)
Any thoughts are more than welcome,
Regards,
probably it is easy way to drop unused rows first and then aggregate them:
aggregate(TradeResult.Currency.~Instrument,
mean,
data=subset(tmpData,TradeResult.Currency.>0))
You were almost there, I think ifelse was the wrong direction since you want the mean, not an element-wise comparison.
You will want to consider whether you might encounter missing values, so you can handle that properly.
tmpData <- read.table(textConnection(" Instrument TradeResult.Currency.
1 JPM -3
2 JPM 264
3 JPM 284
4 JPM 69
5 JPM 283
11 KFT -8
12 KFT -48
13 KFT 125
14 KFT -150
15 KFT -206
16 KFT 107"))
with(tmpData, tapply(TradeResult.Currency., Instrument, function(x) mean(x[x > 0])))
JPM KFT 225 116
There have been quite a few of these data aggregation / conditional analysis question lately. It's always interesting to see the different approaches used. I thought I would add in an approach using plyr
. I like the plyr
functions because they provide a standard syntax for all functions and allow you to specify structure of both the input and the output. Here we will use ddply
since we are passing in a data.frame
and want a data.frame
back out on the other side. We use the summarise
function to calculate the mean for each instrument where the values are positive.
library(plyr)
ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))
To follow up on @Joris performance comparison, ddply
seems to perform as well if not better than other approaches:
> tmpData <- data.frame(
+ instrument = rep(c("JPM","KFT"),each=10e6),
+ TCurr = runif(20e6,-10,100)
+ )
>
> system.time(
+ ddply(tmpData, .(instrument), summarise, avgProfit = mean(TCurr[TCurr > 0]))
+ )
user system elapsed
4.43 0.89 5.32
>
> avgProfit <- function(x) { mean(x[x>0])}
>
> system.time(
+ with(tmpData,tapply(TCurr,instrument,avgProfit))
+ )
user system elapsed
7.88 0.47 8.36
>
> system.time(
+ aggregate(TCurr~instrument,mean,data=subset(tmpData,TCurr>0))
+ )
user system elapsed
28.29 2.35 30.65
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With