x
Team Date Score
A 1-1-2012 80
A 1-2-2012 90
A 1-3-2012 50
A 1-4-2012 40
B 1-1-2012 100
B 1-2-2012 60
B 1-3-2012 30
B 1-4-2012 70
etc
I need to and can turn this data frame to wide data frame one row for each team with all the observations and dates as the heading:
xx
Team 1-1-2012 1-2-2012 1-3-2012 1-4-2012
A 80 90 50 40
B 100 60 30 70
I need to calculate the mean and sd for each row, which I can do:
xx
Team 1-1-2012 1-2-2012 1-3-2012 1-4-2012 mean sd
A 80 90 50 40 75 20
B 100 60 30 70 55 10
Considering I have thousands of row in data frame xx. I would like to do calculation on each cell as this:
if abs(xx-Mean) > 3*SD, create a counter column name and increment the value. The idea is that compare each observation against the mean and sd, if each observation for a given team matches this - abs(xx-Mean) > 3*SD, increment the counter. After checking each cell, I would like to look at each counter for each team and get the top ten high team that has the highest counter value. Basically I am trying to detect the most outliers. Once I get the top 10 team names, I would like to graph their time series data on data frame x.
I hope I am not making this more complicated than it should be. Not sure, R already has function to do calculations on each cell. Any ideas how to accomplish this is appreciated?
I would leave your data in long format and use plyr
, data.table
, or any of the other split-apply-combine tools to compute your statistics. Here's how I'd use plyr
for the task:
#Your data
dat <- read.table(text = "Team Date Score
A 1-1-2012 80
A 1-2-2012 90
A 1-3-2012 50
A 1-4-2012 40
B 1-1-2012 100
B 1-2-2012 60
B 1-3-2012 30
B 1-4-2012 70", header = TRUE)
library(plyr)
#Compute mean and sd by team
dat <- ddply(dat, .(Team), transform, mean = mean(Score), sd = sd(Score))
#Your outlier threshold
dat <- transform(dat, outlier = abs(Score - mean) > 3*sd)
#Cumulative sum by team
dat <- ddply(dat, .(Team), transform, cumsumOutlier = cumsum(outlier))
Gives you this as an output (which does not match your example, but presumably your real data does):
Team Date Score mean sd outlier cumsumOutlier
1 A 1-1-2012 80 65 23.80476 FALSE 0
2 A 1-2-2012 90 65 23.80476 FALSE 0
3 A 1-3-2012 50 65 23.80476 FALSE 0
4 A 1-4-2012 40 65 23.80476 FALSE 0
5 B 1-1-2012 100 65 28.86751 FALSE 0
6 B 1-2-2012 60 65 28.86751 FALSE 0
7 B 1-3-2012 30 65 28.86751 FALSE 0
8 B 1-4-2012 70 65 28.86751 FALSE 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With