I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an exaple, how could I calculate the year-over-year (yoy) change in "at batts" by league and team (lg and team)?
library(plyr)
df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
After doing a little aggregating to simplify the data fame, the data looks like this:
head(df1)
year lg team ab
1884 UA ALT 108
1997 AL ANA 1703
1998 AL ANA 1502
1999 AL ANA 660
2000 AL ANA 85
2001 AL ANA 219
I would like to end up with someting like this
year lg team ab yoy
1997 AL ANA 1703 NA
1998 AL ANA 1502 -201
1999 AL ANA 660 -842
2000 AL ANA 85 -575
2001 AL ANA 219 134
I started by writign the following function, which I think is wrong:
yoy.func <- function(df) {
lag <- c(df$ab[-1],0)
cur <- c(df$ab[1],0)
df$yoy <- cur -lag
return(df)
}
Without sucess, I used the following code to attempt return the yoy change.
df2 <- ddply(df1, .(lg, team), yoy.func)
Any guidance woud be appreciated.
Thanks
I know you asked for a "plyr"-specific solution, but for the sake of sharing, here is an alternative approach in base R. In my opinion, I find the base R approach just as "readable". And, at least in this particular case, it's a lot faster!
output <- within(df1, {
yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
})
head(output)
# year lg team ab yoy
# 1 1884 UA ALT 108 NA
# 2 1997 AL ANA 1703 NA
# 3 1998 AL ANA 1502 -201
# 4 1999 AL ANA 660 -842
# 5 2000 AL ANA 85 -575
# 6 2001 AL ANA 219 134
library(rbenchmark)
benchmark(DDPLY = {
ddply(df1, .(team, lg), mutate ,
yoy = c(NA, diff(ab)))
}, WITHIN = {
within(df1, {
yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
})
}, columns = c("test", "replications", "elapsed",
"relative", "user.self"))
# test replications elapsed relative user.self
# 1 DDPLY 100 10.675 4.974 10.609
# 2 WITHIN 100 2.146 1.000 2.128
data.table
If your data are very large, check out data.table
. Even with this example, you'll find a good speedup in relative terms. Plus the syntax is super compact and, in my opinion, easily readable.
library(plyr)
df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
library(data.table)
DT <- data.table(df1)
DT
# year lg team ab
# 1: 1884 UA ALT 108
# 2: 1997 AL ANA 1703
# 3: 1998 AL ANA 1502
# 4: 1999 AL ANA 660
# 5: 2000 AL ANA 85
# ---
# 2523: 1895 NL WSN 839
# 2524: 1896 NL WSN 982
# 2525: 1897 NL WSN 1426
# 2526: 1898 NL WSN 1736
# 2527: 1899 NL WSN 787
Now, look at this concise solution:
DT[, yoy := c(NA, diff(ab)), by = "team,lg"]
DT
# year lg team ab yoy
# 1: 1884 UA ALT 108 NA
# 2: 1997 AL ANA 1703 NA
# 3: 1998 AL ANA 1502 -201
# 4: 1999 AL ANA 660 -842
# 5: 2000 AL ANA 85 -575
# ---
# 2523: 1895 NL WSN 839 290
# 2524: 1896 NL WSN 982 143
# 2525: 1897 NL WSN 1426 444
# 2526: 1898 NL WSN 1736 310
# 2527: 1899 NL WSN 787 -949
How about using diff():
df <- read.table(header = TRUE, text = ' year lg team ab
1884 UA ALT 108
1997 AL ANA 1703
1998 AL ANA 1502
1999 AL ANA 660
2000 AL ANA 85
2001 AL ANA 219')
require(plyr)
ddply(df, .(team, lg), mutate ,
yoy = c(NA, diff(ab)))
# year lg team ab yoy
1 1884 UA ALT 108 NA
2 1997 AL ANA 1703 NA
3 1998 AL ANA 1502 -201
4 1999 AL ANA 660 -842
5 2000 AL ANA 85 -575
6 2001 AL ANA 219 134
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With