I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr.
When I use the name of the column directly with ddply and cumsum I have no problems:
require(plyr)
df <- data.frame(date = seq(as.Date("2007/1/1"),
                            by = "month",
                            length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(as.Date(df$date), format="%Y"))
df <- ddply(df, .(year), transform,
            cum_sales = (cumsum(as.numeric(sales))))
This is all well and good but the ultimate aim is to be able to pass a column name to this function. When I try to use a variable in place of the column name, it doesn't work as I expected:
mycol <- "sales"
df[mycol]
df <- ddply(df, .(year), transform,
            cum_value2 = cumsum(as.numeric(df[mycol])))
I thought I knew how to access columns by name. This worries me because it suggests that I have failed to understand something basic about indexing and extraction. I would have thought that referring to columns by name in this way would be a common need.
I have two questions.
TIA
The arguments to ddply are expressions that are evaluated in the context of each piece the original data frame is split into. Your df[mycol] addresses the whole data frame, so you cannot pass it as-is. (By the way, why do you need the as.numeric(as.character()) stuff? It is completely useless here.)
The easiest way is to write your own function that does everything inside and pass the column name down, e.g.
df <- ddply(df,
            .(year),
            .fun = function(x, colname) transform(x, cum_sales = cumsum(x[,colname])),
            colname = "sales")
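Since the stated goal is a function that takes a data frame and a column name, the call above could be wrapped roughly as follows. This is only a sketch: the helper name add_running_total and its newname argument are made up for illustration, and it assumes the data frame already has a year column.

```r
library(plyr)

# Hypothetical helper (not part of plyr): adds a per-year running total of
# the column named in `colname`. Assumes `data` already has a `year` column.
add_running_total <- function(data, colname,
                              newname = paste0("cum_", colname)) {
  ddply(data, .(year), function(chunk) {
    # chunk[[colname]] pulls the column out of this year's rows as a vector
    chunk[[newname]] <- cumsum(chunk[[colname]])
    chunk
  })
}

set.seed(1)
df <- data.frame(date  = seq(as.Date("2007-01-01"),
                             by = "month",
                             length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(df$date, "%Y"))

result <- add_running_total(df, "sales")
head(result)
```

Using [[ ]] rather than [ ] matters here: chunk[colname] would return a one-column data frame, while chunk[[colname]] returns the underlying vector that cumsum expects.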
The problem is that ddply expects its last arguments to be expressions that will be evaluated on chunks of the data.frame (one chunk per year, in your example). If you use df[mycol], you have the whole data.frame, not the annual chunks.
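To make the chunk-versus-whole distinction concrete, here is a small illustrative check. It uses a toy data frame invented for this example (five years of twelve rows each) and compares what the per-year piece sees with what a reference to the outer df sees.

```r
library(plyr)

# Toy data, made up for illustration: 5 years x 12 months
df <- data.frame(year  = rep(2007:2011, each = 12),
                 sales = seq_len(60))
mycol <- "sales"

sizes <- ddply(df, .(year), function(chunk) {
  data.frame(chunk_length = length(chunk[[mycol]]),  # rows for this year only
             whole_length = length(df[[mycol]]))     # every row, whatever the year
})
sizes
```

Each row of sizes should report a chunk_length of 12 next to a whole_length of 60, which is why a reference to df inside the ddply call does not line up with the piece being processed.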
The following works, but is not very elegant: I build the expression as a string, and then convert it with eval(parse(...)).
ddply( df, .(year), transform,
  cum_value2 = eval(parse( text =
    sprintf( "cumsum(as.numeric(as.character(%s)))", mycol )
  ))
)
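If building code from strings feels fragile, one possible alternative is to skip ddply for this step and use base R's ave(), which applies a function within groups and returns a vector in the original row order. This is a different technique from the answer above, shown only as a sketch; the column name cum_value2 is reused from the question.

```r
set.seed(1)
df <- data.frame(date  = seq(as.Date("2007-01-01"),
                             by = "month",
                             length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(df$date, "%Y"))

mycol <- "sales"

# df[[mycol]] extracts the column by name as a vector;
# ave() runs cumsum separately within each year
df$cum_value2 <- ave(df[[mycol]], df$year, FUN = cumsum)
head(df, 13)
```

Because the column is looked up with df[[mycol]], no expression has to be parsed at run time, and the same line works unchanged inside a function that receives the column name as an argument.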