
How can I use variable names to refer to data frame columns with ddply?

Tags: r, plyr

I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr.

When I use the name of the column directly with ddply and cumsum I have no problems:

require(plyr)
df <- data.frame(date = seq(as.Date("2007/1/1"),
                     by = "month",
                     length.out = 60),
                 sales = runif(60, min = 700, max = 1200))

df$year <- as.numeric(format(as.Date(df$date), format="%Y"))
df <- ddply(df, .(year), transform,
            cum_sales = (cumsum(as.numeric(sales))))

This is all well and good but the ultimate aim is to be able to pass a column name to this function. When I try to use a variable in place of the column name, it doesn't work as I expected:

mycol <- "sales"
df[mycol]

df <- ddply(df, .(year), transform,
            cum_value2 = cumsum(as.numeric(df[mycol])))

I thought I knew how to access columns by name. This worries me because it suggests that I have failed to understand something basic about indexing and extraction. I would have thought that referring to columns by name in this way would be a common need.

I have two questions.

  1. What am I doing wrong i.e. what have I misunderstood?
  2. Is there a better way of going about this, bearing in mind that the names of the columns will not be known beforehand by the function?

TIA

SlowLearner asked Jan 15 '12 10:01


2 Answers

The arguments that ddply passes on to transform are expressions, which are evaluated in the context of each piece the original data frame is split into. Your df[mycol] addresses the whole data frame, so you cannot pass it as-is. (By the way, why do you need the as.numeric() conversion? sales is already numeric, so it has no effect here.)
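
To make the distinction concrete, here is a minimal base-R sketch. The small data frame and the use of ave() are illustrative assumptions, not part of the original question. It shows that df[mycol] is a one-column data frame rather than a vector, and that a cumsum over the whole column is not the same as a running total restarted each year.

```r
# Toy data (assumed for illustration): two years, three observations each
df <- data.frame(year  = rep(c(2007, 2008), each = 3),
                 sales = c(10, 20, 30, 1, 2, 3))
mycol <- "sales"

# Single brackets keep the data frame wrapper; double brackets extract the vector
class(df[mycol])    # "data.frame"
class(df[[mycol]])  # "numeric"

# Running total over the whole column: ignores the year boundary
whole <- cumsum(df[[mycol]])

# Running total restarted within each year (base-R ave(), used here instead of ddply)
per_year <- ave(df[[mycol]], df$year, FUN = cumsum)

print(whole)     # 10 30 60 61 63 66
print(per_year)  # 10 30 60 1 3 6
```

Inside ddply, any reference to df sees the first kind of object (the whole frame), which is why the per-year chunk has to be addressed instead.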

The easiest way is to write your own function that does all the work on each piece, and to pass the column name down to it, e.g.

df <- ddply(df, 
            .(year), 
            .fun = function(x, colname) transform(x, cum_sales = cumsum(x[,colname])), 
            colname = "sales")
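
Wrapped up as a reusable function, the same idea might look like the sketch below. The function name add_running_total, its arguments, and the toy data are hypothetical, and the grouping column is assumed to be called year, as in the question.

```r
library(plyr)

# Hypothetical helper: adds a per-year running total of the column named in `colname`.
# Assumes `data` has a `year` column, as in the question.
add_running_total <- function(data, colname, newname = paste0("cum_", colname)) {
  ddply(data, "year", function(chunk) {
    # `chunk` holds the rows for one year; [[ ]] extracts the column as a vector
    chunk[[newname]] <- cumsum(chunk[[colname]])
    chunk
  })
}

# Toy data (assumed for illustration)
df <- data.frame(year  = rep(c(2007, 2008), each = 3),
                 sales = c(10, 20, 30, 1, 2, 3))
result <- add_running_total(df, "sales")
print(result$cum_sales)  # 10 30 60 1 3 6
```

Indexing the chunk with [[ ]] sidesteps transform's non-standard evaluation, so the column name can stay an ordinary character string.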
Anton Korobeynikov answered Sep 28 '22 16:09


The problem is that ddply expects its last arguments to be expressions that will be evaluated on chunks of the data.frame (one chunk per year, in your example). If you use df[mycol], you get the whole data.frame, not the annual chunks.

The following works, but is not very elegant: I build the expression as a string, and then convert it with eval(parse(...)).

ddply( df, .(year), transform, 
  cum_value2 = eval(parse( text = 
    sprintf( "cumsum(as.numeric(as.character(%s)))", mycol )
  ))
)
Vincent Zoonekynd answered Sep 28 '22 14:09