llply operations on multiple dataframes

Question

Is there a simple way (i.e., without having to use "for" loops) to do the following:

I have a couple of data frames that I want to summarize with a plyr operation. In this example, I have two data frames, east and west, and I want to summarize both of them by totalling spend and trials by country.

Here are the example data frames:

west <- data.frame(
    spend = sample(50:100, 50, replace = TRUE),
    trials = sample(100:200, 50, replace = TRUE),
    country = sample(c("usa", "canada", "uk"), 50, replace = TRUE)
)

east <- data.frame(
    spend = sample(50:100, 50, replace = TRUE),
    trials = sample(100:200, 50, replace = TRUE),
    country = sample(c("china", "japan", "skorea"), 50, replace = TRUE)
)

and the combined list of both dataframes:

combined <- c(west,east)

What I want to do is a ddply-type operation on both of these dataframes at the same time, and have the output be a list (at least that seems most straightforward). For example, if I were just operating on one dataframe, it would be something like:

country.df <- ddply(west, .(country), summarise,
    spend = sum(spend),
    trials = sum(trials)
)

But I want to do this at scale. I tried using similar syntax in an llply call, but that doesn't work (I have a feeling I'm missing something painfully obvious):

countries.list <- llply(combined, .(country), summarise,
    spend = sum(spend),
    trials = sum(trials)
)

That returns the error: "Error in FUN(X[[1L]], ...) : attempt to apply non-function"

... I can think of a way to do this by writing a function, then passing that through to an apply argument. But it seems like llply should be able to handle this "out of the box" since it's a fairly straightforward use of what the tool does.

What am I missing here?

Ramnath · Accepted Answer

Here is another solution that makes use of dplyr, which is a highly optimized version of plyr for data frames. dplyr syntax is very intuitive and IMHO a lot more readable than plyr. It wouldn't be an exaggeration to say that it reads more like poetry (at least to my eyes :) )

combined = list(west = west, east = east)
library(dplyr)
lapply(combined, function(dat){
   dat %.%
     group_by(country) %.%
     summarise(
       trials = sum(trials),
       spend = sum(spend)
     ) %.%
     mutate(
       status = ifelse(trials < 1000, "Good", "Bad")
     )
})
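For comparison, here is a rough plyr-only sketch of the same idea. It assumes the question's `west` and `east` data frames (rebuilt below with an arbitrary seed so the block runs on its own). Two things differ from the attempt in the question. First, the data frames are collected with `list()` rather than `c()`, because `c()` on data frames concatenates their columns into one flat list. Second, the `ddply` call is wrapped in an anonymous function, because the second positional argument of `llply` is the function to apply and `.(country)` is not a function.

```r
library(plyr)

set.seed(1)  # arbitrary seed, only so the sample data is reproducible
west <- data.frame(
  spend   = sample(50:100, 50, replace = TRUE),
  trials  = sample(100:200, 50, replace = TRUE),
  country = sample(c("usa", "canada", "uk"), 50, replace = TRUE)
)
east <- data.frame(
  spend   = sample(50:100, 50, replace = TRUE),
  trials  = sample(100:200, 50, replace = TRUE),
  country = sample(c("china", "japan", "skorea"), 50, replace = TRUE)
)

# c() flattens both data frames into a single list of column vectors
flat <- c(west, east)
length(flat)  # 6 columns, not 2 data frames

# list() keeps each data frame intact as one element
combined <- list(west = west, east = east)

# llply applies the function to each data frame; ddply does the grouping
countries.list <- llply(combined, function(dat) {
  ddply(dat, .(country), summarise,
    spend  = sum(spend),
    trials = sum(trials)
  )
})
```

The result is a list with one summarised data frame per input, which is the output shape the question asks for.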

EDIT. For completeness, here is the data.table solution. Note that for large data frames, dplyr and data.table will eat plyr for lunch :)

library(data.table)
lapply(combined, function(dat){
  data.table(dat)[
    , list(trials = sum(trials), spend = sum(spend)), by = country][
    , status := ifelse(trials < 1000, "Good", "Bad")]
})

UPDATE 2: Here is a more concise version of the dplyr solution

lapply(combined, chain, group_by(country),
  summarise(trials = sum(trials), spend = sum(spend)),
  mutate(status = ifelse(trials < 1000, "Good", "Bad"))
)
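A note for readers on a current dplyr release: the `%.%` operator and `chain()` come from early versions of the package and are no longer available, with `%>%` taking their place. A minimal sketch of the same pipeline in that style follows. The helper name `summarise_region` is hypothetical, and the sample data is rebuilt with an arbitrary seed so the block is self-contained.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sample data is reproducible
west <- data.frame(
  spend   = sample(50:100, 50, replace = TRUE),
  trials  = sample(100:200, 50, replace = TRUE),
  country = sample(c("usa", "canada", "uk"), 50, replace = TRUE)
)
east <- data.frame(
  spend   = sample(50:100, 50, replace = TRUE),
  trials  = sample(100:200, 50, replace = TRUE),
  country = sample(c("china", "japan", "skorea"), 50, replace = TRUE)
)
combined <- list(west = west, east = east)

# Hypothetical helper: total trials and spend per country, then flag each row
summarise_region <- function(dat) {
  dat %>%
    group_by(country) %>%
    summarise(trials = sum(trials), spend = sum(spend)) %>%
    mutate(status = ifelse(trials < 1000, "Good", "Bad"))
}

# lapply keeps the list names, so results$west and results$east are available
results <- lapply(combined, summarise_region)
```

Naming the helper keeps the `lapply` call short, which is roughly what the `chain()` form above was aiming for.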

Tags:

r

Marc Tulla