Is there a simple way (i.e., without having to use "for" loops) to do the following:
I have a couple data frames. I want to use a plyr operation to summarize them. In this example, I have two data frames, east and west, and I want to summarize both of them with spend and trials by country.
Here's the example data frames:
west <- data.frame(
  spend = sample(50:100, 50, replace = TRUE),
  trials = sample(100:200, 50, replace = TRUE),
  country = sample(c("usa", "canada", "uk"), 50, replace = TRUE)
)
east <- data.frame(
  spend = sample(50:100, 50, replace = TRUE),
  trials = sample(100:200, 50, replace = TRUE),
  country = sample(c("china", "japan", "skorea"), 50, replace = TRUE)
)
and the combined list of both dataframes:
combined <- c(west,east)
What I want to do is a ddply-type operation on both of these dataframes at the same time, and have the output be a list (at least that seems most straightforward). For example, if I were just operating on one dataframe, it would be something like:
country.df <- ddply(west, .(country), summarise,
  spend = sum(spend),
  trials = sum(trials)
)
But I want to do this at scale. I tried using similar syntax in an llply call, but that doesn't work (I have a feeling I'm missing something painfully obvious):
countries.list <- llply(combined, .(country), summarise,
  spend = sum(spend),
  trials = sum(trials)
)
That returns the error: "Error in FUN(X[[1L]], ...) : attempt to apply non-function"
... I can think of a way to do this by writing a function, then passing that through to an apply argument. But it seems like llply should be able to handle this "out of the box" since it's a fairly straightforward use of what the tool does.
What am I missing here?
Here is another solution that makes use of dplyr, which is a highly optimized version of plyr for data frames. dplyr syntax is very intuitive and IMHO a lot more readable than plyr. It wouldn't be an exaggeration to say that it reads more like poetry (at least to my eyes :) )
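# Build a list of data frames (list(), not c(), so each data frame stays intact)
combined = list(west = west, east = east)
library(dplyr)
# %>% replaces the %.% operator from early dplyr releases
lapply(combined, function(dat){
  dat %>%
    group_by(country) %>%
    summarise(
      trials = sum(trials),
      spend = sum(spend)
    ) %>%
    mutate(
      status = ifelse(trials < 1000, "Good", "Bad")
    )
})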
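One detail worth spelling out: the list above is built with list(), not with c() as in the question. A data frame is itself a list of columns, so c(west, east) splices the columns of both data frames into one flat list, and llply then iterates over individual columns. As far as I can tell, the error in the question has a second cause as well: llply expects a function as its second argument and got .(country) instead. A minimal sketch of the difference, using small hand-made data frames in place of the random ones from the question:

```r
# Small hand-made stand-ins for the question's random data (values are illustrative)
west <- data.frame(spend = c(60, 75), trials = c(120, 150),
                   country = c("usa", "uk"))
east <- data.frame(spend = c(55, 90), trials = c(110, 180),
                   country = c("china", "japan"))

flattened <- c(west, east)                   # data frames are lists, so c() splices their columns
combined  <- list(west = west, east = east)  # keeps each data frame intact

length(flattened)                # 6 (three columns from each data frame)
length(combined)                 # 2
sapply(combined, is.data.frame)  # TRUE for both elements
```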
EDIT. For completeness, here is the data.table solution. Note that for large data frames, dplyr and data.table will eat plyr for lunch :)
library(data.table)
lapply(combined, function(dat){
  # aggregate by country, then add the status column by reference
  data.table(dat)[
    , list(trials = sum(trials), spend = sum(spend)), by = country][
    , status := ifelse(trials < 1000, "Good", "Bad")]
})
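If you would rather stay with plyr, as in the question, the same pattern works there too: wrap the ddply call in a function and let llply apply it to each data frame. This is only a sketch under a few assumptions: the data frames are small hand-made stand-ins, the grouping column is passed as the string "country" instead of .(country), and every plyr function is called with the plyr:: prefix so that a loaded dplyr does not mask summarise.

```r
# Hand-made example data (illustrative values, not the question's random samples)
west <- data.frame(spend = c(60, 75, 80), trials = c(120, 150, 130),
                   country = c("usa", "uk", "usa"))
east <- data.frame(spend = c(55, 90, 70), trials = c(110, 180, 140),
                   country = c("china", "japan", "china"))
combined <- list(west = west, east = east)

# llply hands each data frame to the anonymous function, which runs the
# single-data-frame ddply summary from the question
countries.list <- plyr::llply(combined, function(dat) {
  plyr::ddply(dat, "country", plyr::summarise,
              spend = sum(spend),
              trials = sum(trials))
})

countries.list$west  # one row per country in west
countries.list$east  # one row per country in east
```

The result is a named list with one summarised data frame per input, which is the shape the question asked for.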
UPDATE 2: Here is a more concise version of the dplyr solution
lapply(combined, chain, group_by(country),
summarise(trials = sum(trials), spend = sum(spend)),
mutate(status = ifelse(trials < 1000, "Good", "Bad"))
)
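A caveat on the concise version: as far as I know, chain and the %.% operator only existed in early dplyr releases and were later dropped in favour of %>%. With a current dplyr, roughly the same brevity can be had by giving the pipeline a name and passing that to lapply. In the sketch below, summarise_by_country and the "region" column name are my own hypothetical choices, and the data frames are small hand-made stand-ins.

```r
library(dplyr)

# Hand-made example data (illustrative values)
west <- data.frame(spend = c(60, 75, 80), trials = c(120, 150, 130),
                   country = c("usa", "uk", "usa"), stringsAsFactors = FALSE)
east <- data.frame(spend = c(55, 90, 70), trials = c(110, 180, 140),
                   country = c("china", "japan", "china"), stringsAsFactors = FALSE)
combined <- list(west = west, east = east)

# Hypothetical helper: the per-data-frame pipeline, written once
summarise_by_country <- function(dat) {
  dat %>%
    group_by(country) %>%
    summarise(trials = sum(trials), spend = sum(spend)) %>%
    mutate(status = ifelse(trials < 1000, "Good", "Bad"))
}

# One summarised data frame per element of the list
results <- lapply(combined, summarise_by_country)

# Optional: stack the results into one data frame, keeping the list names
# in a column called "region" (an assumed name)
stacked <- bind_rows(results, .id = "region")
```

Keeping the results as a list matches what the question asked for; the bind_rows step is only there in case a single table is more convenient downstream.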