Use of ddply + mutate with a custom function?

Tags: r, plyr

I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.

Related

Custom Function not recognized by ddply {plyr}...
How do I pass variables to a custom function in ddply?
r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)

Here's an example data set:

library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)

Normally, I'd use ddply like so:

df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))

My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:

# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate,
                  mean = function(df) { mean(df$value) })

Error: attempt to replicate an object of type 'closure'

None of the help I found on custom functions uses mutate, which seems inconsistent, or at least annoying, to me, as the analog to my implemented solution is:

df_mean <- function(df) {
    temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
    temp
}

df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean

In-line, looks like I have to do this:

df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
    temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
    temp})$mean

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Thanks for helping me "get it"!

Update after @Gregor's answer

Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant... thinking they were arguments to ddply regarding how to handle the result vs. actually being the functions themselves. So, thanks for that big insight.

Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a column with the name of the column in the df that gets returned.

Lastly if I do use mutate, it's helpful to now realize I can return a vector result and get the right result. Thus, I can do this, which I've now understood after reading your answer:

# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it like
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
    rep(mean(x), length(x))
}

df_ply_5 <- ddply(df, .(id), mutate,
              mean = custom_mean(value))

Thanks again for your in-depth answer!

Update per @Gregor's last comment

Hmmm. I used rep(mean(x), length(x)) due to this observation for df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time making this post, I just saw that it didn't give me an error!):

df_mean <- function(x) {
    data.frame(mean = mean(x$value))
}

df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean

df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8

So, I'm thinking that my code was actually an accident based on the fact that I had 3 id variables repeated 3 times. Thus the actual return was the equivalent of summarize (one row per id value), and recycled. Testing that theory appears accurate if I update my data frame like so:

df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)

I get an error when trying to use the df_ply_3 method with df_mean():

Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) : 
  replacement has 4 rows, data has 10

So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (one value per group). My output was therefore just a data.frame of three values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
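The recycling behaviour described above can be reproduced without plyr. The following is only a base R sketch; the names df9, df10, recycled and aligned are made up for illustration. It shows the silent recycling on the 9-row data, one way to align per-group means with rows, and the error once a fourth group is added.

```r
# Base R only; variable names here are illustrative, not from the post.
df9 <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)
group_means <- tapply(df9$value, df9$id, mean)   # one mean per id

# Assigning 3 values to 9 rows is silently recycled as 2, 5, 8, 2, 5, 8, ...
recycled <- df9
recycled$mean <- as.numeric(group_means)

# Looking each row's id up in the named vector aligns the means correctly.
aligned <- df9
aligned$mean <- as.numeric(group_means[as.character(aligned$id)])

# With a fourth, single-row group, 4 values no longer divide 10 rows evenly,
# so the same assignment stops with an error instead of recycling.
df10 <- data.frame(id = c(rep(letters[1:3], each = 3), "d"), value = 1:10)
means10 <- as.numeric(tapply(df10$value, df10$id, mean))
outcome <- tryCatch({
  df10$mean <- means10
  "assigned"
}, error = function(e) "error")
```

Inside mutate the situation is different: the expression is evaluated on one mini data frame at a time, so a single mean is recycled only within its own group.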

In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!

asked Nov 14 '14 16:11

Hendy

1 Answer

You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.

With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
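As a rough mental model (not plyr's actual implementation), the split, apply and combine steps can be sketched in base R. The helper add_group_mean and the variable names are assumptions made up for this illustration; the data matches the question's example.

```r
# Toy data matching the question's example.
df <- data.frame(id = rep(letters[1:3], each = 3), value = 1:9)

# 1. Split: one "mini" data frame per id.
pieces <- split(df, df$id)

# 2. Apply: the function takes a data frame and returns a data frame.
add_group_mean <- function(piece) {
  piece$mean <- mean(piece$value)  # a single value is recycled to nrow(piece)
  piece
}
results <- lapply(pieces, add_group_mean)

# 3. Combine: stack the pieces back into one data frame.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
```

The point of the sketch is the contract in step 2: whatever plays the role of .fun gets a data frame in and hands a data frame back.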

mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.

mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))

If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as argument, and return a data frame.

If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply; they're just passed on to be used by mutate or summarize. And functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why the call below refers to the column mpg directly:

ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))

Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)
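To make the function-versus-expression distinction concrete, here is a small base R sketch. It assumes (I believe correctly, but treat it as an assumption) that mutate ends up assigning the evaluated value of each named argument as a column, which would explain the "attempt to replicate an object of type 'closure'" error from the question. The data frame d is made up for the example.

```r
d <- data.frame(value = 1:3)

# A function object is not column data, so assigning it as a column fails.
err <- tryCatch({
  d$mean <- function(v) mean(v)
  NULL
}, error = function(e) e)

# Calling the function yields a number, which is recycled to every row.
d$mean <- (function(v) mean(v))(d$value)
```

So an anonymous function is fine as long as it is called: the column definition has to evaluate to data, not to a function.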

Thus it doesn't work with anonymous functions--or functions at all. Pass it an expression! You can define a custom function beforehand.

custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))

This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give those functions their arguments; you're not just passing the functions.
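For instance, a two-argument helper can be fed two different columns. This is just a sketch: mean_ratio is a hypothetical function invented for the example, and it assumes plyr is installed.

```r
library(plyr)

# Hypothetical helper: ratio of the means of two vectors.
mean_ratio <- function(num, den) {
  mean(num) / mean(den)
}

# summarize: one row per cyl group.
per_cyl <- ddply(mtcars, "cyl", summarize,
                 mpg.per.wt = mean_ratio(mpg, wt))

# mutate: all original rows, with the group's value recycled within each group.
with_col <- ddply(mtcars, "cyl", mutate,
                  mpg.per.wt = mean_ratio(mpg, wt))
```

In both calls the thing handed to mutate or summarize is the expression mean_ratio(mpg, wt), evaluated on each mini data frame, not the function mean_ratio itself.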

You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbinded on:

mean.mpg.mutate = function(df) {
    cbind.data.frame(df, mean.mpg = mean(df$mpg))
}

mean.mpg.summarize = function(df) {
    data.frame(mean.mpg = mean(df$mpg))
}

ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
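If hard-coding the column feels too rigid, one possible variation relies on ddply forwarding extra named arguments to .fun through its ... argument. The helper col_mean_summarize and its col parameter are hypothetical names for this sketch, which again assumes plyr is installed.

```r
library(plyr)

# Hypothetical helper: summarize-like function for any column named by `col`.
col_mean_summarize <- function(piece, col) {
  out <- data.frame(mean(piece[[col]]))
  names(out) <- paste0("mean.", col)
  out
}

# Extra named arguments to ddply are passed on to the function.
mpg_means <- ddply(mtcars, "cyl", col_mean_summarize, col = "mpg")
hp_means  <- ddply(mtcars, "cyl", col_mean_summarize, col = "hp")
```

The function still takes a mini data frame and returns a data frame; the column name just travels along as an ordinary argument.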

tl;dr

Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?

Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.

Mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.

If you don't use mutate/summarize, then your function needs to take and return a data frame.

If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).

What about `dplyr`?

This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it replaces the nesting of ddply (with mutate or summarize as arguments) with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be

library(dplyr)
group_by(mtcars, cyl) %>%
    mutate(mean.mpg = mean(mpg))

With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.

answered Sep 21 '22 22:09

Gregor Thomas
