
dplyr summarize with a function of a dataframe

Tags: r, group-by, dplyr

I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.

As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:

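library(dplyr)  # provides %>%, filter(), group_by(), summarize()
data(iris)

# Return the slope of Petal.Width regressed on Petal.Length
lm.func = function(.data){
  lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
  out = summary(lm.fit)$coefficients[2,1]  # row 2 = Petal.Length, column 1 = estimate
  return(out)
}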

Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:

iris %>% filter(Species == "setosa") %>% lm.func(.)

But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:

iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))

Even though I know this doesn't work, my idea is to try to pass each subset of iris to the lm.func function.

To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.

Is it possible to use summarize() in this way?

asked Mar 28 '15 at 15:03 by Mark T Patterson



1 Answer

You can use do(), which evaluates its expression once per group, with . bound to that group's rows:

iris %>%
  group_by(Species) %>%
  do(data.frame(coef.val = lm.func(.)))
#      Species  coef.val
# 1     setosa 0.2012451
# 2 versicolor 0.3310536
# 3  virginica 0.1602970
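
In dplyr 1.0.0 and later, do() is marked as superseded. As a minimal sketch of an alternative, assuming dplyr 0.8.0 or newer is installed, group_modify() applies a function to each group's rows (available as .x, with the grouping column removed) and expects a data frame back. The names by_group and by_split below are illustrative only, and the base R split()/sapply() line is included just as a cross-check that needs no package.

```r
library(dplyr)

# Same helper as in the question: slope of Petal.Width on Petal.Length.
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# One row per Species; .x holds the rows of the current group.
by_group <- iris %>%
  group_by(Species) %>%
  group_modify(~ data.frame(coef.val = lm.func(.x))) %>%
  ungroup()

# Base R cross-check: split into a list of data frames, apply to each.
by_split <- sapply(split(iris, iris$Species), lm.func)

print(by_group)
print(by_split)
```

Both approaches should give the same per-species slopes as the do() call above; group_modify() returns a two-column data frame (Species, coef.val), while sapply() returns a named numeric vector.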
answered Oct 10 '22 at 03:10 by akrun