I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.
As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:
library(dplyr)
data(iris)
# Fit Petal.Width ~ Petal.Length and return the slope estimate
lm.func = function(.data){
  lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
  out = summary(lm.fit)$coefficients[2,1]
  return(out)
}
Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:
iris %>% filter(Species == "setosa") %>% lm.func(.)
But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:
iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))
Even though I know this doesn't work, my idea is to try to pass each subset of iris to the lm.func function.
To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.
Is it possible to use summarize() in this way?
To get a summary of a data frame in base R, call the summary() function and pass the data frame as its argument. Additional arguments passed to summary() affect the output, which contains a summary for each column.
All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr.
The summarize() function from the dplyr package reduces a data frame to summary values. When the observations are first grouped by one or more categorical variables with group_by(), summarize() returns a summary for each group.
%>% is called the forward pipe operator in R. It provides a mechanism for chaining commands: it forwards a value, or the result of an expression, into the next function call or expression. It is defined by the magrittr package (CRAN) and is heavily used by dplyr (CRAN).
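To illustrate what the pipe forwards in the question's attempt, here is a minimal sketch. It assumes the usual magrittr behaviour, where `.` refers to the entire left-hand side (here the grouped data frame) rather than to one group at a time, so the summarize() call runs but gives every Species the slope fitted on all rows. The names piped and whole are just illustrative.

```r
library(dplyr)

# Same helper as in the question: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# Inside summarize(), `.` is still the whole piped data frame,
# so lm.func() never sees a per-group subset.
piped <- iris %>%
  group_by(Species) %>%
  summarize(coef.val = lm.func(.))

# Slope fitted on the full data set, for comparison
whole <- lm.func(iris)

# TRUE if all three groups received the full-data slope
isTRUE(all.equal(as.numeric(piped$coef.val), rep(as.numeric(whole), 3)))
```

If the comparison comes back TRUE, that confirms the function was applied to the full data three times instead of once per subset.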
You can try with do(), which evaluates its expression once per group with `.` standing for that group's rows:
iris %>%
  group_by(Species) %>%
  do(data.frame(coef.val = lm.func(.)))
#      Species  coef.val
# 1     setosa 0.2012451
# 2 versicolor 0.3310536
# 3  virginica 0.1602970
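Since do() has been marked as superseded in dplyr 1.0.0 and later, two alternatives may be worth trying. The sketch below is not part of the original answer: it assumes dplyr 0.8.0 or later for group_modify(), and the object names slopes, base_result and dplyr_result are only illustrative.

```r
library(dplyr)

# Same helper as in the question: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# Base R: split the data frame by Species and apply the function to each piece
slopes <- sapply(split(iris, iris$Species), lm.func)
base_result <- data.frame(Species = names(slopes), coef.val = unname(slopes))

# dplyr: group_modify() passes each group's rows to the function as .x
# and expects a data frame back; the grouping column is added to the result
dplyr_result <- iris %>%
  group_by(Species) %>%
  group_modify(~ data.frame(coef.val = lm.func(.x))) %>%
  ungroup()

base_result
dplyr_result
```

Both objects should have one row per Species and a coef.val column, matching the do() output shown above.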