(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y
only using Z==z
by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found .
passes entire tibble and the code above returns the same values for all groups.
Group_by () function belongs to the dplyr package in the R programming language, which groups the data frames. Group_by () function alone will not give any output.
If we want to print our complete tibble to the console, we can simply use the print function in combination with the nrow function. Have a look at the following R syntax:
We’ll start by loading dplyr: The most important grouping verb is group_by (): it takes a data frame and one or more variables to group by: You can see the grouping when you print the data: Or use tally () to count the number of rows in each group. The sort argument is useful if you want to see the largest groups up front.
Most tibble (previously called tbl_df) operations are based on the dplyr package. Let’s install and load the dplyr package to R: Now, we can convert the iris data frame to a tibble with the as_tibble function of the dplyr package:
There are a couple things you can simplify. One is in your function: since you're passing in a value z
to the function, you don't need to use get("z")
. You have a z
in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val
.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do
, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val
to the argument z
.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest()
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do
is being deprecated in favor of purrr::map
, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data
, then map over that column and again supply z = z_val
. When you unnest the y_hat
column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0
to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
As an extension/modification to @patL's answer, you can also wrap the tidyverse
solution within purrr:map
to return a list
of two tibble
s, one for each z
value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With