dplyr: passing a grouped tibble to a custom function

Tags: r, dplyr

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.

> data
   village     A     Z      Y 
     <chr> <int> <int>   <dbl> 
 1       a     1     1   500     
 2       a     1     1   400     
 3       a     1     0   800  
 4       b     1     0   300  
 5       b     1     1   700  

For example, I would like to calculate the mean of Y by village, using only the rows where Z == z. With z = 1, I want (500 + 400)/2 = 450 for village "a" and 700 for village "b".

Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.

z <- 1 # z takes 0 or 1
data %>%
    group_by(village) %>% # grouping by village
    summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable

Y_hat_village <- function(data_village, z){
    # This function takes a part of tibble (`data_village`) and a variable `z`
    # Calculate the mean for a specific z in a village
    data_z <- data_village %>% filter(Z==get("z"))
    return(mean(data_z$Y))
}

However, I found that . passes the entire tibble rather than the rows of the current group, so the code above returns the same value for every group.
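The behaviour can be reproduced with a minimal sketch, assuming the magrittr pipe that dplyr re-exports and a hand-typed copy of the table above (the name toy is hypothetical). Inside summarize, nrow(.) reports the row count of the whole tibble for every village, while n() reports the per-group count.

```r
library(dplyr)

# Hypothetical copy of the question's table
toy <- tibble(
  village = c("a", "a", "a", "b", "b"),
  Z = c(1L, 1L, 0L, 0L, 1L),
  Y = c(500, 400, 800, 300, 700)
)

res <- toy %>%
  group_by(village) %>%
  summarize(
    n_dot = nrow(.), # `.` is the full piped-in tibble, not the group slice
    n_group = n()    # n() is evaluated per group
  )

print(res)
```

Because . is bound to the left-hand side of the pipe, any function that receives it inside summarize sees all five rows.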

user2978524 asked Jun 19 '18 12:06

2 Answers

There are a couple of things you can simplify. One is in your function: since you're passing a value z into the function, you don't need get("z"). You can pass in the z from the global environment or, more safely, store the value under a different name so you don't run into scoping issues, and pass that to the function. Here I'm calling it z_val.

library(tidyverse)

z_val <- 1

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  return(mean(data_z$Y))
}

You can call the function on each group using do, which gives you a list-column, and then unnest that column. Again, note that I'm passing the variable z_val to the argument z.

df %>%
  group_by(village) %>%
  do(y_hat = Y_hat_village2(., z = z_val)) %>%
  unnest(y_hat) # name the column; a bare unnest() is deprecated in newer tidyr
#> # A tibble: 2 x 2
#>   village y_hat
#>   <chr>   <dbl>
#> 1 a         450
#> 2 b         700

However, do is being phased out (it is marked as superseded as of dplyr 1.0.0), and one common replacement is nesting plus purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. After you unnest the y_hat column, the original data remains as a nested column, so the rest of the columns are still accessible.

df %>%
  group_by(village) %>%
  nest() %>%
  mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
  unnest(y_hat)
#> # A tibble: 2 x 3
#>   village data             y_hat
#>   <chr>   <list>           <dbl>
#> 1 a       <tibble [3 × 3]>   450
#> 2 b       <tibble [2 × 3]>   700
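Since the function returns a single number per group, the map-then-unnest step can be shortened. This is only a sketch, assuming purrr's map_dbl and a hand-typed copy of the sample data; it produces a plain double column directly, so no unnest is needed.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hand-typed copy of the sample data
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1L, 1L, 1L, 1L, 1L),
  Z = c(1L, 1L, 0L, 0L, 1L),
  Y = c(500, 400, 800, 300, 700)
)

z_val <- 1

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

res <- df %>%
  group_by(village) %>%
  nest() %>%
  ungroup() %>%
  # map_dbl() returns a double vector, one value per nested data frame
  mutate(y_hat = map_dbl(data, ~ Y_hat_village2(.x, z = z_val)))

print(res)
```

map_dbl raises an error if the function ever returns something other than a single number, which can be a useful check here.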

Just to check that everything works, I also passed in z = 0 to confirm (1) that there are no scoping issues and (2) that other values of z work.

df %>%
  group_by(village) %>%
  nest() %>%
  mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
  unnest(y_hat)
#> # A tibble: 2 x 3
#>   village data             y_hat
#>   <chr>   <list>           <dbl>
#> 1 a       <tibble [3 × 3]>   800
#> 2 b       <tibble [2 × 3]>   300
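Another per-group option, sketched here under the assumption that dplyr 0.8.0 or later is installed, is group_modify. It calls a function once per group with that group's rows and expects a data frame back, so the scalar result is wrapped in a one-row tibble. The data below is a hand-typed copy of the sample data.

```r
library(dplyr)

# Hand-typed copy of the sample data
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1L, 1L, 1L, 1L, 1L),
  Z = c(1L, 1L, 0L, 0L, 1L),
  Y = c(500, 400, 800, 300, 700)
)

z_val <- 1

Y_hat_village2 <- function(data, z) {
  data_z <- data %>% filter(Z == z)
  mean(data_z$Y)
}

res <- df %>%
  group_by(village) %>%
  # .x is the tibble of rows for the current village only
  group_modify(~ tibble(y_hat = Y_hat_village2(.x, z = z_val))) %>%
  ungroup()

print(res)
```

Unlike the . in a piped summarize call, .x here is the group's slice, which is what the custom function needs.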
camille answered Nov 01 '22 19:11


As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:

z <- c(0, 1)
map(z, ~ df %>%
  filter(Z == .x) %>%
  group_by(village) %>%
  summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
#  village Y.mean
#  <fct>    <dbl>
#1 a         800.
#2 b         300.
#
#[[2]]
## A tibble: 2 x 2
#  village Y.mean
#  <fct>    <dbl>
#1 a         450.
#2 b         700.
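If a single tibble is more convenient than a list, the pieces can be stacked. This is a sketch rather than part of the original answer: the list names and the column name z_value are hypothetical, and the data is a hand-typed copy of the sample data.

```r
library(dplyr)
library(purrr)

# Hand-typed copy of the sample data
df <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1L, 1L, 1L, 1L, 1L),
  Z = c(1L, 1L, 0L, 0L, 1L),
  Y = c(500, 400, 800, 300, 700)
)

z <- c(0, 1)

res_list <- map(z, ~ df %>%
  filter(Z == .x) %>%
  group_by(village) %>%
  summarise(Y.mean = mean(Y)))

# Name each list element so bind_rows() can record which z it came from
names(res_list) <- paste0("Z=", z)
combined <- bind_rows(res_list, .id = "z_value")

print(combined)
```

The z_value column keeps track of which filter produced each row, so the per-village means for both z values sit in one table.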

Sample data

df <- read.table(text =
    "  village     A     Z      Y
 1       a     1     1   500
 2       a     1     1   400
 3       a     1     0   800
 4       b     1     0   300
 5       b     1     1   700  ", header = TRUE)
Maurits Evers answered Nov 01 '22 21:11