
Create multiple list columns from data columns of a nested data frame

Tags: r, dplyr, purrr

The goal is to create multiple list columns from the columns stored in the data column of a nested data frame. The following code achieves that, but it is quite long, and I wonder whether it can be shortened with tidyverse tools (dplyr, purrr, etc.). In a non-nested data frame I would use, e.g., dplyr's across().

# R version 3.6.1

library(dplyr) # 1.0.7
library(tidyr) # 1.2.0


df_distribution <- iris %>% 
  dplyr::group_by(Species) %>% 
  tidyr::nest() %>% 
  dplyr::mutate(Sepal.Length = purrr::map(data, ~ dplyr::select(.x, Sepal.Length) %>% 
                                            dplyr::group_by(Sepal.Length) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Sepal.Width  = purrr::map(data, ~ dplyr::select(.x, Sepal.Width) %>% 
                                            dplyr::group_by(Sepal.Width) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Petal.Length = purrr::map(data, ~ dplyr::select(.x, Petal.Length) %>% 
                                            dplyr::group_by(Petal.Length) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) ) %>% 
  dplyr::mutate(Petal.Width  = purrr::map(data, ~ dplyr::select(.x, Petal.Width) %>% 
                                            dplyr::group_by(Petal.Width) %>% 
                                            dplyr::summarise(n = n() ) %>% 
                                            dplyr::mutate(perc = n / sum(n) ) %>% 
                                            dplyr::select(-n) ) )

My ultimate goal is to draw random samples from the resulting empirical distributions. That step is not part of the code above, but I would appreciate pointers to helpful resources for it, too.
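On the drawing step: one possible approach is a weighted draw with base R's sample.int(), using the perc column as probabilities. The sketch below is only an illustration under assumptions. The helper name draw_from_dist and the object dist_setosa are hypothetical, and the distribution table is rebuilt here for a single species so that the block runs on its own. It assumes each distribution is a two-column table with the values first and perc second, as produced by the code above.

```r
library(dplyr)

# Empirical distribution of Sepal.Length for one species, in the same shape
# as the list-column elements above: a value column plus `perc`.
dist_setosa <- iris %>%
  filter(Species == "setosa") %>%
  count(Sepal.Length) %>%
  mutate(perc = n / sum(n)) %>%
  select(-n)

# Hypothetical helper (not from the original post): draw `size` values from a
# distribution table whose first column holds the values and whose `perc`
# column holds the probabilities.
draw_from_dist <- function(dist, size) {
  # Sample row indices instead of values: sample(x, ...) treats a single
  # number x as 1:x, which would be wrong for a one-row distribution.
  idx <- sample.int(nrow(dist), size = size, replace = TRUE, prob = dist$perc)
  dist[[1]][idx]
}

set.seed(1)  # only to make the draws reproducible
draws <- draw_from_dist(dist_setosa, size = 1000)
```

With the nested data frame from above, the same helper could presumably be applied per species, e.g. purrr::map(df_distribution$Sepal.Length, draw_from_dist, size = 1000).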

MatSchu asked Mar 06 '26 18:03



1 Answer

We could make a custom function to use within purrr::map:

library(dplyr)
library(tidyr)
library(purrr)

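# Relative frequency of each distinct value in `col` (passed unquoted)
f <- function(data, col) {
  data %>%
    group_by({{ col }}) %>%        # embrace the column argument
    summarise(n = n()) %>%         # count rows per distinct value
    mutate(perc = n / sum(n)) %>%  # convert counts to proportions
    select(-n)
}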

df_distributionNew <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    Sepal.Length = map(data, ~ f(.x, Sepal.Length)),
    Sepal.Width = map(data, ~ f(.x, Sepal.Width)),
    Petal.Length = map(data, ~ f(.x, Petal.Length)),
    Petal.Width = map(data, ~ f(.x, Petal.Width))
    )

identical(df_distribution, df_distributionNew)
# [1] TRUE

There is still repetition within mutate(); I am not sure how to remove it.
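One way the remaining repetition might be removed is to pass the column names as strings and loop over them. The following is a minimal sketch, not a tested drop-in replacement. The names f_chr, cols, and df_distributionLoop are hypothetical, and f_chr is a string-based variant of f that selects the grouping column with across(all_of()), which assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical variant of f() that takes the column name as a string
f_chr <- function(data, col) {
  data %>%
    group_by(across(all_of(col))) %>%  # group by the column named in `col`
    summarise(n = n()) %>%             # count rows per distinct value
    mutate(perc = n / sum(n)) %>%      # convert counts to proportions
    select(-n)
}

# Columns to turn into list columns of empirical distributions
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

df_distributionLoop <- iris %>%
  group_by(Species) %>%
  nest()

# Add one list column per entry in `cols`; each element is the distribution
# table for the corresponding species.
for (col in cols) {
  df_distributionLoop[[col]] <- map(df_distributionLoop$data, f_chr, col = col)
}
```

The loop keeps the column list in one place, so adding or dropping a variable only means editing cols. Whether the result is identical() to df_distribution has not been checked here.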

zx8754 answered Mar 08 '26 07:03



