select non-missing variables in a purrr loop

Tags: r, dplyr, purrr, lm

Consider this example:

mydata <- tibble(ind_1 = c(NA,NA,3,4),
                 ind_2 = c(2,3,4,5),
                 ind_3 = c(5,6,NA,NA),
                 y = c(28,34,25,12),
                 group = c('a','a','b','b'))

> mydata
# A tibble: 4 x 5
  ind_1 ind_2 ind_3     y group
  <dbl> <dbl> <dbl> <dbl> <chr>
1    NA     2     5    28 a    
2    NA     3     6    34 a    
3     3     4    NA    25 b    
4     4     5    NA    12 b

For each group, I want to regress y on whichever variables have no missing values in that group, and store the corresponding lm object in a list-column.

That is:

for group a, these variables correspond to ind_2 and ind_3
for group b, they correspond to ind_1 and ind_2

I tried the following, but it does not work:

mydata %>% group_by(group) %>% nest() %>% 
  do(filtered_df <- . %>% select(which(colMeans(is.na(.)) == 0)),
     myreg = lm(y~ names(filtered_df)))

Any ideas? Thanks!

asked Sep 10 '18 19:09

ℕʘʘḆḽḘ

2 Answers

We can use map and mutate. We can either select and model in one step (nestdat1), or do it in separate steps with two map calls if you want to preserve the filtered data (nestdat2):

library(tidyverse)

nestdat1 <- mydata %>%
  group_by(group) %>%
  nest() %>%
  mutate(model = data %>% map(~ select_if(., funs(!any(is.na(.)))) %>%
                                lm(y ~ ., data = .)))

nestdat2 <- mydata %>%
  group_by(group) %>%
  nest() %>%
  mutate(data = data %>% map(~ select_if(., funs(!any(is.na(.))))),
         model = data %>% map(~ lm(y ~ ., data = .)))

Output:

They produce different data columns:

> nestdat1 %>% pull(data)
[[1]]
# A tibble: 2 x 4
  ind_1 ind_2 ind_3     y
  <dbl> <dbl> <dbl> <dbl>
1    NA     2     5    28
2    NA     3     6    34

[[2]]
# A tibble: 2 x 4
  ind_1 ind_2 ind_3     y
  <dbl> <dbl> <dbl> <dbl>
1     3     4    NA    25
2     4     5    NA    12

> nestdat2 %>% pull(data)
[[1]]
# A tibble: 2 x 3
  ind_2 ind_3     y
  <dbl> <dbl> <dbl>
1     2     5    28
2     3     6    34

[[2]]
# A tibble: 2 x 3
  ind_1 ind_2     y
  <dbl> <dbl> <dbl>
1     3     4    25
2     4     5    12

But the same model column:

> nestdat1 %>% pull(model)
[[1]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_2        ind_3  
         16            6           NA  

[[2]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_1        ind_2  
         64          -13           NA  


> nestdat2 %>% pull(model)
[[1]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_2        ind_3  
         16            6           NA  

[[2]]

Call:
lm(formula = y ~ ., data = .)

Coefficients:
(Intercept)        ind_1        ind_2  
         64          -13           NA
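
In more recent tidyverse releases, funs() is deprecated and the scoped select_if() has been superseded. A minimal sketch of the same idea with select(where(...)), assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later; the names nestdat3 and drop_na_cols are made up for illustration and are not part of the code above:

```r
library(dplyr)
library(tidyr)
library(purrr)

mydata <- tibble(ind_1 = c(NA, NA, 3, 4),
                 ind_2 = c(2, 3, 4, 5),
                 ind_3 = c(5, 6, NA, NA),
                 y     = c(28, 34, 25, 12),
                 group = c("a", "a", "b", "b"))

# Hypothetical helper: keep only the columns that contain no missing values
drop_na_cols <- function(df) select(df, where(function(col) !anyNA(col)))

nestdat3 <- mydata %>%
  nest(data = -group) %>%                       # one row per group
  mutate(data  = map(data, drop_na_cols),       # filtered data per group
         model = map(data, function(df) lm(y ~ ., data = df)))

# Predictors that entered each group's model
map(nestdat3$model, function(m) attr(terms(m), "term.labels"))
```

Using a named helper instead of nested ~ lambdas avoids any ambiguity about which "." refers to the data frame and which to a column.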

answered Oct 17 '22 00:10

acylam

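Here's another tidyverse option. It returns a plain list of models, one per group. If you wish to keep the models in a tibble, add the list as a column of the nested tibble (one row per group) rather than of mydata, which has four rows: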

library(tidyverse)
mydata %>%
  nest(data = -group) %>%
  pull(data) %>%
  map(~lm(y ~., discard(.,anyNA)))
# [[1]]
# 
# Call:
# lm(formula = y ~ ., data = discard(., anyNA))
# 
# Coefficients:
# (Intercept)        ind_2        ind_3  
#          16            6           NA  
# 
# 
# [[2]]
# 
# Call:
# lm(formula = y ~ ., data = discard(., anyNA))
# 
# Coefficients:
# (Intercept)        ind_1        ind_2  
#          64          -13           NA  
# 
#
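
To keep the models next to their group rather than in a bare list, the same discard() call can go inside mutate(). A minimal sketch, assuming tidyr 1.0.0 or later for the nest(data = -group) syntax; nested is a made-up name. The NA coefficients are expected with this toy data: each group has only two rows, so lm can estimate an intercept and one slope but not a second slope.

```r
library(dplyr)
library(tidyr)
library(purrr)

mydata <- tibble(ind_1 = c(NA, NA, 3, 4),
                 ind_2 = c(2, 3, 4, 5),
                 ind_3 = c(5, 6, NA, NA),
                 y     = c(28, 34, 25, 12),
                 group = c("a", "a", "b", "b"))

nested <- mydata %>%
  nest(data = -group) %>%
  # discard() drops every column that contains an NA before fitting
  mutate(model = map(data, function(df) lm(y ~ ., data = discard(df, anyNA))))

# One named coefficient vector per group;
# NA marks a slope that could not be estimated from two rows
set_names(map(nested$model, coef), nested$group)
```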

answered Oct 17 '22 00:10

Moody_Mudskipper
