Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I apply a custom function that adds new columns to a dataframe to a subset of existing columns?

I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.

Here is an example of what I am trying to do with made-up data:

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))

For example, the code below shows an example of what I want to do for a single pair of columns.

mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))

#OUTPUT:

    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:

col.list = c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

#OUTPUT:

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2  v
1  Sarah        1        1     NA      1    yum  gross  1
2   Ella        4        4      2      2    yum    yum  4
3 Carmen       NA        3      0      1   <NA>    yum  3
4  Dinah        6        5      4      4  gross   <NA>  6
5 Billie       NA       NA     NA      2   <NA>  gross NA

[[2]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1  Sarah        1        1     NA      1    yum  gross 1
2   Ella        4        4      2      2    yum    yum 2
3 Carmen       NA        3      0      1   <NA>    yum 0
4  Dinah        6        5      4      4  gross   <NA> 4
5 Billie       NA       NA     NA      2   <NA>  gross 2

[[3]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2     v
1  Sarah        1        1     NA      1    yum  gross   yum
2   Ella        4        4      2      2    yum    yum   yum
3 Carmen       NA        3      0      1   <NA>    yum   yum
4  Dinah        6        5      4      4  gross   <NA> gross
5 Billie       NA       NA     NA      2   <NA>  gross gross

The problems with this attempt are:

  1. the new columns are not correctly named (they should be named cheese, milk and tofu rather than all be called v)
  2. the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in col.list).
like image 901
sf26749 Avatar asked Oct 29 '21 21:10

sf26749


1 Answers

(1) You have to wrap v into the curly-curly operator and use :=:

library(dplyr)

col.list <- c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

returns

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

[...]

which is one step closer to your desired output.

(2) But to get your desired output, I suggest using purrr:

library(purrr)
library(dplyr)

col.list %>% 
  map(~mydata %>% 
        select(name, starts_with(.x)) %>% 
        mutate({{ .x }} := ifelse(
          is.na(!!sym(paste0(.x, ".1"))), 
          !!sym(paste0(.x, ".2")), 
          !!sym(paste0(.x, ".1"))
          )
        )
  ) %>% 
  reduce(left_join, by = "name")

This returns

    name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1  Sarah        1        1      1     NA      1    1    yum  gross   yum
2   Ella        4        4      4      2      2    2    yum    yum   yum
3 Carmen       NA        3      3      0      1    0   <NA>    yum   yum
4  Dinah        6        5      6      4      4    4  gross   <NA> gross
5 Billie       NA       NA     NA     NA      2    2   <NA>  gross gross
like image 159
Martin Gal Avatar answered Sep 21 '22 12:09

Martin Gal