Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combine data file and label file together to have one single labelled dataframe in R

I have two data frames, one is survey data (data.csv) and another one is label data (label.csv). Here is the sample data (My original data has about 150 variables)

#sample data

df <- tibble::tribble(
  ~id, ~House_member, ~dob, ~age_quota, ~work, ~sex, ~pss,
  1L,            4L,  1983L,  2L,        2L,     1,      1,
  2L,            1L,  1940L,  7L,        2L,     1,      2,
  3L,            2L,  1951L,  5L,        6L,     1,      1,
  4L,            4L,  1965L,  2L,        2L,     1,      4,
  5L,            3L,  1965L,  2L,        3L,     1,      1,
  6L,            1L,  1951L,  3L,        1L,     1,      3,
  7L,            1L,  1955L,  1L,        1L,     1,      3,
  8L,            4L,  1982L,  2L,        2L,     2,      5,
  9L,            2L,  1990L,  2L,        4L,     2,      3,
  10L,            2L,  1953L, 3L,        2L,     2,      4
)


#sample label data
label <- tibble::tribble(
                ~variable, ~value,                           ~label,
           "House_member",     NA, "How many people live with you?",
           "House_member",     1L,                       "1 person",
           "House_member",     2L,                      "2 persons",
           "House_member",     3L,                      "3 persons",
           "House_member",     4L,                      "4 persons",
           "House_member",     5L,                      "5 persons",
           "House_member",     6L,                      "6 persons",
           "House_member",     7L,                      "7 persons",
           "House_member",     8L,                      "8 persons",
           "House_member",     9L,                      "9 persons",
           "House_member",    10L,                     "10 or more",
                    "dob",     NA,                  "date of brith",
              "age_quota",     NA,                      "age_quota",
              "age_quota",     1L,                          "10-14",
              "age_quota",     2L,                          "15-19",
              "age_quota",     3L,                          "20-29",
              "age_quota",     4L,                          "30-39",
              "age_quota",     5L,                          "40-49",
              "age_quota",     6L,                          "50-70",
              "age_quota",     7L,                           "70 +",
                   "work",     NA,        "what is your occupation?",
                   "work",     1L,                      "full time",
                   "work",     2L,                      "part time",
                   "work",     3L,                        "retired",
                   "work",     4L,                        "student",
                   "work",     5L,                      "housewife",
                   "work",     6L,                     "unemployed",
                   "work",     7L,                          "other",
                   "work",     8L,                   "kid under 15",
                    "sex",     NA,                        "gender?",
                    "sex",     1L,                            "Man",
                    "sex",     2L,                          "Woman",
                    "pss",     NA,       "How often do you use PS?",
                    "pss",     1L,                          "Daily",
                    "pss",     2L,         "several times per week",
                    "pss",     3L,                  "once per week",
                    "pss",     4L,         "several time per month",
                    "pss",     5L,                          "Rarly"
           )

I am wondering is there any way that I can combine these file together to have a one labelled dataframe like SPSS’s style format (dbl+lbl format). I know labelled package which can add a value label to a non labelled vector, like this example:

v <- labelled::labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, maybe = 2, no = 3))

I was hoping there is a better/faster way than adding label to each variable one by one.

like image 541
DanG Avatar asked Dec 06 '22 08:12

DanG


1 Answers

Another imap_dfc solution:

library(tidyverse)

df %>% imap_dfc(~{ 
                  label[label$variable==.y,c('label','value')] %>%
                  deframe() %>% # to named vector
                  haven::labelled(.x,.)
                 })

# A tibble: 10 x 7
          id  House_member       dob age_quota           work       sex                        pss
   <int+lbl>     <int+lbl> <int+lbl> <int+lbl>      <int+lbl> <dbl+lbl>                  <dbl+lbl>
 1         1 4 [4 persons]      1983 2 [15-19] 2 [part time]  1 [Man]   1 [Daily]                 
 2         2 1 [1 person]       1940 7 [70 +]  2 [part time]  1 [Man]   2 [several times per week]
 3         3 2 [2 persons]      1951 5 [40-49] 6 [unemployed] 1 [Man]   1 [Daily]                 
 4         4 4 [4 persons]      1965 2 [15-19] 2 [part time]  1 [Man]   4 [several time per month]
 5         5 3 [3 persons]      1965 2 [15-19] 3 [retired]    1 [Man]   1 [Daily]                 
 6         6 1 [1 person]       1951 3 [20-29] 1 [full time]  1 [Man]   3 [once per week]         
 7         7 1 [1 person]       1955 1 [10-14] 1 [full time]  1 [Man]   3 [once per week]         
 8         8 4 [4 persons]      1982 2 [15-19] 2 [part time]  2 [Woman] 5 [Rarly]                 
 9         9 2 [2 persons]      1990 2 [15-19] 4 [student]    2 [Woman] 3 [once per week]         
10        10 2 [2 persons]      1953 3 [20-29] 2 [part time]  2 [Woman] 4 [several time per month]

Used tibble::deframe and haven::labelled which are included in tidyverse

Speed comparison after replacing filter/select by direct access to label:

Waldi <- function() {
df %>% imap_dfc(~{ 
    label[label$variable==.y,c('label','value')] %>%
      deframe() %>% # to named vector
      haven::labelled(.x,.)})}

Waldi_old <- function() {   
    df %>% imap_dfc(~{ 
      label %>% filter(variable==.y) %>%
        select(label, value) %>%
        deframe() %>% # to named vector
        haven::labelled(.x,.)
    })}

#EDIT : Included TIC33() for-loop solution

microbenchmark::microbenchmark(TIC3(),Waldi(),Anil(),TIC1(),Waldi_old(),Sinh())
Unit: microseconds
        expr     min       lq      mean   median       uq     max neval   cld
      TIC3()   688.0   871.80   982.280   920.95  1005.55  1801.6   100 a    
     Waldi()  1345.5  1543.60  1804.758  1635.45  1893.75  4306.8   100  b   
      Anil()  4006.8  4476.65  5188.519  4862.95  5439.10 10163.6   100   c  
      TIC1()  3898.2  4278.80  5009.927  4774.95  5277.05 12916.2   100   c  
 Waldi_old() 18712.3 20091.75 21756.140 20609.35 22169.75 33359.8   100    d 
      Sinh() 22730.9 24093.45 25931.412 24946.00 26614.00 38735.3   100     e
like image 77
Waldi Avatar answered Jan 17 '23 21:01

Waldi