
Extract the labels attribute from "labeled" tibble columns from a haven import from Stata

Hadley Wickham's haven package, applied to a Stata file, returns a tibble with many columns of class "labelled". You can see these with str(), e.g.:

$ MSACMSZ    :Class 'labelled'  atomic [1:8491861] NA NA NA NA NA NA NA NA NA NA ...
  .. ..- attr(*, "label")= chr "metropolitan area size (cmsa/msa)"
  .. ..- attr(*, "labels")= Named int [1:7] 0 1 2 3 4 5 6
  .. .. ..- attr(*, "names")= chr [1:7] "not identified or nonmetropolitan" "100,000 - 249,999" "250,000 - 499,999" "500,000 - 999,999" ...

It would be nice if I could simply extract all these labeled vectors to factors, but I have compared the length of the labels attribute to the number of unique values in each vector, and it is sometimes longer and sometimes shorter. So I think I need to look at all of them and decide how to handle each one individually.

So I would like to extract the values of the labels attribute to a list. However, this function:

labels93 <- lapply(cps_00093.df, function(x){attr(X, which="labels", exact=TRUE)})

returns NULL for all variables.

Is this a tibble vs data frame problem? How do I extract these attributes from the tibble columns into a list?

Note that the labels vector is named, and I need both the labels and the names.

As per @Hack-R's request here is a tiny snippet of my data as converted by dput (which I had never used before). I applied this code:

filter(cps_00093.df, YEAR==2015) %>%
  sample_n(10)  %>%
  select(HHTENURE, HHINTYPE) -> tiny
dput(tiny, file = "tiny")

to produce the file tiny. Hey! That was easy! I thought it would be hard to break off a piece this small.

Opening tiny with Notepad++, this is what I found:

structure(list(HHTENURE = structure(c(2L, 1L, 1L, 2L, 1L, 1L, 
1L, 2L, 1L, 1L), labels = structure(c(0L, 1L, 2L, 3L, 6L, 7L), .Names = c("niu", 
"owned or being bought", "rented for cash", "occupied without payment of cash rent", 
"refused", "don't know")), class = "labelled"), HHINTYPE = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), labels = structure(1:3, .Names = c("interview", 
"type a non-interview", "type b/c non-interview")), class = "labelled")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("HHTENURE", 
"HHINTYPE"))

I suspect this could be made more readable with a little spacing, but I did not want to muck with it for fear of accidentally destroying relevant information.

andrewH asked Sep 24 '16 01:09


2 Answers

The original question asks how "to extract the values of the labels attribute to a list." A solution follows, assuming some_df was imported via haven and its columns carry label attributes. Note that this code reads the "label" attribute (the one-line description of each variable), not the named "labels" attribute (the value labels). Update: I've now added a way to extract a label vector with the package sjlabelled.

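library(purrr)
n <- ncol(some_df)
# variable labels (the "label" attribute), one list element per column
labels_list <- map(1:n, function(x) attr(some_df[[x]], "label") )

# if a vector of character strings is preferable
# (%||% turns a missing label into NA so map_chr() does not fail on NULL)
labels_vector <- map_chr(1:n, function(x) attr(some_df[[x]], "label") %||% NA_character_ )

# to make a simple codebook
library(knitr)  # kable() is provided by knitr
variable_name <- names(some_df)
data.frame(variable_name, description = labels_vector) %>%
  kable(format = 'markdown')

# UPDATE: another approach with package sjlabelled
library(sjlabelled)
sjlabelled::get_label(some_df)   # variable labels ("label" attribute)
sjlabelled::get_labels(some_df)  # value labels ("labels" attribute)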
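
The purrr code above reads the variable label. For the value labels the question actually asks about (the named "labels" attribute), here is a minimal base-R sketch. The data frame toy_df and its columns are made up for illustration and only stand in for a haven import. It may also be worth noting that the function in the question names its argument x but reads X; R is case-sensitive, so that is a likely (though unconfirmed) reason for the NULL results, and lapply() treats a tibble like any other data frame.

```r
# Hypothetical stand-in for a haven import: one plain column, one labelled column.
toy_df <- data.frame(id = 1:4)
toy_df$tenure <- structure(
  c(1L, 2L, 1L, 3L),
  label  = "housing tenure",
  labels = c("owned" = 1L, "rented" = 2L, "no cash rent" = 3L),
  class  = "labelled"
)

# Pull the "labels" attribute from every column; unlabelled columns give NULL.
value_labels <- lapply(toy_df, function(col) attr(col, "labels", exact = TRUE))

# Keep only the columns that actually carry value labels.
value_labels <- Filter(Negate(is.null), value_labels)

# Each element is a named vector: the values are the codes, the names the meanings.
codebook <- lapply(value_labels, function(l) {
  data.frame(code = unname(l), meaning = names(l), stringsAsFactors = FALSE)
})
```

Keeping the result as a list of named vectors preserves both the codes and their names, which is what the question asks for; the codebook step is optional.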
Omar Wasow answered Nov 15 '22 18:11


I'm going to take a go at answering this one, though my code isn't very pretty.

First I make a function to extract a named attribute from a single column.

ColAttr <- function(x, attrC, ifIsNull) {
  # Returns column attribute named in attrC, if present, else ifIsNull.
  atr <- attr(x, attrC, exact = TRUE)
  atr <- if (is.null(atr)) {ifIsNull} else {atr}
  atr
}

Then a function to lapply it to all the columns:

AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
  lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}

Finally I run it for each attribute.

stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)

labels93 <- AtribLst(cps_00093.df, attrC="labels", isNullC=NA)
labels93 <- labels93[!is.na(labels93)]

All the columns have a "label" attribute, but only some are of class "labelled" and so have a "labels" attribute. The labels attribute is a named vector: its values match the codes that appear in the data, and its names tell you what those codes signify.
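
Since the labels and the observed values do not always line up, it can help to check each column before turning it into a factor. The following is a rough base-R sketch using the HHTENURE column from the dput output in the question; the helper names (labs, vals, unlabelled_values, unused_labels, tenure_factor) are mine. haven also provides as_factor() for labelled vectors, which may be the more convenient route when haven is loaded.

```r
# HHTENURE column from the question's dput output, rebuilt with base R.
hhtenure <- structure(
  c(2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L),
  labels = c("niu" = 0L, "owned or being bought" = 1L, "rented for cash" = 2L,
             "occupied without payment of cash rent" = 3L,
             "refused" = 6L, "don't know" = 7L),
  class = "labelled"
)

labs <- attr(hhtenure, "labels", exact = TRUE)  # named vector of codes
raw  <- as.vector(unclass(hhtenure))            # bare integer codes, attributes dropped
vals <- unique(raw)

unlabelled_values <- setdiff(vals, labs)    # observed codes that have no label
unused_labels     <- labs[!labs %in% vals]  # labels that never occur in the data

# Convert to a factor; any observed code without a label would become NA here,
# so inspect unlabelled_values first.
tenure_factor <- factor(raw, levels = unname(labs), labels = names(labs))
table(tenure_factor)
```

If unlabelled_values is non-empty, the column probably mixes coded categories with genuine numbers and should be handled by hand rather than converted wholesale.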

andrewH answered Nov 15 '22 17:11