Efficiently converting labelled variables to factors

I'm struggling to find an efficient way to turn labelled variables into factors. The dataset I'm working with is available here: [https://www.dropbox.com/s/jhp780hd0ii3dnj/out.sav?dl=0][1]. It is an SPSS data file, a format I use because my colleagues do.

When I read in the data, every categorical variable in the file comes in with the "labelled" class rather than as a factor.

#Load libraries
library(haven)
library(tidyr)
library(dplyr)
library(ggplot2)
#Import
ford<-read_sav('~/your/path/name/out.sav')
#Structure
str(ford)
#Find Class
sapply(ford, class)

The first problem that I have is that ggplot2 doesn't know how to apply a scale to a labelled class.

#Tabulate stress within each income group
td<-ford %>%
select(income, stress) %>%
filter(!is.na(stress), !is.na(income))%>%
group_by(income, stress)%>%
summarize(Freq=n())%>%
mutate(Percent=(Freq/sum(Freq))*100)

#Draw plot
ggplot(td, aes(x=income, y=Percent, group=stress))+
#barplot
geom_bar(aes(fill=stress), stat='identity')

That can be solved quite nicely by wrapping the categorical variable 'income' in as_factor():

#Draw plot
ggplot(td, aes(x=as_factor(income), y=Percent, group=stress))+
#barplot
geom_bar(aes(fill=stress), stat='identity')

That works for one variable. However, if I'm doing exploratory research, I may be making a lot of plots with a lot of labelled variables, and that strikes me as quite a lot of extra typing.

The problem is magnified by the fact that when you gather a lot of variables to plot several crosstabs, you lose the value labels.

##Visualizations
test<-ford %>%
#The first two variables are the grouping variables for a series of cross tabs
select(ford, stress, resp_gender, immigrant2, education, property, commute, cars, religion) %>%
#Some renamings
rename(gender=resp_gender, educ=education, immigrant=immigrant2, relig=religion)%>%
#Melt all variables other than ford and stress
gather(category, value, -ford, -stress)%>%
#Filter out missings
filter(!is.na(stress), !is.na(ford), !is.na(value))%>%
#Group by all variables
group_by(category, value, ford, stress) %>%
#Summarize
summarize(freq=n())

#Show plots
ggplot(test, aes(x=as_factor(value), y=freq, group=as_factor(ford)))+
geom_bar(stat='identity', position='dodge', aes(fill=as_factor(ford)))+
facet_grid(~category, scales='free')

Now all of the value labels for the melted variables have disappeared. The only way I can see to prevent this is to call as_factor() on each labelled variable individually, turning it into a factor with the value labels as the factor levels. But, again, that is a lot of typing.

I guess my question is: what is the most efficient way to deal with the labelled class and turn such variables into factors, specifically for use with ggplot2?

spindoctor asked Sep 14 '25 11:09

1 Answer

It's been a while, and the answers are already there in the comments, but I'll post an answer using dplyr anyways.

library(haven)
library(dplyr)
library(tidyr)
library(ggplot2)

# Load Stata file and look at it
nlsw88 <- read_dta('http://www.stata-press.com/data/r15/nlsw88.dta')
head(nlsw88)

We see that there are some labelled variables. If we only want to convert specific variables, we can use mutate_at from dplyr.

# Convert specific variables to factor
# (as_factor is passed directly; funs() is deprecated in recent dplyr)
nlsw88 %>%
    mutate_at(
        vars('race'),
        as_factor
    ) %>%
    head()

Following Gregor's and aosmith's comments, we can also convert all labelled variables at once using the mutate_if function, testing for the labelled class. This will save you a lot of extra typing.

# Convert all labelled variables to factor
# (as_factor is passed directly; funs() is deprecated in recent dplyr)
nlsw88 %>%
    mutate_if(
        is.labelled,
        as_factor
    ) %>%
    head()

This can be used to create bar plots similar to what you described (although this particular plot might not make much sense):

nlsw88 %>%
    select(race, married, collgrad, union) %>%
    mutate_if(
        is.labelled,
        as_factor
    ) %>%
    gather(variable, category, -c(race, married)) %>%
    group_by(race, married, variable, category) %>%
    summarise(freq = n()) %>%
    filter(!is.na(category)) %>%
    ggplot(aes(x = category, y = freq)) +
    geom_bar(stat = 'identity', aes(fill=race)) +
    facet_grid(~married)
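
As a side note, gather() has since been superseded by pivot_longer() in tidyr 1.0.0 and later. Below is a rough sketch of the reshaping step in that style, again on hypothetical toy data rather than nlsw88. The key point is the same as above: convert the labelled columns to factors before reshaping, so that the long `category` column carries the value labels rather than the underlying codes.

```r
library(haven)
library(dplyr)
library(tidyr)

# Hypothetical toy data with three labelled columns (one has a missing value)
toy <- tibble(
    race     = labelled(c(1, 2, 1, 2), c(white = 1, black = 2)),
    collgrad = labelled(c(0, 1, 1, 0), c(`not college grad` = 0, `college grad` = 1)),
    union    = labelled(c(1, 0, NA, 1), c(nonunion = 0, union = 1))
)

toy_long <- toy %>%
    # Convert first, so the labels become factor levels
    mutate(across(where(is.labelled), as_factor)) %>%
    # Reshape everything except the grouping variable
    pivot_longer(-race, names_to = "variable", values_to = "category") %>%
    # Drop missing categories, then tabulate
    filter(!is.na(category)) %>%
    count(race, variable, category, name = "freq")

toy_long
```

The resulting `toy_long` can be passed to ggplot() in the same way as in the plot above.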
David answered Sep 17 '25 00:09