Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Separate a column in dataframe where each observation can have multiple concurrent values

Tags:

r

dplyr

tidyr

I believe my question is as much about best practice as it is about tidying messy data, so here goes.

Below is an excerpt of the dataframe lang.df, a school-wide dataset of students. The column, Langauge.Home, indicates parent responses to the question: "What languages do you speak at home?"

> lang.df
   Nationality             Language.Home
1           HK                  Mandarin
2       German   Mandarin/English/German
3        Saudi                    Arabic
4    Norwegian                 Norwegian
5           UK                   English
6           HK Mandarin/ Min Nan dialect
7   Australian                  Mandarin
8           HK                  Mandarin
9    Brazilian        Portuguese/English
10      Indian             Hindi/English

It is obvious to me that this is a poor way to get this information as well as a poor way to store it, but my job is to use the data I have.

Outcome

I want to explore the effect that certain home languages might have on achievement. What I need is the ability to subset the by a single langauge spoken at home (e.g. the students who speak English at home).

To do so, it seems I have to separate the Language@home column into three ("language.home1", "language.home2", "language.home3") using dplyr's separate(). The create a new column for each unique value (i.e. language) in the new columns I created

Process

Below is my attempt at doing the above, efficiently

library(dplyr)
library(tidyr)

#separate Langauge.Home into three new columns
lang.df <- lang.df %>% separate(Language.Home,
        c("language.home1", "language.home2", "language.home3"),
        sep = "/",
        remove = FALSE)

#find distinct languages & remove NAs
langs <- unique(c(lang.df$language.home1,
    lang.df$language.home2,
    lang.df$language.home3))
langs <- langs[!is.na(langs)]

#create boolean column for each unique language in new columns
for (i in langs) {
lang.df[,paste(i)] <- grepl(i, lang.df$Language.Home) 
}

Questions

  1. What is this situation called? I tried to look at the tidyr docs and here on SO but couldn't find anything about it.
  2. Is there a more elegant way to code the transformations than what I have done?
  3. What is best practice for
    • Getting this data (to change future data entry process)
    • Dealing with this situation from a statistical perspective

Thanks in advance for your help. I've only been using R on-and-off for about a year now and this is my first SO post. Give me as much feedback as you can!

Data

lang.df <- structure(list(Nationality = structure(c(4L, 3L, 7L, 6L, 8L, 
4L, 1L, 4L, 2L, 5L), .Label = c("Australian", "Brazilian", "German", 
"HK", "Indian", "Norwegian", "Saudi", "UK"), class = "factor"), 
`Language.Home` = structure(c(4L, 6L, 1L, 7L, 2L, 5L, 4L, 
4L, 8L, 3L), .Label = c("Arabic", "English", "Hindi/English", 
"Mandarin", "Mandarin/ Min Nan dialect", "Mandarin/English/German", 
"Norwegian", "Portuguese/English"), class = "factor")), row.names = c(NA, 
10L), .Names = c("Nationality", "Language.Home"), class = "data.frame")
like image 367
Anders Swanson Avatar asked Dec 25 '22 06:12

Anders Swanson


2 Answers

We can use cSplit from splitstackshape to split the 'Language.Home' using the delimiter / and to convert it to long format.

library(splitstackshape)
library(data.table)
dt <- cSplit(lang.df, "Language.Home", "/", "long")

Then, use dcast to convert from 'long' to 'wide'

dcast(dt, Nationality~Language.Home, fun.aggregate = function(x) length(x)>0)

NOTE: There are duplicate 'Nationality' rows, so the above will group the common elements together. It may be better to group it together.

If we need to have logical columns based on each row (irrespective of similar 'Nationality')

 dcast(cSplit(setDT(lang.df, keep.rownames=TRUE), "Language.Home",
   "/", "long"), rn +Nationality ~Language.Home, function(x) length(x) >0)

Or another option is mtabulate from qdapTools after splitting the 'Language.Home' by /.

 library(qdapTools)
 cbind(lang.df, !!(mtabulate(setNames(strsplit(as.character(lang.df$Language.Home), 
                 "/"), lang.df$Nationality))))
#   Nationality             Language.Home  Min Nan dialect Arabic English German Hindi Mandarin Norwegian Portuguese
#1           HK                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#2       German   Mandarin/English/German            FALSE  FALSE    TRUE   TRUE FALSE     TRUE     FALSE      FALSE
#3        Saudi                    Arabic            FALSE   TRUE   FALSE  FALSE FALSE    FALSE     FALSE      FALSE
#4    Norwegian                 Norwegian            FALSE  FALSE   FALSE  FALSE FALSE    FALSE      TRUE      FALSE
#5           UK                   English            FALSE  FALSE    TRUE  FALSE FALSE    FALSE     FALSE      FALSE
#6           HK Mandarin/ Min Nan dialect             TRUE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#7   Australian                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#8           HK                  Mandarin            FALSE  FALSE   FALSE  FALSE FALSE     TRUE     FALSE      FALSE
#9    Brazilian        Portuguese/English            FALSE  FALSE    TRUE  FALSE FALSE    FALSE     FALSE       TRUE
#10      Indian             Hindi/English            FALSE  FALSE    TRUE  FALSE  TRUE    FALSE     FALSE      FALSE
like image 135
akrun Avatar answered Dec 26 '22 20:12

akrun


One easy way to get to long form is with tidyr::unnest():

library(dplyr)
library(tidyr)
library(stringr)

lang.df %>% 
  mutate(Language.Home = str_split(Language.Home, "/")) %>% 
  unnest()
#>    Nationality    Language.Home
#> 1           HK         Mandarin
#> 2       German         Mandarin
#> 3       German          English
#> 4       German           German
#> 5        Saudi           Arabic
#> 6    Norwegian        Norwegian
#> 7           UK          English
#> 8           HK         Mandarin
#> 9           HK  Min Nan dialect
#> 10  Australian         Mandarin
#> 11          HK         Mandarin
#> 12   Brazilian       Portuguese
#> 13   Brazilian          English
#> 14      Indian            Hindi
#> 15      Indian          English
like image 41
hadley Avatar answered Dec 26 '22 20:12

hadley