I believe my question is as much about best practice as it is about tidying messy data, so here goes.
Below is an excerpt of the dataframe lang.df
, a school-wide dataset of students. The column, Langauge.Home
, indicates parent responses to the question:
"What languages do you speak at home?"
> lang.df
Nationality Language.Home
1 HK Mandarin
2 German Mandarin/English/German
3 Saudi Arabic
4 Norwegian Norwegian
5 UK English
6 HK Mandarin/ Min Nan dialect
7 Australian Mandarin
8 HK Mandarin
9 Brazilian Portuguese/English
10 Indian Hindi/English
It is obvious to me that this is a poor way to get this information as well as a poor way to store it, but my job is to use the data I have.
Outcome
I want to explore the effect that certain home languages might have on achievement. What I need is the ability to subset the by a single langauge spoken at home (e.g. the students who speak English at home).
To do so, it seems I have to separate the Language@home
column into three ("language.home1", "language.home2", "language.home3"
) using dplyr's separate()
. The create a new column for each unique value (i.e. language) in the new columns I created
Process
Below is my attempt at doing the above, efficiently
library(dplyr)
library(tidyr)
#separate Langauge.Home into three new columns
lang.df <- lang.df %>% separate(Language.Home,
c("language.home1", "language.home2", "language.home3"),
sep = "/",
remove = FALSE)
#find distinct languages & remove NAs
langs <- unique(c(lang.df$language.home1,
lang.df$language.home2,
lang.df$language.home3))
langs <- langs[!is.na(langs)]
#create boolean column for each unique language in new columns
for (i in langs) {
lang.df[,paste(i)] <- grepl(i, lang.df$Language.Home)
}
Questions
tidyr
docs and here on SO but couldn't find anything about it. Thanks in advance for your help. I've only been using R on-and-off for about a year now and this is my first SO post. Give me as much feedback as you can!
Data
lang.df <- structure(list(Nationality = structure(c(4L, 3L, 7L, 6L, 8L,
4L, 1L, 4L, 2L, 5L), .Label = c("Australian", "Brazilian", "German",
"HK", "Indian", "Norwegian", "Saudi", "UK"), class = "factor"),
`Language.Home` = structure(c(4L, 6L, 1L, 7L, 2L, 5L, 4L,
4L, 8L, 3L), .Label = c("Arabic", "English", "Hindi/English",
"Mandarin", "Mandarin/ Min Nan dialect", "Mandarin/English/German",
"Norwegian", "Portuguese/English"), class = "factor")), row.names = c(NA,
10L), .Names = c("Nationality", "Language.Home"), class = "data.frame")
We can use cSplit
from splitstackshape
to split the 'Language.Home' using the delimiter /
and to convert it to long
format.
library(splitstackshape)
library(data.table)
dt <- cSplit(lang.df, "Language.Home", "/", "long")
Then, use dcast
to convert from 'long' to 'wide'
dcast(dt, Nationality~Language.Home, fun.aggregate = function(x) length(x)>0)
NOTE: There are duplicate 'Nationality' rows, so the above will group the common elements together. It may be better to group it together.
If we need to have logical columns based on each row (irrespective of similar 'Nationality')
dcast(cSplit(setDT(lang.df, keep.rownames=TRUE), "Language.Home",
"/", "long"), rn +Nationality ~Language.Home, function(x) length(x) >0)
Or another option is mtabulate
from qdapTools
after splitting the 'Language.Home' by /
.
library(qdapTools)
cbind(lang.df, !!(mtabulate(setNames(strsplit(as.character(lang.df$Language.Home),
"/"), lang.df$Nationality))))
# Nationality Language.Home Min Nan dialect Arabic English German Hindi Mandarin Norwegian Portuguese
#1 HK Mandarin FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#2 German Mandarin/English/German FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE
#3 Saudi Arabic FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#4 Norwegian Norwegian FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#5 UK English FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#6 HK Mandarin/ Min Nan dialect TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#7 Australian Mandarin FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#8 HK Mandarin FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#9 Brazilian Portuguese/English FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#10 Indian Hindi/English FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
One easy way to get to long form is with tidyr::unnest()
:
library(dplyr)
library(tidyr)
library(stringr)
lang.df %>%
mutate(Language.Home = str_split(Language.Home, "/")) %>%
unnest()
#> Nationality Language.Home
#> 1 HK Mandarin
#> 2 German Mandarin
#> 3 German English
#> 4 German German
#> 5 Saudi Arabic
#> 6 Norwegian Norwegian
#> 7 UK English
#> 8 HK Mandarin
#> 9 HK Min Nan dialect
#> 10 Australian Mandarin
#> 11 HK Mandarin
#> 12 Brazilian Portuguese
#> 13 Brazilian English
#> 14 Indian Hindi
#> 15 Indian English
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With