R - Handling multiple values as one string in a single variable

Tags:

r

In a data.frame, I have a categorical variable for the language of a text. While most texts are in only one language, some are in several. In my data, these appear in the same column, separated by commas:

text = c("Text1", "Text2", "Text3")
lang = c("fr", "en", "fr,en")
d = data.frame(text, lang)

Visually:

   text  lang
1 Text1    fr
2 Text2    en
3 Text3 fr,en

I'd like to plot the number of texts in each language, with Text3 being counted both in fr and in en.

I found how to split, with:

d$lang <- strsplit(d$lang, ",")

But then I can't find a way to plot it correctly, e.g. with a qplot barplot like this one:

qplot(lang, data=d)

Am I doing it right? Is there a better approach?

iNyar asked May 02 '15 00:05


3 Answers

You could try:

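library(splitstackshape)
library(ggplot2)  # provides qplot()
# Split lang on "," and stack the pieces into one row per language
dl <- cSplit(d, "lang", ",", "long")
qplot(lang, data = dl)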
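
If you would rather not add a package, the same long reshaping can be sketched in base R. This is only an illustration of the idea, not what cSplit does internally; the names parts, d_long and lang_counts are made up for the example, and it assumes lang is stored as character rather than factor.

```r
# Same example data as in the question; stringsAsFactors = FALSE keeps lang as character
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "en", "fr,en"),
  stringsAsFactors = FALSE
)

# Split each lang entry into its individual codes (a list, one element per row)
parts <- strsplit(d$lang, ",", fixed = TRUE)

# Repeat each text once per language it contains: one row per (text, language) pair
d_long <- data.frame(
  text = rep(d$text, lengths(parts)),
  lang = unlist(parts),
  stringsAsFactors = FALSE
)

# Count texts per language and draw a base-graphics bar plot
lang_counts <- table(d_long$lang)
barplot(lang_counts)
```

Because Text3 contributes one row to fr and one to en, it is counted in both bars.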

Steven Beaupré answered Nov 09 '22 18:11


Unless you follow the suggestion in user20650's comment, you probably can't avoid restructuring your data, and how you restructure it depends on how the values happen to be stored. For example, if you know that the languages are represented by distinct two-character codes (so that no code other than "fr" contains the sequence "fr"), you can create a new boolean column for each language by searching for its code in the comma-separated string:

# Data
text = c("Text1", "Text2", "Text3", "Text4", "Text5")
lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d = data.frame(text, lang, stringsAsFactors = FALSE)

# Get a vector of the languages that exist
languages <- unique(unlist(strsplit(d$lang, ",")))

# Create a new column for each language
for (language in languages) d[[language]] <- grepl(language, d$lang)

# An example bar-plot
barplot(colSums(d[, -c(1, 2)]))
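
If you cannot guarantee that no code is a substring of another, a variant that compares whole codes may be safer. The following is a sketch under that assumption, not part of the approach above; parts and lang_counts are names introduced only for this example.

```r
# Example data from the answer above
d <- data.frame(
  text = c("Text1", "Text2", "Text3", "Text4", "Text5"),
  lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en"),
  stringsAsFactors = FALSE
)

# Split once, then reuse the pieces for both the language list and the matching
parts <- strsplit(d$lang, ",", fixed = TRUE)
languages <- unique(unlist(parts))

# For each language, test whole-code membership in each row's split codes
for (language in languages) {
  d[[language]] <- vapply(parts, function(codes) language %in% codes, logical(1))
}

# Select the new columns by name instead of by position
lang_counts <- colSums(d[languages])
barplot(lang_counts)
```

Selecting the columns with d[languages] also avoids relying on the first two columns being text and lang.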

Richard Ambler answered Nov 09 '22 18:11


Consider tidyr::separate() to split and tidyr::gather() to make it long.

library(magrittr)  # provides the %>% pipe
max_languages <- 2L  # The max language count of any single text
language_positions <- paste0("language_", seq_len(max_languages))

d %>%
  tidyr::separate("lang", language_positions, sep=",", extra="merge", fill="right") %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

The resulting long dataset is:

   text    ordinal language_name
1 Text1 language_1            fr
2 Text2 language_1            en
3 Text3 language_1            fr
4 Text3 language_2            en

If you want to break it into two smaller steps, separate() first creates a wide dataset,

> d_wide <- d %>%
+   tidyr::separate("lang", language_positions, sep=",", extra="merge")
> d_wide
   text language_1 language_2
1 Text1         fr       <NA>
2 Text2         en       <NA>
3 Text3         fr         en

...and then gather() converts it to tall.

d_long <- d_wide %>%
  tidyr::gather_("ordinal", "language_name", language_positions) %>%
  dplyr::filter(!is.na(language_name))

For other reasons, I suggest adding stringsAsFactors=FALSE when you define d, although tidyr's separate functions don't seem to mind either way. The qplot call can remain the same: qplot(language_name, data=d_long).
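
As a shorter alternative, more recent tidyr releases provide separate_rows(), which goes straight to the long form without needing to know the maximum number of languages in advance. This is a minimal sketch that assumes a tidyr version that includes separate_rows(); lang_counts is a name made up for the example.

```r
library(tidyr)

# Same example data as in the question
d <- data.frame(
  text = c("Text1", "Text2", "Text3"),
  lang = c("fr", "en", "fr,en"),
  stringsAsFactors = FALSE
)

# separate_rows() splits the delimited column and emits one row per piece
d_long <- separate_rows(d, lang, sep = ",")

# Count of texts per language
lang_counts <- table(d_long$lang)
```

Here the column keeps its original name, so the plot call would be qplot(lang, data=d_long).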

wibeasley answered Nov 09 '22 19:11