In a data.frame, I have a categorical variable for the language of a text. While most texts are in only one language, some are in several. In my data, the languages appear in the same column, separated by commas:
text = c("Text1", "Text2", "Text3")
lang = c("fr", "en", "fr,en")
d = data.frame(text, lang)
Visually:
text lang
1 Text1 fr
2 Text2 en
3 Text3 fr,en
I'd like to plot the number of texts in each language, with Text3 being counted both in fr and in en.
I found how to split, with:
d$lang <- strsplit(d$lang, ",")
But then I can't find a way to plot it correctly, e.g. with a qplot barplot like this one:
qplot(lang, data=d)
Am I doing it right? Is there a better approach?
You could try:
library(splitstackshape)
library(ggplot2) # provides qplot()
dl <- cSplit(d, "lang", ",", "long")
qplot(lang, data = dl)
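For comparison, the same long reshaping can be sketched in base R without extra packages. This is only an illustration of the idea, not what cSplit does internally; the names parts, dl_base and counts are made up for the example, and stringsAsFactors = FALSE is assumed so that strsplit() receives a character vector.

```r
# Example data from the question (strings kept as character, not factor)
text <- c("Text1", "Text2", "Text3")
lang <- c("fr", "en", "fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Split each entry on commas: a list with one character vector per text
parts <- strsplit(d$lang, ",", fixed = TRUE)

# Repeat each text once per language it contains,
# giving one row per (text, language) pair
dl_base <- data.frame(
  text = rep(d$text, lengths(parts)),
  lang = unlist(parts),
  stringsAsFactors = FALSE
)

# Count texts per language and draw a base-graphics bar plot
counts <- table(dl_base$lang)
barplot(counts)
```

Because dl_base has one row per text-language pair, qplot(lang, data = dl_base) should also work on it.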
Without following the suggestion in user20650's comment, you probably won't be able to get away without restructuring your data, and how you do that cannot be blind to the way the data is arbitrarily stored. For example, if you know that the languages are represented by distinct, two-character strings (so that, for example, any language representation that isn't "fr" does not contain the sequence "fr"), you could create new boolean columns based on searches for the codes in the comma-separated representation. For example:
# Data
text = c("Text1", "Text2", "Text3", "Text4", "Text5")
lang = c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d = data.frame(text, lang, stringsAsFactors = FALSE)
# Get a vector of the languages that exist
languages <- unique(unlist(strsplit(d$lang, ",")))
# Create a new column for each language
for (language in languages) d[[language]] <- grepl(language, d$lang)
# An example bar-plot
barplot(colSums(d[, -c(1, 2)]))
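If the codes cannot be assumed to be non-overlapping, a variant that compares whole codes rather than substrings might look like the sketch below. It reuses the example data above; parts and counts are names introduced only for this illustration.

```r
# Data
text <- c("Text1", "Text2", "Text3", "Text4", "Text5")
lang <- c("fr", "en", "fr,en", "sp,fr", "sp,fr,en")
d <- data.frame(text, lang, stringsAsFactors = FALSE)

# Split once and reuse the pieces for every language
parts <- strsplit(d$lang, ",", fixed = TRUE)
languages <- unique(unlist(parts))

# One logical column per language: TRUE only when the code is
# exactly one of the text's comma-separated pieces
for (language in languages) {
  d[[language]] <- vapply(parts, function(p) language %in% p, logical(1))
}

# Number of texts per language
counts <- colSums(d[, languages])
barplot(counts)
```

Matching with %in% on the split pieces avoids false positives when one code happens to be contained in another.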
Consider tidyr::separate() to split and tidyr::gather() to make it long.
library(magrittr)
ceiling <- 2L #The max language count of any single text
language_positions <- paste0("language_", seq_len(ceiling))
d %>%
tidyr::separate("lang", language_positions, sep=",", extra="merge") %>%
tidyr::gather_("ordinal", "language_name", language_positions) %>%
dplyr::filter(!is.na(language_name))
The resulting long dataset is:
text ordinal language_name
1 Text1 language_1 fr
2 Text2 language_1 en
3 Text3 language_1 fr
4 Text3 language_2 en
If you want to break it into two smaller steps, separate() first creates a wide dataset,
> d_wide <- d %>%
+ tidyr::separate("lang", language_positions, sep=",", extra="merge")
> d_wide
text language_1 language_2
1 Text1 fr <NA>
2 Text2 en <NA>
3 Text3 fr en
...and then gather() converts it to tall.
d_long <- d_wide %>%
tidyr::gather_("ordinal", "language_name", language_positions) %>%
dplyr::filter(!is.na(language_name))
For other reasons, I suggest adding , stringsAsFactors=F when you define d, but tidyr's separate functions don't seem to mind. The qplot call can remain the same: qplot(language_name, data=d_long).
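The hard-coded ceiling of 2L could also be derived from the data, so that texts with three or more languages are not merged into the last column. A minimal base-R sketch, where pieces_per_text and max_languages are names chosen just for this example:

```r
# Example language column from the question
lang <- c("fr", "en", "fr,en")

# Number of comma-separated pieces in each entry
pieces_per_text <- lengths(strsplit(lang, ",", fixed = TRUE))

# The largest count decides how many language_* columns are needed
max_languages <- max(pieces_per_text)
language_positions <- paste0("language_", seq_len(max_languages))
```

language_positions can then be passed to the separate and gather steps in place of the fixed two-column version.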