More concise option to `separate` a column in R (maybe through some RegEx)?

Question

I have a dataframe where I want to separate a column that contains month and year:

library(tidyverse)
df <- data.frame(
  month_year = c("Januar / Janvier 1990", "Februar / Février 1990","März / Mars 1990")
)

# df
#               month_year
# 1  Januar / Janvier 1990
# 2 Februar / Février 1990
# 3       März / Mars 1990

The following works, but seems quite a bit clunky:

df %>% 
  separate(month_year, c("month","nothing","nothing2", "year"), sep = " ") %>%
  select(-starts_with("nothing"))

#     month year
# 1  Januar 1990
# 2 Februar 1990
# 3    März 1990

Is there a more concise option to achieve the same result?

G. Grothendieck · Accepted Answer

1) separate Use NA to omit the unwanted field like this:

library(tidyr)

df %>% separate(month_year, c("month", NA, "year"))
##     month year
## 1  Januar 1990
## 2 Februar 1990
## 3    März 1990

@Otto pointed out that this has problems with UTF-8 input. If that is your situation, add the sep= value shown. separate uses a default of "[^[:alnum:]]+", which does not handle UTF-8, but we can specify either of these instead:

"[^\\p{L}\\d]+" . This replaces "[:alnum:]" with "\\p{L}", which is any letter in any language, and "\\d", which is any digit, or
"(*UCP)[^[:alnum:]]+", which uses a Unicode specifier as a prefix

This shows an example. First we create an input df2 which exhibits the problem and then we use one of the above two sep values with it.

library(dplyr)  # for mutate()

df <- data.frame(
  month_year = c("Januar / Janvier 1990", "Februar / Février 1990","März / Mars 1990"))
df2 <- df %>% mutate(month_year = iconv(month_year, to = "UTF8"))

df2 %>% separate(month_year, c("month", NA, "year"), sep = "[^\\p{L}\\d]+")
##     month year
## 1  Januar 1990
## 2 Februar 1990
## 3    März 1990
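
The second sep value listed above can also be checked without tidyr. The following is only a minimal sketch, assuming that base R's strsplit with perl = TRUE is an acceptable stand-in for separate; the names x, parts and res are made up for the illustration, and the accented letters are written as \u escapes so that the strings are UTF-8 regardless of the file encoding.

```r
# Sample input; \u escapes guarantee UTF-8 strings (assumption for illustration)
x <- c("Januar / Janvier 1990", "Februar / F\u00e9vrier 1990", "M\u00e4rz / Mars 1990")

# (*UCP) makes [:alnum:] follow Unicode properties, so accented letters
# are treated as letters and are not split on
parts <- strsplit(x, "(*UCP)[^[:alnum:]]+", perl = TRUE)

# Keep the first and the last piece of each row
res <- data.frame(
  month = vapply(parts, function(p) p[1], character(1)),
  year  = as.integer(vapply(parts, function(p) p[length(p)], character(1))),
  stringsAsFactors = FALSE
)
res
```

Taking the last piece rather than the third keeps the sketch independent of how many words sit between the month and the year.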

2) read.table Here is a base R solution:

read.table(text = df[[1]], col.names = c("month", NA, NA, "year"))[-(2:3)]
##     month year
## 1  Januar 1990
## 2 Februar 1990
## 3    März 1990

3) read.pattern This picks out the desired fields using read.pattern. (\w+) captures the first word and (\d+) captures the year.

library(gsubfn)

pat <- "(\\w+).* (\\d+)"
read.pattern(text = df[[1]], pattern = pat, col.names = c("month", "year"))
##     month year
## 1  Januar 1990
## 2 Februar 1990
## 3    März 1990

r2evans · Answer

base R

strcapture("^(.*)\\s+/.*\\s+([^\\s]+)$", df$month_year, proto = list(month="", year=1L))
#     month year
# 1  Januar 1990
# 2 Februar 1990
# 3    März 1990

Perhaps a little clunky:

setNames(do.call(rbind.data.frame,
    lapply(strsplit(df$month_year, "\\s+"), function(z) z[c(1, length(z))])),
  c("month", "year"))

dplyr

An ever-so-slight reduction of your code, with a different regex:

library(dplyr)
library(tidyr)  # separate() lives in tidyr
df %>%
  separate(month_year, c("month", "ign", "year"), "[ /]+") %>%
  select(-ign)

or

df %>%
  mutate(month_year = gsub("/.* ", "", month_year)) %>%
  separate(month_year, c("month", "year"), " ")

TarJae · Answer

We could use word from stringr package:

library(dplyr)
library(stringr)

df %>% 
  mutate(month = word(month_year, 1),
         year = word(month_year, 4), .keep="unused")

    month year
1  Januar 1990
2 Februar 1990
3    März 1990
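
The call above relies on the year always being the fourth word. As a small sketch under the assumption that the year is always the last word, word also accepts a negative position that counts from the end; the vector x and the variable names here are illustrative only.

```r
library(stringr)

# Sample input; \u escapes guarantee UTF-8 strings (assumption for illustration)
x <- c("Januar / Janvier 1990", "Februar / F\u00e9vrier 1990", "M\u00e4rz / Mars 1990")

month <- word(x, 1)    # first space-separated word
year  <- word(x, -1)   # last word, counted from the end

data.frame(month, year, stringsAsFactors = FALSE)
```

Note that year is still a character column here; wrap it in as.integer() if a number is needed.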

ThomasIsCoding · Answer

Try the following base R code with read.table + gsub

read.table(
  text = c(names(df), gsub("\\s+.*\\s+", "_", df$month_year)),
  sep = "_",
  header = TRUE
)

which gives

    month year
1  Januar 1990
2 Februar 1990
3    März 1990

Otto Kässi · Answer

Tidyverse + stringr

library(stringr)
df %>% mutate(year = as.numeric(str_extract(.$month_year, '\\d+'))) %>%
        mutate(month = str_extract(.$month_year, '[^ /]+') )
              month_year year   month
1  Januar / Janvier 1990 1990  Januar
2 Februar / Février 1990 1990 Februar
3       März / Mars 1990 1990    März

'\\d+' extracts the first run of digits (the year); '[^ /]+' extracts the first run of characters that are neither a space nor a slash, i.e. the month name before the first " / ".
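
To see what the two patterns pick out on their own, here is a rough base R equivalent. It is only a sketch: regexpr and regmatches are used in place of str_extract, on the assumption that every row contains a match, and x, month and year are illustrative names.

```r
# Sample input; \u escapes guarantee UTF-8 strings (assumption for illustration)
x <- c("Januar / Janvier 1990", "Februar / F\u00e9vrier 1990", "M\u00e4rz / Mars 1990")

# regexpr() locates only the first match in each string, like str_extract()
month <- regmatches(x, regexpr("[^ /]+", x))            # first run without space or slash
year  <- as.numeric(regmatches(x, regexpr("\\d+", x)))  # first run of digits

data.frame(month, year, stringsAsFactors = FALSE)
```

Unlike str_extract, regmatches silently drops rows without a match instead of returning NA, so this variant is only safe when every row has the expected shape.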
