I have a dataframe where I want to separate a column that contains month and year:
library(tidyverse)
df <- data.frame(
month_year = c("Januar / Janvier 1990", "Februar / Février 1990","März / Mars 1990")
)
# df
# month_year
# 1 Januar / Janvier 1990
# 2 Februar / Février 1990
# 3 März / Mars 1990
The following works, but seems rather clunky:
df %>%
separate(month_year, c("month","nothing","nothing2", "year"), sep = " ") %>%
select(-starts_with("nothing"))
# month year
# 1 Januar 1990
# 2 Februar 1990
# 3 März 1990
Is there a more concise option to achieve the same result?
1) separate Use NA to omit the unwanted field like this:
library(tidyr)
df %>% separate(month_year, c("month", NA, "year"))
## month year
## 1 Januar 1990
## 2 Februar 1990
## 3 März 1990
@Otto pointed out that this has problems with UTF-8 input. If that is your situation, add one of the sep= values shown below. separate uses a default of "[^[:alnum:]]+", which does not handle UTF-8, but we can specify either of these instead:
"[^\\p{L}\\d]+". This replaces "[:alnum:]" with "\\p{L}", which is any letter in any language, and "\\d", which is any digit.
"(*UCP)[^[:alnum:]]+". This uses a Unicode specifier as a prefix.
This shows an example. First we create an input df2 which exhibits the problem, and then we use one of the above two sep values with it.
df <- data.frame(
month_year = c("Januar / Janvier 1990", "Februar / Février 1990","März / Mars 1990"))
df2 <- df %>% mutate(month_year = iconv(month_year, to = "UTF8"))
df2 %>% separate(month_year, c("month", NA, "year"), sep = "[^\\p{L}\\d]+")
## month year
## 1 Januar 1990
## 2 Februar 1990
## 3 März 1990
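To compare the two sep values side by side without depending on tidyr, here is a minimal base R sketch. It assumes the patterns are handed to a PCRE-based function (strsplit with perl = TRUE); split_first_last is a hypothetical helper introduced only for this illustration, and the input is written with \u escapes so that R marks the strings as UTF-8.

```r
# Month strings with non-ASCII letters, written with \u escapes so that
# R marks them as UTF-8 regardless of the encoding of this file.
x <- c("Januar / Janvier 1990",
       "Februar / F\u00e9vrier 1990",
       "M\u00e4rz / Mars 1990")

# The two Unicode-aware separators discussed above.
sep_letters <- "[^\\p{L}\\d]+"        # anything that is not a letter or a digit
sep_ucp     <- "(*UCP)[^[:alnum:]]+"  # [:alnum:] with Unicode properties enabled

# split_first_last() is a hypothetical helper: split each string on `sep`
# using PCRE and keep the first and last pieces as month and year.
split_first_last <- function(x, sep) {
  parts <- strsplit(x, sep, perl = TRUE)
  data.frame(
    month = vapply(parts, function(p) p[1], character(1)),
    year  = as.integer(vapply(parts, function(p) p[length(p)], character(1)))
  )
}

res_letters <- split_first_last(x, sep_letters)
res_ucp     <- split_first_last(x, sep_ucp)
```

Both results should have the same month and year columns; if they differ on your platform, that points to a locale or encoding issue rather than to the patterns themselves.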
2) read.table Here is a base R solution:
read.table(text = df[[1]], col.names = c("month", NA, NA, "year"))[-(2:3)]
## month year
## 1 Januar 1990
## 2 Februar 1990
## 3 März 1990
3) read.pattern This picks out the desired fields using read.pattern. (\\w+) captures the first word and (\\d+) captures the year.
library(gsubfn)
pat <- "(\\w+).* (\\d+)"
read.pattern(text = df[[1]], pattern = pat, col.names = c("month", "year"))
## month year
## 1 Januar 1990
## 2 Februar 1990
## 3 März 1990
strcapture("^(.*)\\s+/.*\\s+([^\\s]+)$", df$month_year, proto = c(month="", year=1L))
# month year
# 1 Januar 1990
# 2 Februar 1990
# 3 März 1990
Perhaps a little clunky:
setNames(do.call(rbind.data.frame,
lapply(strsplit(df$month_year, "\\s+"), function(z) z[c(1, length(z))])),
c("month", "year"))
An ever-so-slight reduction of your code, with a different regex:
library(dplyr)
library(tidyr)
df %>%
separate(month_year, c("month", "ign", "year"), "[ /]+") %>%
select(-ign)
or
df %>%
mutate(month_year = gsub("/.* ", "", month_year)) %>%
separate(month_year, c("month", "year"), " ")
We could use word from the stringr package:
library(dplyr)
library(stringr)
df %>%
mutate(month = word(month_year, 1),
year = word(month_year, 4), .keep="unused")
month year
1 Januar 1990
2 Februar 1990
3 März 1990
Try the following base R code with read.table + gsub:
read.table(
text = c(names(df), gsub("\\s+.*\\s+", "_", df$month_year)),
sep = "_",
header = TRUE
)
which gives
month year
1 Januar 1990
2 Februar 1990
3 März 1990
Tidyverse + stringr
library(stringr)
df %>%
  mutate(year = as.numeric(str_extract(month_year, '\\d+')),
         month = str_extract(month_year, '[^ /]+'))
month_year year month
1 Januar / Janvier 1990 1990 Januar
2 Februar / Février 1990 1990 Februar
3 März / Mars 1990 1990 März
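'\\d+' captures the first run of digits (the year); '[^ /]+' captures the first run of characters that are neither a space nor a slash, i.e. the month name before the first occurrence of " /".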
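The same "first match only" idea can be sketched in base R, assuming regexpr and regmatches stand in for str_extract. first_match is a hypothetical helper written for this illustration, and [0-9]+ is used in place of \\d+ to stay within plain POSIX syntax.

```r
x <- c("Januar / Janvier 1990",
       "Februar / F\u00e9vrier 1990",
       "M\u00e4rz / Mars 1990")

# first_match() is a hypothetical helper: regexpr() locates only the first
# match in each string, and regmatches() extracts the matched text.
# Strings without a match are returned as NA.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

month <- first_match(x, "[^ /]+")             # text before the first space or slash
year  <- as.numeric(first_match(x, "[0-9]+")) # first run of digits
```

Because only the first match is taken, the French month name after the slash is ignored without any extra cleanup.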