Using regex and tidyr in R to split column variable on first instance of match

Tags: regex, r, tidyr

I'm trying to split a column in an R data frame whose values contain more than one space, but I want to split on only the first space. An example data frame:

df <- data.frame(game = c(1, 2, 3, 4, 5, 6), date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5", "Thursday Apr 6", "Friday Apr 7", "Saturday Apr 8"))

I'm trying to use tidyr to split the df 'date' column on just the first space so that the day is in its own column:

  game       day date
1    1    Monday  Apr 3
2    2   Tuesday  Apr 4
3    3 Wednesday  Apr 5
4    4  Thursday  Apr 6
5    5    Friday  Apr 7
6    6  Saturday  Apr 8

The above is the problem. The below is what I've tried and what is going wrong.

According to the tidyr documentation, the default value of 'sep' is 'a regular expression that matches any sequence of non-alphanumeric values.' So if I just do:

df %>% separate(date, c("day", "date"))

That splits on the space, but it splits on both spaces (e.g. the space after 'Monday' and the space after 'Apr' in 'Monday Apr 3'), and the third piece is dropped. The result is:

  game       day date
1    1    Monday  Apr
2    2   Tuesday  Apr
3    3 Wednesday  Apr
4    4  Thursday  Apr
5    5    Friday  Apr
6    6  Saturday  Apr
Warning message:
Too many values at 6 locations: 1, 2, 3, 4, 5, 6 

I can add the regex to select just the first space (and I checked that this regex worked in Sublime Text):

df %>% separate(date, c("day", "date"), sep='^[^\\s]*\\K\\s')

But that gives me:

  game             day date
1    1    Monday Apr 3 <NA>
2    2   Tuesday Apr 4 <NA>
3    3 Wednesday Apr 5 <NA>
4    4  Thursday Apr 6 <NA>
5    5    Friday Apr 7 <NA>
6    6  Saturday Apr 8 <NA>
Warning message:
Too few values at 6 locations: 1, 2, 3, 4, 5, 6 

So what is going wrong? How do I make this work? Or what obvious thing am I not understanding?

asked Dec 14 '22 01:12 by noLongerRandom


2 Answers

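You need to set the extra argument to "merge". With two target columns, separate() then splits only once and keeps the rest of the string in the last column: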

library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")

#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
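
As a variation on the call above, the separator can also be spelled out instead of relying on the default. This is only a minimal sketch, assuming a reduced three-row copy of the question's data frame; the names df_small and out are illustrative and not from the original post.

```r
library(tidyr)

# Reduced copy of the question's data (df_small is a hypothetical name).
df_small <- data.frame(
  game = c(1, 2, 3),
  date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5"),
  stringsAsFactors = FALSE
)

# sep = " " is a regular expression matching one space.
# extra = "merge" caps the number of pieces at length(into), so with two
# target columns only the first space is used as a split point.
out <- separate(df_small, date, into = c("day", "date"),
                sep = " ", extra = "merge")
print(out)
```

Spelling out sep makes it explicit that only spaces are split points, which may matter if the values can contain other non-alphanumeric characters.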

answered Jan 18 '23 23:01 by Psidom


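We can do this easily in base R. sub() replaces only the first run of whitespace with a comma, and read.csv() then parses the result into two fields: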

cbind(df[1], read.csv(text=sub("\\s+", ",", df$date),
             header=FALSE, col.names = c("day", "date")))
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8

Another option is extract() from tidyr, which captures the text before the first run of whitespace and everything after it as two groups:

library(tidyr)
extract(df, date, into = c("day", "date"), "(\\S+)\\s+(.*)")
#   game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
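
For comparison, the same split can be done without any regular expression by locating the first space and cutting the string there. This is a rough sketch, assuming every value contains at least one space (regexpr() returns -1 otherwise); the names dates, pos, and result are illustrative.

```r
# Reduced copy of the question's date values (hypothetical name dates).
dates <- c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5")

# Position of the first literal space in each string.
pos <- as.integer(regexpr(" ", dates, fixed = TRUE))

# Everything before the first space, and everything after it.
day  <- substr(dates, 1, pos - 1)
rest <- substr(dates, pos + 1, nchar(dates))

result <- data.frame(game = 1:3, day = day, date = rest,
                     stringsAsFactors = FALSE)
print(result)
```

If the column is stored as a factor, wrap it in as.character() before calling regexpr() and substr().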

answered Jan 19 '23 00:01 by akrun