Suppose I have a dataframe like this:
df<-data.frame(a=c("AA","BB"),b=c("short string","this is the longer string"))
I would like to split each string at its last space using a regex. I tried:
library(dplyr)
library(tidyr)
df%>%
separate(b,c("partA","partB"),sep=" [^ ]*$")
But this omits the second part of the string in the output. My desired output would look like this:
   a              partA  partB
1 AA              short string
2 BB this is the longer string
How do I do this? It would be nice if I could use tidyr and dplyr for this.
You may turn the [^ ]*$ part of your regex into (?=[^ ]*$), a non-consuming pattern called a positive lookahead. It does not consume the non-space characters at the end of the string, so they are not part of the match value and stay in the output:
df%>%
separate(b,c("partA","partB"),sep=" (?=[^ ]*$)")
Or, a bit more general, since it matches any whitespace characters rather than only a literal space:
df %>%
separate(b,c("partA","partB"),sep="\\s+(?=\\S*$)")
Output:
   a              partA  partB
1 AA              short string
2 BB this is the longer string
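To see why the original pattern loses the last word, here is a minimal base R sketch using strsplit on a single string. The variable names x, consumed, and kept are only illustrative, and strsplit is not how separate works internally; it merely shows the same consuming versus non-consuming behaviour.

```r
x <- "this is the longer string"

# The consuming pattern matches " string" as the separator,
# so the last word is swallowed along with the space.
consumed <- strsplit(x, " [^ ]*$")[[1]]

# The lookahead only checks what follows the space without consuming it,
# so the last word survives as its own piece (lookaheads need perl = TRUE).
kept <- strsplit(x, " (?=[^ ]*$)", perl = TRUE)[[1]]

print(consumed)
print(kept)
```

The first call returns one piece and the second returns two, which mirrors the missing partB column in the question.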
We can use extract from tidyr with capture groups ((...)). We match zero or more characters (.*) inside the first capture group ((.*)), followed by one or more whitespace characters (\\s+), followed by a second capture group that contains one or more characters that are not a space ([^ ]+) up to the end ($) of the string. Because .* is greedy, the split falls at the last space.
library(tidyr)
extract(df, b, into = c('partA', 'partB'), '(.*)\\s+([^ ]+)$')
#   a              partA  partB
#1 AA              short string
#2 BB this is the longer string
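If you want to check what the two capture groups pick up without tidyr, the following is a rough base R sketch using the same pattern with regexec and regmatches. The names b, m, and parts are hypothetical, and this is only one possible equivalent, not what extract does internally.

```r
b <- c("short string", "this is the longer string")

# regexec returns the full match plus one entry per capture group;
# perl = TRUE is used here so that the greedy .* behaves as described above.
m <- regmatches(b, regexec("(.*)\\s+([^ ]+)$", b, perl = TRUE))

# Drop the full match (element 1) and keep the two groups as columns.
parts <- do.call(rbind, lapply(m, function(g) g[2:3]))
colnames(parts) <- c("partA", "partB")

print(parts)
```

The result is a character matrix with one row per input string, which can be bound back onto the data frame with cbind if needed.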