R / stringr: split string, but keep the delimiters in the output

I searched for a solution, but there does not seem to be a clear one for R.
I am trying to split a string on a pattern of, say, a space followed by a capital letter, and I am using the stringr package for that.

library(stringr)
x <- "Foobar foobar, Foobar foobar"
str_split(x, " [:upper:]")

This returns:

[[1]]
[1] "Foobar foobar," "oobar foobar"  

The output I would like to get, however, should include the letter from the delimiter:

[[1]]
[1] "Foobar foobar," "Foobar foobar"

There is probably no out-of-the-box solution in stringr, such as back-referencing, so I would be grateful for any help.

perechen asked Jun 01 '18 20:06


1 Answer

You can split on one or more whitespace characters that are followed by an uppercase letter:

> str_split(x, "\\s+(?=[[:upper:]])")
[[1]]
[1] "Foobar foobar," "Foobar foobar" 

Here,

  • \\s+ - one or more whitespace characters
  • (?=[[:upper:]]) - a positive lookahead (a non-consuming pattern) that only checks for an uppercase letter immediately to the right of the current position, without adding it to the match, so the letter is preserved in the output.
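
To make the difference concrete, here is a minimal sketch that runs a consuming pattern and the lookahead pattern on the same input. The test string (with two spaces and a tab before the second "Foobar") and the variable names `consumed` and `kept` are made up for illustration; it assumes stringr is installed.

```r
library(stringr)

# Hypothetical input: two spaces and a tab precede the second "Foobar"
x <- "Foobar foobar,  \tFoobar foobar"

# Consuming pattern: the uppercase letter is part of the match,
# so it is removed together with the whitespace.
consumed <- str_split(x, "\\s+[[:upper:]]")[[1]]

# Lookahead pattern: only the whitespace is matched,
# so the uppercase letter stays at the start of the next piece.
kept <- str_split(x, "\\s+(?=[[:upper:]])")[[1]]

print(consumed)
print(kept)
```

With the consuming pattern the second piece starts with "oobar"; with the lookahead it starts with "Foobar", and the whole run of whitespace is dropped in both cases.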

Note that \s matches various whitespace characters (tabs and newlines included), not just plain spaces. Also, it is safer to write [[:upper:]] rather than [:upper:] if you plan to use the pattern with other regex engines such as PCRE, where a POSIX class is only valid inside a bracket expression.
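
As a rough check of that portability, the same pattern should also work without stringr through base R's `strsplit()` with `perl = TRUE`, which uses a PCRE-style engine. This is only a sketch on the string from the question; the variable name `parts` is an illustrative choice.

```r
x <- "Foobar foobar, Foobar foobar"

# perl = TRUE switches strsplit() to Perl-compatible regular expressions,
# which support lookaheads and the [[:upper:]] bracket form.
parts <- strsplit(x, "\\s+(?=[[:upper:]])", perl = TRUE)[[1]]

print(parts)
```

Because the match always includes at least one whitespace character, `strsplit()` behaves the same way as `str_split()` here.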

Wiktor Stribiżew answered Oct 11 '22 13:10