I would like to split strings like the following:
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
by dash (-), on the condition that those dashes must not be contained inside a pair of []. The expected result would be:
c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
"[klas-bst-asdas foo]")
Notes: split on - whenever it's not inside [].
There's a similar question for Python (How to split a string by commas positioned outside of parenthesis?), but I haven't yet been able to accurately adjust that to my scenario.
You could use a look ahead to verify that there is no ] following sooner than a [:
-(?![^[]*\])
So in R:
strsplit(x, "-(?![^[]*\\])", perl=TRUE)
Explanation of the pattern:
-      : match the hyphen.
(?! )  : negative look ahead: if that part is found after the previously matched hyphen, it invalidates the match of the hyphen.
[^[]   : match any character that is not a [.
*      : match any number of the previous.
\]     : match a literal ]. If this matches, it means we found a ] before finding a [. As all this happens in a negative look ahead, a match here means the hyphen is not a match.
Note that ] is a special character in regular expressions, so it is escaped with a backslash here (it does work without the escape, as the engine knows there is no matching [ preceding it, but I prefer to be clear about it being a literal). And as backslashes have a special meaning in R string literals (they also denote an escape), that backslash itself must be escaped again in the string, so it appears as \\].
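As a quick check of the pattern explained above, here is a minimal sketch that wraps the split in a helper and compares the result with the expected vector from the question. The function name split_outside_brackets is hypothetical, and the sketch assumes brackets are balanced and never nested, as in the example string.

```r
# Hypothetical helper: split on hyphens that are not inside [ ... ].
# Assumes brackets are balanced and not nested.
split_outside_brackets <- function(s) {
  # perl = TRUE is needed because look ahead is a PCRE feature.
  # strsplit() returns a list with one element per input string,
  # so [[1]] picks the pieces of the first (here: only) string.
  strsplit(s, "-(?![^[]*\\])", perl = TRUE)[[1]]
}

x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
expected <- c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
              "[klas-bst-asdas foo]")

parts <- split_outside_brackets(x)
print(identical(parts, expected))
```

With nested brackets the look ahead would likely see a [ before the closing ] and split inside the group, so this approach is only suggested for flat bracket pairs.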
Instead of splitting, extract the parts:
library(stringr)
str_extract_all(x, "(\\[[^\\[]*\\]|[^-])+")
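If adding stringr as a dependency is not desirable, the same extraction can be sketched in base R with gregexpr() and regmatches(). This is an assumed equivalent rather than part of the original answer; it reuses the same pattern with perl = TRUE.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

# Each match is a run made of whole [ ... ] groups and/or single
# non-hyphen characters, so a hyphen outside brackets ends the match.
pattern <- "(\\[[^\\[]*\\]|[^-])+"

# gregexpr() locates every match; regmatches() extracts them as a list.
parts <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
print(parts)
```

Like str_extract_all(), this returns a list with one element per input string, so [[1]] selects the character vector for x.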