
How to split a string by dashes outside of square brackets

Tags: regex, r

I would like to split strings like the following:

x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

by dash (-), but only at dashes that are not inside a pair of []. The expected result would be:

c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
     "[klas-bst-asdas foo]")

Notes:

  • There is no nesting of square brackets inside each other.
  • The square brackets can contain any characters / numbers / symbols except square brackets.
  • The other parts of the string are also variable so that we can only assume that we split by - whenever it's not inside [].

There's a similar question for Python (How to split a string by commas positioned outside of parenthesis?), but I haven't yet been able to adapt it accurately to my scenario.

talat asked Dec 08 '22 17:12


2 Answers

You could use a negative lookahead to verify that no ] follows the hyphen before the next [:

-(?![^[]*\])

So in R:

strsplit(x, "-(?![^[]*\\])", perl=TRUE)
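
As a quick check, the call can be run against the string from the question. This is a minimal sketch; the names parts and expected are illustrative and not part of the answer itself.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

# strsplit() returns a list with one element per input string,
# so take the first element to get a plain character vector.
parts <- strsplit(x, "-(?![^[]*\\])", perl = TRUE)[[1]]

# The result the question asks for.
expected <- c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
              "[klas-bst-asdas foo]")

# Compare the split against the expected vector.
print(identical(parts, expected))
```

Note that perl = TRUE is required here, since lookaheads are a PCRE feature.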

Explanation:

  • -: match the hyphen
  • (?! ): negative lookahead: if the pattern inside it is found right after the matched hyphen, the hyphen is rejected as a match.
    • [^[]: match any character that is not a [
    • *: match any number of the previous
    • \]: match a literal ]. If this matches, we found a ] before finding a [, which means the hyphen sits inside a pair of brackets. As all this happens in a negative lookahead, a match here means the hyphen is not a match.

Note that ] is a special character in regular expressions (it closes a character class), so I escape it with a backslash. It does work without the escape, as the engine knows there is no matching [ preceding it, but I prefer to be clear about it being a literal. And as backslashes also have a special meaning in R string literals (they denote an escape there too), that backslash must itself be escaped in the string, so it appears as \\].
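
To see which hyphens survive the lookahead, and to check the remark that the escape on ] is optional, one can locate the matches with gregexpr. This is only a sketch; escaped, unescaped and matched_chars are illustrative names.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

# Start positions of the hyphens accepted as split points.
escaped <- as.integer(gregexpr("-(?![^[]*\\])", x, perl = TRUE)[[1]])

# Same pattern with the closing bracket left unescaped.
unescaped <- as.integer(gregexpr("-(?![^[]*])", x, perl = TRUE)[[1]])

# Every reported position should hold a hyphen.
matched_chars <- substring(x, escaped, escaped)

print(escaped)
print(identical(escaped, unescaped))
```

Only the hyphens outside the brackets are reported; the ones inside [def-ghu-jkl---] and [klas-bst-asdas foo] are skipped because a ] is reachable from them without crossing a [.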

trincot answered Jan 17 '23 15:01


Instead of splitting, extract the parts:

library(stringr)
str_extract_all(x, "(\\[[^\\[]*\\]|[^-])+")
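
For readers without stringr, the same extraction idea can be sketched in base R with gregexpr and regmatches. This is an adaptation, not the answer's own code: perl = TRUE is used so the pattern is interpreted by PCRE, and pattern and tokens are illustrative names.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"

# A token is one or more of: a whole bracketed group, or a single
# character that is not a hyphen.
pattern <- "(\\[[^\\[]*\\]|[^-])+"

# gregexpr() finds all matches; regmatches() extracts them as a list
# with one element per input string.
tokens <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(tokens)
```

One difference from splitting: two consecutive hyphens outside brackets produce an empty string with strsplit, whereas extraction simply skips them, which may or may not be what you want.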

Christoph Wolk answered Jan 17 '23 16:01