
R: split on delimiter not in parentheses

Tags:

regex

r

I am currently trying to split a string on the pipe delimiter: 

999|150|222|(123|145)|456|12,260|(10|10000)

The catch is that I don't want to split on | inside parentheses; I only want to split on this character outside of parentheses.

This just splits on every | character, yielding results I don't want:

x <- '999|150|222|(123|145)|456|12,260|(10|10000)'
m <- strsplit(x, '\\|')

[[1]]
[1] "999"    "150"    "222"    "(123"   "145)"   "456"    "12,260" "(10"   
[9] "10000)"

I am looking to get the following results, keeping everything inside parentheses intact:

[[1]]
[1] "999"        "150"        "222"        "(123|145)"  "456"       
[6] "12,260"     "(10|10000)"

Any help appreciated.

asked Nov 30 '22 11:11 by user3856888


1 Answer

You can switch on PCRE by using perl=TRUE and some dark magic (PCRE's backtracking control verbs):

x <- '999|150|222|(123|145)|456|12,260|(10|10000)'
strsplit(x, '\\([^)]*\\)(*SKIP)(*F)|\\|', perl=TRUE)

# [[1]]
# [1] "999"        "150"        "222"        "(123|145)"  "456"       
# [6] "12,260"     "(10|10000)"

The idea is to skip content in parentheses.

On the left side of the alternation operator, we match anything in parentheses and then force that subpattern to fail with (*F). The preceding (*SKIP) tells the regular expression engine not to backtrack into the text it just matched, so the search resumes after the closing parenthesis. The right side of the alternation operator matches | (outside of parentheses, which is what we want).
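As a rough sketch of how this could be packaged and cross-checked, the snippet below wraps the pattern in a helper and compares it with a second approach that extracts the fields instead of splitting on the delimiter. The names split_outside_parens and extract_fields are hypothetical and not part of the original answer, and both assume the parentheses are never nested.

```r
# Hypothetical helper (not from the original answer): split on a pipe
# only when it sits outside a non-nested pair of parentheses.
split_outside_parens <- function(s) {
  # Left branch: consume "(...)" and discard it with (*SKIP)(*F).
  # Right branch: the literal pipe that strsplit actually splits on.
  strsplit(s, '\\([^)]*\\)(*SKIP)(*F)|\\|', perl = TRUE)[[1]]
}

# Alternative sketch: match the fields themselves, where a field is
# either a parenthesised group or a run of non-pipe characters.
extract_fields <- function(s) {
  regmatches(s, gregexpr('\\([^)]*\\)|[^|]+', s, perl = TRUE))[[1]]
}

x <- '999|150|222|(123|145)|456|12,260|(10|10000)'
parts  <- split_outside_parens(x)
fields <- extract_fields(x)

print(parts)
print(identical(parts, fields))
```

The two approaches differ on edge cases: strsplit keeps an empty field between two adjacent pipes, while the extraction pattern drops it. Neither handles nested parentheses, because [^)]* stops at the first closing parenthesis.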

answered Dec 02 '22 02:12 by hwnd