
scala split on capturing group

Tags:

regex

split

scala

I am trying to split a string on only the capturing group of a regex, but I appear to be splitting on the entire match.

I would like to split hi|my~~|~|name is bob on every | that is preceded by zero or an even number of ~'s.

So my expected output is Array(hi, my~~, ~|name is bob)

I am using the regex "(?<!~)(?:~~)*(\\|)"

But "hi|my~~|~|name is bob".split("(?<!~)(?:~~)*(\\|)") returns Array[String] = Array(hi, my, ~|name is bob), because it splits on the entire ~~| after my instead of just the | that is preceded by ~~.

For example compare:

scala> "(?<!~)(?:~~)*(\\|)".r.findAllIn("hi|my~~|~|name is bob").foreach(println)
|
~~|

to

scala> "(?<!~)(?:~~)*(\\|)".r.findAllIn("hi|my~~|~|name is bob").matchData foreach { m => println(m.group(1)) }
|
|

EDIT:

Some context and clarification:

I am trying to serialize a list of strings into a single string separated by |. I cannot guarantee that | (or any character for that matter) will not appear in an individual string.

To achieve the desired functionality I want to escape all occurrences of |. I have chosen the ~ as my escape character. Before I can escape | I need to escape ~.

Once I have escaped everything I can join the list with | to get a single string representing my original list of strings.

Then later, to parse the single string back into the original list, I need to split only on unescaped |'s. I have to be careful because something like ~~| is actually an unescaped pipe even though it contains ~|. This is because the escape character is itself escaped, which means it was just a "tilde" in one of my original strings and is not meant to function as an "escape". In other words, I had a string ending in ~, which is now escaped into ~~ and joined with the next string in the list by a '|'.

OK, so if my initial list of strings is ["hi","my~","|name is bob"] I will first escape all ~'s to get ["hi","my~~","|name is bob"]. Now I will escape all |'s to get ["hi","my~~","~|name is bob"], and finally I will join with | to get the single string:

"hi|my~~|~|name is bob"

Now if I want to reverse this I need to first split on unescaped |'s, that is, any | preceded by zero or an even number of ~'s. My capturing group already isolates exactly that |, but I don't know how to make split use only the group rather than the full match (~~| for example). If I can do that, I will get ["hi","my~~","~|name is bob"]. Then I simply unescape my ~'s and my |'s, and I have arrived back at my original input:

["hi","my~","|name is bob"]
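The escape-and-join step described above can be sketched as follows. This is only an illustration under the question's assumptions (~ as the escape character, | as the separator); the names PipeJoin, escape and join are hypothetical and not from the original post.

```scala
object PipeJoin {
  // Hypothetical helper: escape the escape character first, then the separator.
  // String.replace treats both arguments literally (no regex involved).
  def escape(s: String): String =
    s.replace("~", "~~").replace("|", "~|")

  // Escape every element, then join with the unescaped separator.
  def join(parts: Seq[String]): String =
    parts.map(escape).mkString("|")
}

// PipeJoin.join(List("hi", "my~", "|name is bob"))
// gives "hi|my~~|~|name is bob"
```

The order inside escape matters: escaping | first would introduce new ~ characters that the second replace would then double.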

Imran asked Sep 20 '25 05:09


1 Answer

split always splits on the whole match of the regex, never on just one group of it, whether that group is capturing or non-capturing. So all the ~s need to move into a look-behind, where they are checked but not consumed. A simpler example:

"asdf" split "(?:s)" //Array(a, df)

The look-behind group is not part of the match, so you want to put your prefix criteria in there. Basically, you need to wrap your solution in another look-behind group. Ideally, you'd want:

"""(?<=(?<!~)(~~)*)\|"""

But unfortunately Java doesn't support look-behind groups of arbitrary length. As a workaround, you can do:

"""(?<=(?<!~)(~~){0,10})\|"""

This works for any even number of ~s up to 20 (ten ~~ pairs). You can increase the 10 if longer runs are a possibility.
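As a rough sketch, the bounded look-behind can be combined with an unescape step to get back the original list from the question. The names PipeParse, unescape and parse are hypothetical, and the single-pass unescape regex is my own choice rather than something stated in the question.

```scala
object PipeParse {
  // Bounded look-behind from the workaround above: up to ten "~~" pairs,
  // and the run of tildes must not itself be preceded by another "~".
  val unescapedPipe = """(?<=(?<!~)(~~){0,10})\|"""

  // Hypothetical helper: undo both escapes in one left-to-right pass,
  // turning "~~" into "~" and "~|" into "|".
  def unescape(s: String): String = s.replaceAll("~([~|])", "$1")

  // Limit -1 keeps trailing empty fields, which split drops by default.
  def parse(joined: String): List[String] =
    joined.split(unescapedPipe, -1).toList.map(unescape)
}

// PipeParse.parse("hi|my~~|~|name is bob")
// gives List("hi", "my~", "|name is bob")
```

The {0,10} cap is visible in practice: a | preceded by 22 tildes is no longer recognised as unescaped, so "a" + "~" * 22 + "|b" comes back as a single element.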

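If the nested look-behinds are confusing, you can also use a single negative look-behind, which is nearly equivalent. It differs only when the string itself begins with an odd run of ~s (for example ~|a): no non-~ character precedes the run there, so this version still splits on that escaped |, whereas the nested version does not: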

"""(?<![^~]~(~~){0,10})\|"""
Ben Reich answered Sep 23 '25 11:09