I am trying to split a string on only the capturing group of a regex, but I appear to be splitting on the entire match.
I would like to split hi|my~~|~|name is bob on |'s preceded by zero or any even number of ~'s.
So my expected output is Array(hi, my~~, ~|name is bob)
I am using the regex "(?<!~)(?:~~)*(\\|)"
But "hi|my~~|~|name is bob".split("(?<!~)(?:~~)*(\\|)") is returning Array[String] = Array(hi, my, ~|name is bob) because it is splitting on the entire ~~| after my instead of just the | that is preceded by ~~.
For example compare:
scala> "(?<!~)(?:~~)*(\\|)".r.findAllIn("hi|my~~|~|name is bob").foreach(println)
|
~~|
to
scala> "(?<!~)(?:~~)*(\\|)".r.findAllIn("hi|my~~|~|name is bob").matchData foreach { m => println(m.group(1)) }
|
|
EDIT:
Some context and clarification:
I am trying to serialize a list of strings into a single string separated by |. I cannot guarantee that | (or any character for that matter) will not appear in an individual string.
To achieve the desired functionality I want to escape all occurrences of |. I have chosen ~ as my escape character. Before I can escape | I need to escape ~.
Once I have escaped everything I can join the list with | to get a single string representing my original list of strings.
Then later, to parse the single string back into the original list, I need to split only on unescaped |'s. I have to be careful because something like ~~| is actually an unescaped pipe even though it contains ~|. This is because the escape character is itself escaped, which means it was just a tilde in one of my original strings and is not meant to function as an escape. In other words, I had a string ending in ~, which is now escaped into ~~ and joined with the next string in the list by a |.
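The escape-then-join scheme described above can be sketched roughly as follows. This is only an illustration of the steps as stated in this question; the object name EscapeJoin and the helpers escape and serialize are hypothetical names, not part of any library.

```scala
object EscapeJoin {
  // Hypothetical helper: escape the escape character first, then the delimiter.
  // String.replace treats both arguments literally (no regex involved).
  def escape(s: String): String =
    s.replace("~", "~~").replace("|", "~|")

  // Hypothetical helper: escape every element, then join with the bare delimiter.
  def serialize(items: Seq[String]): String =
    items.map(escape).mkString("|")

  def main(args: Array[String]): Unit = {
    // prints: hi|my~~|~|name is bob
    println(serialize(Seq("hi", "my~", "|name is bob")))
  }
}
```

Escaping ~ before | matters: doing it the other way round would double the tildes that were just added in front of each pipe.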
OK, so if my initial list of strings is ["hi","my~","|name is bob"], I will first escape all ~'s to get ["hi","my~~","|name is bob"]. Now I will escape all |'s to get ["hi","my~~","~|name is bob"], and finally I will join with | to get the single string:
"hi|my~~|~|name is bob"
Now if I want to reverse this I need to first split on unescaped |'s, which is any | preceded by zero or an even number of ~'s. So if I can achieve this with my regex (so far I am capturing this correctly in my capturing group, but I just don't know how to apply only the group, and not the full match such as ~~|, to the split), then I will get ["hi","my~~","~|name is bob"]. Now I simply unescape my ~'s, unescape my |'s, and I have arrived back at my original input:
["hi","my~","|name is bob"]
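For the final unescaping step, a minimal sketch might look like this. It assumes the pieces have already been split on unescaped pipes, so every remaining | is preceded by an odd run of ~. The object name Unescape and the helper unescape are hypothetical, and this version happens to undo the pipe escapes first and the doubled tildes second.

```scala
object Unescape {
  // Hypothetical helper: turn "~|" back into "|", then collapse "~~" into "~".
  // Assumes the input is one well-formed piece produced by the escaping scheme.
  def unescape(s: String): String =
    s.replace("~|", "|").replace("~~", "~")

  def main(args: Array[String]): Unit = {
    val pieces = Seq("hi", "my~~", "~|name is bob") // assumed already split
    // prints: List(hi, my~, |name is bob)
    println(pieces.map(unescape))
  }
}
```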
You need all the ~s to be inside the look-behind group, since split splits on the whole match of the regex, not just on one group of it; text matched by a non-capturing group is still part of the match. A simpler example:
"asdf" split "(?:s)" //Array(a, df)
The look-behind group is not part of the match, so you want to put your prefix criteria in there. Basically, you need to wrap your solution in another look-behind group. Ideally, you'd want:
"""(?<=(?<!~)(~~)*)\|"""
But unfortunately Java doesn't support look-behind groups of arbitrary length. As a workaround, you can do:
"""(?<=(?<!~)(~~){0,10})\|"""
This works for any even number of ~s, as long as there are 20 or fewer. You can raise the 10 if that is not enough.
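As a quick check, the bounded look-behind can be applied to the string from the question. This is a sketch only; the object name SplitUnescaped and the helper splitUnescaped are hypothetical, and passing -1 as the limit is an added choice so that trailing empty strings are kept rather than dropped.

```scala
object SplitUnescaped {
  // Bounded look-behind: a pipe preceded by 0 to 10 pairs of ~,
  // with no further ~ directly in front of those pairs.
  val Delimiter: String = """(?<=(?<!~)(~~){0,10})\|"""

  // Hypothetical helper: split on unescaped pipes only.
  // A limit of -1 keeps trailing empty strings in the result.
  def splitUnescaped(s: String): List[String] =
    s.split(Delimiter, -1).toList

  def main(args: Array[String]): Unit = {
    // prints: List(hi, my~~, ~|name is bob)
    println(splitUnescaped("hi|my~~|~|name is bob"))
  }
}
```

Only the pipe itself is consumed here, because everything in front of it sits inside the look-behind and is therefore not part of the match.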
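If the nested look-behinds are confusing, you can also use a single negative look-behind. It behaves the same except when the string starts with an odd run of ~s (for example "~|a"), because [^~] needs an actual character to match:
"""(?<![^~]~(~~){0,10})\|"""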