
Split regex to extract Strings of contiguous characters


Is there a regex that would work with String.split() to break a String into runs of contiguous identical characters, i.e. split wherever the next character differs from the previous one?

Here's the test case:

String regex = "your answer here";
String[] parts = "aaabbcddeee".split(regex);
System.out.println(Arrays.toString(parts));

Expected output:

[aaa, bb, c, dd, eee] 

Although the test case has letters only as input, this is for clarity only; input characters may be any character.


Please do not provide "work-arounds" involving loops or other techniques.

The question is to find the right regex for the code as shown above, i.e. using only split() and no other method calls. It is not a question about finding code that will "do the job".

Bohemian asked Nov 28 '12 01:11




1 Answer

It is possible to do the split in one step with a single regex:

"(?<=(.))(?!\\1)" 

Since you want to split between groups of identical characters, the regex only needs to find the boundary between two groups. The positive look-behind (?<=(.)) captures the previous character into group 1, and the negative look-ahead (?!\1) (written \\1 in a Java string literal) uses a back-reference to check that the next character is not that same character.

The regex is zero-width: it consists only of two look-around assertions, so no character is consumed by the regex and nothing is removed from the input.
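To check the regex against the test case from the question, here is a minimal sketch. The class and method names (ContiguousSplit, splitRuns, splitRunsDotAll) are hypothetical and only there for illustration. The second variant is an assumption beyond the original answer: in Java, `.` does not match line terminators by default, so if the input may contain newlines, prefixing the inline `(?s)` (DOTALL) flag should let runs of `\n` be handled like any other character.

```java
import java.util.Arrays;

class ContiguousSplit {
    // Regex from the answer: the look-behind captures the previous character
    // into group 1; the negative look-ahead rejects positions where that
    // same character follows.
    static final String BOUNDARY = "(?<=(.))(?!\\1)";

    // Assumption (not part of the original answer): the inline (?s) flag
    // turns on DOTALL so that '.' also matches line terminators.
    static final String BOUNDARY_DOTALL = "(?s)" + BOUNDARY;

    // Splits using only String.split(), as the question requires.
    public static String[] splitRuns(String input) {
        return input.split(BOUNDARY);
    }

    // Same split, but with DOTALL enabled for inputs containing newlines.
    public static String[] splitRunsDotAll(String input) {
        return input.split(BOUNDARY_DOTALL);
    }

    public static void main(String[] args) {
        // Test case from the question.
        System.out.println(Arrays.toString(splitRuns("aaabbcddeee")));
        // prints [aaa, bb, c, dd, eee]

        // Input with a run of two newlines between 'a' and 'b'.
        System.out.println(splitRunsDotAll("a\n\nb").length);
        // prints 3
    }
}
```

Note that the regex also matches at the very end of the input (the look-ahead succeeds because nothing follows), but split() with the default limit of zero discards the resulting trailing empty string, so the output has no empty last element.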

nhahtdh answered Nov 24 '22 07:11