Given:
String input = "one two three four five six seven";
Is there a regex that works with String.split()
to grab (up to) two words at a time, such that:
String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));
results in this:
[one two, three four, five six, seven]
This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.
Currently (last tested on Java 17) it is possible to do this with split(), but don't use this approach in real-world code. It appears to rely on a bug: a look-behind in Java must have an obvious maximum length, yet this solution uses \w+, which doesn't respect that limitation and somehow still works. If this is a bug and it gets fixed in a later release, this solution will stop working.
Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+, which, aside from being safer, also avoids maintenance hell for whoever inherits the code (remember to "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
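For reference, here is a minimal sketch of what the Pattern/Matcher route could look like. The class name WordPairs and the method pairs are made up for illustration, and the optional group (?:\s+\w+)? is an assumption added here so that a trailing single word such as "seven" is kept; the plain \w+\s+\w+ would skip it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordPairs {
    // One word, optionally followed by whitespace and a second word.
    // The optional group is an assumption so a lone trailing word still matches.
    private static final Pattern PAIR = Pattern.compile("\\w+(?:\\s+\\w+)?");

    // Collects each match (up to two words) in order of appearance.
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one two, three four, five six, seven]
        System.out.println(pairs("one two three four five six seven"));
    }
}
```

Note that, unlike split(), this keeps whatever whitespace sits between the two words of a pair and drops the whitespace between pairs.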
Is this what you are looking for?
(You can replace \\w with \\S to include all non-space characters, but for this example I will leave \\w, since a regex with \\w\\s is easier to read than one with \\S\\s.)
String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));
output:
[one two, three four, five six, seven]
\G matches at the end of the previous match, and (?<!regex) is a negative look-behind.
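In split we are trying to find spaces (\\s) that are not preceded ((?<!negativeLookBehind)) by a word (\\w+) that starts right where the previous match ended (\\G), that is, by \\G\\w+.
The only confusion I had at the start was how this would work for the first space, since we want that space to be ignored. The important point is that \\G initially matches the start of the String, like ^.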
So before the first iteration the regex in the negative look-behind behaves like (?<!^\\w+), and since the first space does have ^\\w+ before it, it can't be a match for split. The next space will not have this problem, so it will be matched, and information about it (like its position in the input String) will be stored in \\G and used later in the next negative look-behind.
So for the 3rd space the regex will check whether the previously matched space (\\G) followed by a word (\\w+) comes before it. Since the result of this test is positive, the negative look-behind won't accept it, so this space won't be matched. The 4th space won't have this problem, because the space before it is not the one stored in \\G (it has a different position in the input String).
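To see which spaces the look-behind actually lets through, here is a small sketch that records the index of every space the regex matches. It deliberately uses a bounded variant, \\w{1,100} instead of \\w+ (the 100 is an assumed maximum word length), so it does not depend on the unbounded look-behind behaviour discussed above. The class name SplitPointDemo and the method splitPoints are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitPointDemo {
    // Bounded variant of the look-behind; 100 is an assumed maximum word length.
    private static final Pattern EVERY_SECOND_SPACE =
            Pattern.compile("(?<!\\G\\w{1,100})\\s");

    // Returns the index of each space that split() would cut at.
    static List<Integer> splitPoints(String input) {
        List<Integer> points = new ArrayList<>();
        Matcher m = EVERY_SECOND_SPACE.matcher(input);
        while (m.find()) {
            points.add(m.start());
        }
        return points;
    }

    public static void main(String[] args) {
        // Spaces sit at indexes 3, 7, 13, 18, 23 and 27; only every second one matches.
        // prints [7, 18, 27]
        System.out.println(splitPoints("one two three four five six seven"));
    }
}
```

The spaces at 3, 13 and 23 are rejected because each has exactly one word between it and the position remembered by \\G.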
Also, if someone would like to split on, let's say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which was deleted when I posted this fragment of the answer):
input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")
Instead of 100 you can use a bigger value, as long as it is at least the length of the longest word in the String.
I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered separator, like every 3rd, 5th, or 7th, for example:
String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma
This will work, but the maximum word length needs to be set in advance:
String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));
I like Pshemo's answer better, being shorter and usable on arbitrary word lengths, but this (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.
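As a rough sketch of that adaptability, the bounded look-behind can be generated for any group size. The helper everyNthSpace and the class GroupSplit are hypothetical names, not part of any library, and the maximum word length still has to be chosen in advance.

```java
import java.util.Arrays;
import java.util.Collections;

final class GroupSplit {
    // Builds (?<=\G WORD \s WORD ... )\s with wordsPerGroup bounded words.
    // Hypothetical helper; maxWordLength must cover the longest word in the input.
    static String everyNthSpace(int wordsPerGroup, int maxWordLength) {
        String word = "\\S{1," + maxWordLength + "}";
        String group = String.join("\\s", Collections.nCopies(wordsPerGroup, word));
        return "(?<=\\G" + group + ")\\s";
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven eight nine ten eleven";
        // prints [one two three, four five six, seven eight nine, ten eleven]
        System.out.println(Arrays.toString(input.split(everyNthSpace(3, 30))));
    }
}
```

With wordsPerGroup set to 2 and maxWordLength set to 30 the helper produces exactly the regex shown above.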