Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting pairs of words using String.split()

Given:

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

results in this:

[one two, three four, five six, seven]

This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.

like image 740
Bohemian Avatar asked May 10 '13 15:05

Bohemian


People also ask

What is split () function in string?

The split() method splits a string into an array of substrings. The split() method returns the new array. The split() method does not change the original string. If (" ") is used as separator, the string is split between words.

What is split () method?

The Split method extracts the substrings in this string that are delimited by one or more of the strings in the separator parameter, and returns those substrings as elements of an array. The Split method looks for delimiters by performing comparisons using case-sensitive ordinal sort rules.

How do I split a string into multiple strings?

split() The method split() splits a String into multiple Strings given the delimiter that separates them. The returned object is an array which contains the split Strings. We can also pass a limit to the number of elements in the returned array.

What does split method does to a string explain with example?

The split() method splits a string into a list. You can specify the separator, default separator is any whitespace. Note: When maxsplit is specified, the list will contain the specified number of elements plus one.


2 Answers

Currently (last tested on Java 17) it is possible to do it with split(), but in real world don't use this approach since it looks like it is based on bug since look-behind in Java should have obvious maximum length, but this solution uses \w+ which doesn't respect this limitation and somehow still works - so if it is a bug which will be fixed in later releases this solution will stop working.

Instead use Pattern and Matcher classes with regex like \w+\s+\w+ which aside from being safer also avoids maintenance hell for person who will inherit such code (remember to "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").


Is this what you are looking for?
(you can replace \\w with \\S to include all non-space characters but for this example I will leave \\w since it is easier to read regex with \\w\\s then \\S\\s)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

output:

[one two, three four, five six, seven]

\G is previous match, (?<!regex) is negative lookbehind.

In split we are trying to

  1. find spaces -> \\s
  2. that are not predicted -> (?<!negativeLookBehind)
  3. by some word -> \\w+
  4. with previously matched (space) -> \\G
  5. before it ->\\G\\w+.

Only confusion that I had at start was how would it work for first space since we want that space to be ignored. Important information is that \\G at start matches start of the String ^.

So before first iteration regex in negative look-behind will look like (?<!^\\w+) and since first space do have ^\\w+ before, it can't be match for split. Next space will not have this problem, so it will be matched and informations about it (like its position in input String) will be stored in \\G and used later in next negative look-behind.

So for 3rd space regex will check if there is previously matched space \\G and word \\w+ before it. Since result of this test will be positive, negative look-behind wont accept it so this space wont be matched, but 4th space wont have this problem because space before it wont be the same as stored in \\G (it will have different position in input String).


Also if someone would like to separate on lets say every 3rd space you can use this form (based on @maybeWeCouldStealAVan's answer which was deleted when I posted this fragment of answer)

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use some bigger value that will be at least the size of length of longest word in String.


I just noticed that we can also use + instead of {1,maxWordLength} if we want to split with every odd number like every 3rd, 5th, 7th for example

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),");//every 5th comma 
like image 93
Pshemo Avatar answered Oct 21 '22 11:10

Pshemo


This will work, but maximum word length needs to be set in advance:

String input = "one two three four five six seven eight nine ten eleven";
String[] pairs = input.split("(?<=\\G\\S{1,30}\\s\\S{1,30})\\s");
System.out.println(Arrays.toString(pairs));

I like Pshemo's answer better, being shorter and usable on arbitrary word lengths, but this (as @Pshemo pointed out) has the advantage of being adaptable to groups of more than 2 words.

like image 40
maybeWeCouldStealAVan Avatar answered Oct 21 '22 11:10

maybeWeCouldStealAVan