Before Java 8, when we split on an empty string like

    String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

    |a|b|c|

because the empty string "" exists before and after each character. As a result, it would first generate this array

    ["", "a", "b", "c", ""]

and then remove the trailing empty strings (because we didn't explicitly provide a negative value for the limit argument), so it would finally return

    ["", "a", "b", "c"]
In Java 8 the split mechanism seems to have changed. Now when we use "abc".split("") we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"].
My first guess was that leading empty strings are now also removed, just like trailing empty strings. But this theory fails, since "abc".split("a") returns ["", "bc"], so the leading empty string was not removed.
Can someone explain what is going on here? How have the rules of split changed in Java 8?
The natural consequence is that if the string does not contain the delimiter, a singleton array containing just the input string is returned. Second, all the rightmost empty strings are removed. This is the reason ",,,".split(",") returns an empty array.
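The two rules just mentioned can be checked with a small sketch. The class and method names here (`TrailingEmptyDemo`, `splitDefault`, `splitKeepTrailing`) are made up for illustration; only `String.split(String)` and `String.split(String, int)` are real API.

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    /** Splits with the default limit of 0, which drops trailing empty strings. */
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    /** Splits with a negative limit, which keeps trailing empty strings. */
    static String[] splitKeepTrailing(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        // No delimiter in the input: a singleton array with the input itself.
        System.out.println(Arrays.toString(splitDefault("abc", ",")));      // [abc]
        // Four empty pieces are produced, then all are stripped as trailing.
        System.out.println(splitDefault(",,,", ",").length);                // 0
        // A negative limit keeps all four empty pieces.
        System.out.println(Arrays.toString(splitKeepTrailing(",,,", ","))); // [, , , ]
    }
}
```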
The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.
Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:
When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The same clause was also added to String.split in Java 8, compared to Java 7.
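As a quick illustration of that clause, the sketch below contrasts a positive-width match at the start of the input with a zero-width one. It assumes a Java 8 or later runtime; the class name `LeadingMatchDemo` and its helper are hypothetical.

```java
import java.util.Arrays;

class LeadingMatchDemo {
    /** Thin wrapper around String.split, kept only to make the demo testable. */
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // Positive-width match ("a") at index 0: the empty leading substring is kept.
        System.out.println(Arrays.toString(split("abc", "a"))); // [, bc]
        // Zero-width match ("") at index 0: no empty leading substring on Java 8+.
        System.out.println(Arrays.toString(split("abc", "")));  // [a, b, c]
    }
}
```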
Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.
Java 7:

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }
Java 8:

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }
The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.
    if (index == 0 && index == m.start() && m.start() == m.end()) {
        // no empty leading substring included for zero-width match
        // at the beginning of the input char sequence.
        continue;
    }
To make split behave consistently across versions, compatible with the behavior in Java 8: if your regex can match a zero-length string, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
(?!\A) checks that the match does not end at the beginning of the string; a match that ends there is necessarily an empty match at the beginning of the string.
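A minimal sketch of this workaround follows. The helper name `guard` and the class `SplitGuard` are invented for illustration, and the helper always wraps the regex in a non-capturing group rather than deciding whether wrapping is necessary.

```java
import java.util.Arrays;

class SplitGuard {
    /**
     * Hypothetical helper: wraps the regex in a non-capturing group and appends
     * (?!\A), so that a match ending at the start of the input is rejected.
     */
    static String guard(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // Zero-width lookahead that would otherwise match at index 0.
        String camel = guard("(?=\\p{Upper})");
        System.out.println(Arrays.toString("FooBar".split(camel))); // [Foo, Bar]

        // A positive-width match at index 0 ends at index 1, so it is unaffected.
        System.out.println(Arrays.toString("abc".split(guard("a")))); // [, bc]
    }
}
```

With the guard in place the regex never produces an empty match at index 0, so the Java 7 and Java 8 code paths shown earlier have nothing to disagree about for that position.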
There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
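One possible shape for such a custom implementation is sketched below. It is an adaptation of the Java 7 algorithm quoted earlier, moved into a static helper; the names `LegacySplit` and `split` are hypothetical, and it has not been checked against every edge case of the original method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LegacySplit {
    /** Splits like the Java 7 Pattern.split quoted above: no zero-width skip at index 0. */
    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match found.
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) {
                // Last allowed element: take the rest of the input.
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // Same check as the quoted code: treat index == 0 as "no match found".
        if (index == 0) {
            return new String[] { input.toString() };
        }

        // Add the remaining segment.
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings.
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }
}
```

For example, `LegacySplit.split(Pattern.compile(""), "abc", 0)` yields `["", "a", "b", "c"]`, the pre-Java 8 result described in the question.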