
Why in Java 8 split sometimes removes empty strings at start of result array?

Tags:

java

regex

split

java-8

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because the empty string "" exists before and after each character. As a result, it would first generate this array

["", "a", "b", "c", ""]

and would then remove trailing empty strings (because we didn't explicitly pass a negative value for the limit argument), so it would finally return

["", "a", "b", "c"]

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"].

My first guess was that leading empty strings are now also removed, just like trailing empty strings.

But this theory fails, since

"abc".split("a")

returns ["", "bc"], so the leading empty string was not removed.

Can someone explain what is going on here? How have the rules of split changed in Java 8?

376

asked Mar 28 '14 16:03

Pshemo

1 Answer

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we see that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to the documentation of String.split in Java 8.
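
The clause can be checked with a small experiment. This is only a sketch: the class name SplitLeadingDemo and the helper show are made up for illustration, and the outputs in the comments assume it runs on Java 8 or later.

```java
import java.util.Arrays;

public class SplitLeadingDemo {
    // Hypothetical helper: splits input on regex and renders the array as text.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // Positive-width match at index 0 ("a" has width 1):
        // the empty leading substring is kept.
        System.out.println(show("abc", "a")); // [, bc]

        // Zero-width match at index 0 (the empty regex matches everywhere):
        // on Java 8+ no empty leading substring is produced.
        System.out.println(show("abc", ""));  // [a, b, c]
    }
}
```

Arrays.toString prints an empty string as nothing, so "[, bc]" stands for the array ["", "bc"].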

Reference implementation

Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }

Java 8

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }

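The following code, added in Java 8, skips a zero-length match at the beginning of the input string, which explains the behavior above. For "abc".split("") the first match starts and ends at index 0, so it is skipped and no empty leading string is added. For "abc".split("a") the first match starts at 0 but ends at 1, so the check does not apply and the empty leading string is kept.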

                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions, matching the behavior in Java 8:

1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
2. If your regex can't match a zero-length string, you don't need to do anything.
3. If you don't know whether the regex can match a zero-length string or not, do both of the actions in step 1.

(?!\A) asserts that the match does not end at the beginning of the input. A match that ends at the beginning of the input can only be an empty match at the beginning, so the lookahead rejects exactly the matches that Java 8 skips.
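
As a rough sketch of step 1, the wrapping can be done in one place. The class name PortableSplit and the helper portable are made up for illustration; the helper always wraps the regex in (?:...), which is simply the safe choice from step 3, and it assumes the regex is a complete, well-formed pattern.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PortableSplit {
    // Hypothetical helper: wraps the regex in a non-capturing group and
    // appends (?!\A), so a match may never end at the start of the input.
    static Pattern portable(String regex) {
        return Pattern.compile("(?:" + regex + ")(?!\\A)");
    }

    public static void main(String[] args) {
        // The empty regex can no longer match at index 0,
        // so no empty leading string is produced.
        System.out.println(Arrays.toString(portable("").split("abc")));  // [a, b, c]

        // A positive-width match at index 0 ends at index 1, so the
        // lookahead does not reject it and the leading "" is kept.
        System.out.println(Arrays.toString(portable("a").split("abc"))); // [, bc]
    }
}
```

Because the zero-width match at index 0 is rejected by the regex itself, the result should not depend on whether the Java 8 check is present in Pattern.split.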

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.
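
Such a custom implementation could look roughly like the sketch below. It is modelled on the Java 7 listing above but only covers the limit == 0 case, and the names LegacySplit and splitJava7 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper: mimics Java 7's Pattern.split(input) with limit == 0,
    // i.e. without the Java 8 check that skips a zero-width match at index 0.
    static String[] splitJava7(Pattern pattern, CharSequence input) {
        List<String> parts = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int index = 0;

        // Add the segment before each match, including an empty leading one.
        while (m.find()) {
            parts.add(input.subSequence(index, m.start()).toString());
            index = m.end();
        }

        // Same early return as the original listing.
        if (index == 0) {
            return new String[] { input.toString() };
        }

        // Add the remaining segment, then drop trailing empty strings.
        parts.add(input.subSequence(index, input.length()).toString());
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitJava7(Pattern.compile(""), "abc")));  // [, a, b, c]
        System.out.println(Arrays.toString(splitJava7(Pattern.compile("a"), "abc"))); // [, bc]
    }
}
```

The first call returns the empty leading string that Java 8 would skip; the second behaves the same on both versions.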

196

answered Oct 22 '22 20:10

nhahtdh
