
Confusing output from String.split

I do not understand the output of this code:

public class StringDemo {
    public static void main(String args[]) {
        String blank = "";
        String comma = ",";
        System.out.println("Output1: " + blank.split(",").length);
        System.out.println("Output2: " + comma.split(",").length);
    }
}

I got the following output:

Output1: 1
Output2: 0
sanket patel asked Jul 31 '14 10:07



2 Answers

For the first case,

System.out.println("Output1: "+blank.split(",").length);

the documentation says:

The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

No part of the empty string matches the separator, so split returns the entire string as the only element of the array. That is why the length is 1.


For the second case, splitting "," on the comma produces only empty strings, and String.split discards them, so the resulting array is empty and its length is 0.

String.split silently discards trailing separators.

See also Guava's StringsExplained page.
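The trailing-separator rule can be checked with a small sketch. The class name `TrailingSeparatorDemo`, the helper `count`, and the sample inputs are made up for illustration; only `String.split` itself comes from the JDK. It suggests that only empty strings at the end of the result are dropped, while leading and interior ones are kept.

```java
import java.util.Arrays;

class TrailingSeparatorDemo {
    // Illustrative helper (not part of the JDK): element count of a comma split.
    static int count(String input) {
        return input.split(",").length;
    }

    public static void main(String[] args) {
        // Trailing separators produce empty strings that split(",") drops
        System.out.println(count("a,b"));    // 2
        System.out.println(count("a,b,,"));  // 2
        // Leading and interior empty strings are kept
        System.out.println(Arrays.toString(",a,,b,,".split(","))); // [, a, , b]
    }
}
```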

Marco Acierno answered Sep 23 '22 09:09


Everything happens according to plan, but let's do it step by step (I hope you have some time).

According to the documentation (and source code) of the split(String regex) method:

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.

So when you invoke

split(String regex) 

you are actually getting the result of the split(String regex, int limit) method, invoked as:

split(regex, 0) 

So here limit is set to 0.
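As a quick sanity check of that equivalence, here is a minimal sketch that compares the two overloads on a few inputs. The class `DefaultLimitDemo`, the helper `sameAsLimitZero`, and the sample strings are assumptions chosen for illustration.

```java
import java.util.Arrays;

class DefaultLimitDemo {
    // Illustrative check: does split(regex) give the same array as split(regex, 0)?
    static boolean sameAsLimitZero(String input, String regex) {
        return Arrays.equals(input.split(regex), input.split(regex, 0));
    }

    public static void main(String[] args) {
        String[] samples = { "", ",", "a,b", "a,b,,", ",a" };
        for (String s : samples) {
            // Each line is expected to end with "true"
            System.out.println("\"" + s + "\" -> " + sameAsLimitZero(s, ","));
        }
    }
}
```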

You need to know a few things about this parameter:

  • If limit is positive, the result array has at most that many elements, so "axaxaxaxa".split("x",2) returns ["a", "axaxaxa"], not ["a","a","a","a","a"].
  • If limit is 0, the length of the result array is not limited, but any trailing empty strings are removed. For example:

    "fooXbarX".split("X") 

    first generates an array which looks like:

    ["foo", "bar", ""] 

    ("barX" split on "X" generates "bar" and ""), but since split removes all trailing empty strings, it returns

    ["foo", "bar"] 
  • A negative limit behaves like a limit of 0 in that it does not limit the length of the result array. The only difference is that empty strings at the end of the result array are not removed. In other words

    "fooXbarX".split("X",-1) 

    returns ["foo", "bar", ""]
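The three kinds of limit can be compared side by side with a short sketch. The class `SplitLimitDemo` and the helper `show` are hypothetical names; `show` only quotes each element so that empty strings stay visible in the output.

```java
class SplitLimitDemo {
    // Illustrative helper: render a split result with quotes around each element.
    static String show(String input, String regex, int limit) {
        String[] parts = input.split(regex, limit);
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append('"').append(parts[i]).append('"');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(show("axaxaxaxa", "x", 2)); // ["a", "axaxaxa"]
        System.out.println(show("fooXbarX", "X", 0));  // ["foo", "bar"]
        System.out.println(show("fooXbarX", "X", -1)); // ["foo", "bar", ""]
    }
}
```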


Let's take a look at the case

",".split(",").length 

which (as explained earlier) is the same as

",".split(",", 0).length 

This means that we are using the version of split which does not limit the length of the result array but removes all trailing empty strings (""). You need to understand that splitting at one place always produces two pieces.

In other words, if we split "abc" on b, we get "a" and "c".
The tricky part is to understand that if we split "abc" on c, we get "ab" and "" (an empty string).

Using this logic, if we split "," on "," we get "" and "" (two empty strings).

You can check it using split with negative limit:

for (String s : ",".split(",", -1)) {
    System.out.println("\"" + s + "\"");
}

which will print

""
""

So, as we can see, the result array is at first ["", ""].

But since by default limit is 0, all trailing empty strings are removed. In this case the result array contains only trailing empty strings, so all of them are removed, leaving an empty array [] whose length is 0.
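The removal step described here can be imitated by hand. In the sketch below, `dropTrailingEmpty` is a hypothetical helper, not a JDK method; it strips empty strings from the end of the untrimmed result obtained with limit -1. Note that it mirrors the limit-0 behaviour only for inputs in which the separator was actually found at least once.

```java
import java.util.Arrays;

class TrailingTrimSketch {
    // Hypothetical helper (not a JDK method): drop empty strings from the end of an array.
    static String[] dropTrailingEmpty(String[] parts) {
        int size = parts.length;
        while (size > 0 && parts[size - 1].isEmpty()) {
            size--;
        }
        return Arrays.copyOf(parts, size);
    }

    public static void main(String[] args) {
        String[] untrimmed = ",".split(",", -1);         // two empty strings
        String[] trimmed = dropTrailingEmpty(untrimmed); // nothing is left
        System.out.println(untrimmed.length);            // 2
        System.out.println(trimmed.length);              // 0
        System.out.println(",".split(",").length);       // 0
    }
}
```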


To answer the case with

"".split(",").length 

you need to understand that removing trailing empty strings makes sense only if such trailing empty strings were the result of splitting (and most probably are not needed).
So if there were no places at which we could split, no empty strings could have been created, and there is no point in running this "cleaning" process.

This is mentioned in the documentation of the split(String regex, int limit) method, where you can read:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.
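A minimal sketch of this no-match rule follows. The class `NoMatchDemo` and the helper `returnsOriginal` are illustrative names; the helper reports whether split handed back exactly one element equal to the input.

```java
class NoMatchDemo {
    // Illustrative check: true when split returned exactly the original string.
    static boolean returnsOriginal(String input, String regex) {
        String[] parts = input.split(regex);
        return parts.length == 1 && parts[0].equals(input);
    }

    public static void main(String[] args) {
        System.out.println(returnsOriginal("abc", ",")); // true: no comma to split on
        System.out.println(returnsOriginal("", ","));    // true: the empty string is returned as is
        System.out.println(returnsOriginal(",", ","));   // false: a match occurred, result is empty
    }
}
```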

You can also see this behaviour in the source code of this method (from Java 8):

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

where you can find the fragment

if (off == 0)
    return new String[]{this};

which means

  • if (off == 0) - if off (the position from which the method should start searching for the next possible match of the regex passed as the split argument) is still 0 after iterating over the entire string, we didn't find any match, so the string was not split
  • return new String[]{this}; - in that case, just return an array containing the original string (represented by this).

Since "," couldn't be found in "" even once, "".split(",") must return an array with one element (the empty string on which you invoked split). This means that the length of this array is 1.
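To experiment with this logic outside the JDK, the single-character path can be re-created as a standalone method. This is only a sketch under assumptions: `FastPathSketch` and `splitOnChar` are hypothetical names, the method takes the separator as a plain `char` instead of a regex, and it skips the regex checks entirely. The order of the steps (split loop, early return when nothing matched, trailing cleanup for limit 0) follows the quoted source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FastPathSketch {
    // Hypothetical helper: split on one literal character, step by step.
    static String[] splitOnChar(String input, char ch, int limit) {
        int off = 0;
        int next;
        boolean limited = limit > 0;
        List<String> list = new ArrayList<>();
        while ((next = input.indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(input.substring(off, next));
                off = next + 1;
            } else { // limit reached: the rest of the input becomes the last element
                list.add(input.substring(off));
                off = input.length();
                break;
            }
        }
        // No match at all: return the input itself, even when it is empty
        if (off == 0) {
            return new String[] { input };
        }
        // Add the remaining segment after the last separator
        if (!limited || list.size() < limit) {
            list.add(input.substring(off));
        }
        // Only limit 0 removes trailing empty strings
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return list.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(splitOnChar("", ',', 0).length);   // 1
        System.out.println(splitOnChar(",", ',', 0).length);  // 0
        System.out.println(splitOnChar(",", ',', -1).length); // 2
        System.out.println(Arrays.toString(splitOnChar("a,b,,", ',', 0))); // [a, b]
    }
}
```

The early return sits before the cleanup loop, which is why "" keeps its single empty element while "," ends up with none.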

By the way, Java 8 introduced another mechanism. It removes the leading empty string (if one was created during splitting) when the regex produces a zero-width match at the start of the string, as with "" or a look-around such as (?<!x). More info at: Why in Java 8 split sometimes removes empty strings at start of result array?
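A small sketch of that Java 8 rule, assuming it runs on Java 8 or later (older versions behave differently). The class `LeadingEmptyDemo`, the helper `count`, and the sample patterns are chosen for illustration.

```java
class LeadingEmptyDemo {
    // Illustrative helper: element count of input.split(regex).
    static int count(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        // Zero-width match at the start: no leading empty string
        System.out.println(count("abc", ""));      // 3
        System.out.println(count("abc", "(?=a)")); // 1
        // Positive-width match at the start: the leading empty string is kept
        System.out.println(count(",a", ","));      // 2
    }
}
```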

Pshemo answered Sep 20 '22 09:09