 

How does string.split("\\S") work [duplicate]

Tags: java, regex, ocpjp

I was working through a question from the book Oracle Certified Professional Java SE 7 Programmer Exams 1Z0-804 and 1Z0-805 by Ganesh and Sharma.

One question is:

  1. Consider the following program and predict the output:

      class Test {
    
        public static void main(String args[]) {
          String test = "I am preparing for OCPJP";
          String[] tokens = test.split("\\S");
          System.out.println(tokens.length);
        }
      }
    

    a) 0

    b) 5

    c) 12

    d) 16

Now I understand that \S is a regex that matches any non-whitespace character, so the non-space characters are treated as the delimiters. But I was puzzled as to how the regex does its matching and what the actual tokens produced by split are.

I added code to print out the tokens as follows

for (String str: tokens){
  System.out.println("<" + str + ">");
}

and I got the following output

16

<>

< >

<>

< >

<>

<>

<>

<>

<>

<>

<>

<>

< >

<>

<>

< >

So there are a lot of empty string tokens. I just do not understand this.

I would have thought that if the delimiters are non-space characters, then every letter in the text above serves as a delimiter, so there should be 21 tokens if tokens that are empty strings are counted too. I just don't understand how Java's regex engine arrives at 16. Are there any regex gurus out there who can shed light on this code for me?

Frank Brosnan asked Oct 09 '14 14:10




2 Answers

Copied from the API documentation (emphasis mine):

public String[] split(String regex)

Splits this string around matches of the given regular expression. This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

The string "boo:and:foo", for example, yields the following results with these expressions:

 Regex  Result
   :    { "boo", "and", "foo" }
   o    { "b", "", ":and:f" }

Check the second example, where the empty strings produced by the last two "o" characters are simply dropped. The answer to your question is that the "OCPJP" substring is treated as a run of separators that is not followed by any non-empty string, so the empty tokens it produces are trailing ones and get trimmed.
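To see the trimming directly, here is a minimal sketch that compares the one-argument split with the two-argument form using a negative limit, which the same Javadoc says keeps trailing empty strings. The class and method names (SplitLimitDemo, splitDefault, splitKeepAll) are made up for illustration.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // One-argument split: behaves like limit 0, so trailing empty strings are dropped.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // Negative limit: every token is kept, including trailing empty strings.
    static String[] splitKeepAll(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitDefault("boo:and:foo", "o")));
        // prints [b, , :and:f]
        System.out.println(Arrays.toString(splitKeepAll("boo:and:foo", "o")));
        // prints [b, , :and:f, , ]
    }
}
```

The two extra empty strings in the second result are the ones produced by the final "oo", which the default call silently removes.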

Pablo Lozano answered Oct 11 '22 06:10


The reason the result is 16 and not 21 is this, from the Javadoc for String.split:

Trailing empty strings are therefore not included in the resulting array.

This means, for example, that if you say

"/abc//def/ghi///".split("/")

the result will have five elements. The first will be "", since it's not a trailing empty string; the others will be "abc", "", "def", and "ghi". But the remaining empty strings are removed from the array.
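A quick way to check this is to print the array with and without a negative limit. This is only a sketch; the helper droppedTrailingEmpties is a hypothetical name, and it just subtracts the two array lengths.

```java
import java.util.Arrays;

public class TrailingEmptyDemo {
    // Number of trailing empty strings that the one-argument split discards.
    static int droppedTrailingEmpties(String input, String regex) {
        return input.split(regex, -1).length - input.split(regex).length;
    }

    public static void main(String[] args) {
        String input = "/abc//def/ghi///";
        System.out.println(Arrays.toString(input.split("/")));
        // prints [, abc, , def, ghi]
        System.out.println(Arrays.toString(input.split("/", -1)));
        // prints [, abc, , def, ghi, , , ]
        System.out.println(droppedTrailingEmpties(input, "/"));
        // prints 3
    }
}
```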

In the posted case:

"I am preparing for OCPJP".split("\\S")

it's the same thing. Since non-space characters are delimiters, each letter is a delimiter, but the OCPJP letters essentially don't count, because those delimiters result in trailing empty strings that are then discarded. So, since there are 15 letters in "I am preparing for", they are treated as delimiting 16 substrings (the first is "" and the last is " ").
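The arithmetic can be checked with a short sketch: count the \S matches, then compare the array lengths with and without a negative limit. The class name and the countMatches helper are illustrative, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonSpaceSplitDemo {
    // Counts how many times the regex matches in the input.
    static int countMatches(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";
        System.out.println(countMatches(test, "\\S"));   // prints 20 (delimiters)
        System.out.println(test.split("\\S", -1).length); // prints 21 (all tokens kept)
        System.out.println(test.split("\\S").length);     // prints 16 (trailing empties dropped)
    }
}
```

The difference of five is the empty strings that follow each letter of "OCPJP": four between the letters and one after the final P.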

ajb answered Oct 11 '22 07:10