
Pattern.split slower than String.split

There are two methods:

private static void normalSplit(String base){
    base.split("\\.");
}

private static final Pattern p = Pattern.compile("\\.");

private static void patternSplit(String base){
    //use the static field above
    p.split(base);

}

And I test them like this in the main method:

public static void main(String[] args) throws Exception{
    long start = System.currentTimeMillis();
    String longstr = "a.b.c.d.e.f.g.h.i.j";//use any long string you like
    for(int i=0;i<300000;i++){
        normalSplit(longstr);//switch to patternSplit to see the difference
    }
    System.out.println((System.currentTimeMillis()-start)/1000.0);
}

Intuitively, I assumed that String.split eventually calls Pattern.compile(regex).split(...) (after a lot of extra work) to do the real work, so constructing the Pattern object in advance (it is thread-safe) should speed up the splitting.

But in fact, using the pre-constructed Pattern is much slower than calling String.split directly. I tried both on a 50-character string (in MyEclipse), and the direct call took only half the time of the pre-constructed Pattern.

Can someone tell me why this happens?

asked Apr 01 '15 14:04 by Eddie Deng




2 Answers

This may depend on the Java implementation. I'm using OpenJDK 7, where String.split does invoke Pattern.compile(regex).split(this, limit), but only if the regex does not qualify for the fast path: a single character that is not a regex metacharacter, or a backslash followed by a character that is not an ASCII letter or digit.

See here for the source code, line 2312.

public String[] split(String regex, int limit) {
   /* fastpath if the regex is a
      (1)one-char String and this character is not one of the
         RegEx's meta characters ".$|()[{^?*+\\", or
      (2)two-char String and the first char is the backslash and
         the second is not the ascii digit or ascii letter.
   */
   char ch = 0;
   if (((regex.count == 1 &&
       // a bunch of other checks and lots of low-level code
       return list.subList(0, resultSize).toArray(result);
   }
   return Pattern.compile(regex).split(this, limit);
}

Since you are splitting by "\\.", String.split takes the "fast path" and never compiles a Pattern. That is, if you are using OpenJDK.
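To see which regexes fall under the two cases in the quoted comment, here is a minimal sketch of that check as a standalone predicate. The class FastPathCheck and the method qualifiesForFastPath are hypothetical names made up for illustration; they are not part of the JDK. The sketch only mirrors the condition as quoted from the OpenJDK 7 source (including the surrogate check visible in the 7u40 listing), so other Java versions or vendors may behave differently.

```java
final class FastPathCheck {
    // Regex metacharacters listed in the quoted JDK comment.
    static final String META = ".$|()[{^?*+\\";

    static boolean isAsciiAlnum(char ch) {
        return (ch >= '0' && ch <= '9')
            || (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z');
    }

    // Hypothetical helper (not a JDK method): true if the regex matches
    // the fast-path description quoted from OpenJDK 7.
    static boolean qualifiesForFastPath(String regex) {
        char ch;
        if (regex.length() == 1 && META.indexOf(regex.charAt(0)) == -1) {
            // Case (1): one character that is not a regex metacharacter.
            ch = regex.charAt(0);
        } else if (regex.length() == 2
                && regex.charAt(0) == '\\'
                && !isAsciiAlnum(regex.charAt(1))) {
            // Case (2): backslash followed by a non-alphanumeric character.
            ch = regex.charAt(1);
        } else {
            return false;
        }
        // The quoted 7u40 source also excludes surrogate characters.
        return ch < Character.MIN_HIGH_SURROGATE
            || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        String[] samples = {"\\.", ",", ".", "[.]", "\\d", "ab"};
        for (String regex : samples) {
            System.out.println(regex + " -> " + qualifiesForFastPath(regex));
        }
    }
}
```

Under these assumptions "\\." and "," qualify, while ".", "[.]", "\\d" and "ab" would go through Pattern.compile.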

answered Sep 21 '22 11:09 by tobias_k


This is due to a change in String.split behaviour that was made in Java 7. This is what we have in 7u40:

public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        //do stuff
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

And this is what we had in 6-b14:

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}
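
If you want to check this on your own JVM, here is a rough timing sketch, assuming Java 8 or later for the lambdas. The class SplitComparison and its members are hypothetical names for illustration. It compares String.split with a pre-compiled Pattern for "\\." (which matches the fast-path condition above) and for "[.]" (which does not), and runs a warm-up pass first so the JIT compiler has a chance to kick in. This is not a rigorous benchmark, and the numbers will vary by JVM and machine.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;
import java.util.regex.Pattern;

final class SplitComparison {
    static final Pattern DOT = Pattern.compile("\\.");
    static final Pattern DOT_CLASS = Pattern.compile("[.]");

    // Accumulates result lengths so the split calls are not dead code.
    static int sink;

    static long timeNanos(Supplier<String[]> task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += task.get().length;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final String s = "a.b.c.d.e.f.g.h.i.j";
        final int n = 300_000;

        Map<String, Supplier<String[]>> tasks = new LinkedHashMap<>();
        tasks.put("String.split, escaped dot", () -> s.split("\\."));
        tasks.put("Pattern.split, escaped dot", () -> DOT.split(s));
        tasks.put("String.split, [.]", () -> s.split("[.]"));
        tasks.put("Pattern.split, [.]", () -> DOT_CLASS.split(s));

        // Warm-up pass: results are discarded.
        for (Supplier<String[]> task : tasks.values()) {
            timeNanos(task, n);
        }
        // Measured pass.
        for (Map.Entry<String, Supplier<String[]>> e : tasks.entrySet()) {
            long nanos = timeNanos(e.getValue(), n);
            System.out.println(e.getKey() + ": " + nanos / 1_000_000.0 + " ms");
        }
    }
}
```

All four variants return the same array for this input, so only the timings should differ. For anything more serious than a quick check, a dedicated harness such as JMH would be a better choice.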

answered Sep 23 '22 11:09 by nikis