
How is Guava Splitter.onPattern(..).split() different from String.split(..)?

I recently harnessed the power of a look-ahead regular expression to split a String:

"abc8".split("(?=\\d)|\\W") 

Printed to the console (for example via Arrays.toString), this expression yields:

[abc, 8] 
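To see where the split point comes from, the following is a minimal sketch using only java.util.regex. The class name LookaheadMatches and the helper matchSpans are illustrative and not part of any library. It lists the start and end index of every match, which shows that the look-ahead produces a single zero-width match directly in front of the digit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadMatches {
    // Collects "start-end" for every match of the regex in the input.
    // A zero-width match has start == end.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        // The only match is the empty one at index 3, just before '8'.
        System.out.println(matchSpans(regex, "abc8"));            // [3-3]
        System.out.println(Arrays.toString("abc8".split(regex))); // [abc, 8]
    }
}
```

Because the match is empty, nothing is consumed as a separator: String.split simply cuts the input at index 3.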

Very pleased with this result, I wanted to transfer this to Guava for further development, which looked like this:

Splitter.onPattern("(?=\\d)|\\W").split("abc8") 

To my surprise the output changed to:

[abc] 

Why?

asked Jun 19 '15 15:06 by Fritz Duchardt




1 Answer

You found a bug!

Splitter s = Splitter.onPattern("(?=\\d)|\\W");
System.out.println(s.split("abc82")); // [abc, 8]
System.out.println(s.split("abc8"));  // [abc]

This is the method that Splitter uses to actually split Strings (Splitter.SplittingIterator::computeNext):

@Override
protected String computeNext() {
  /*
   * The returned string will be from the end of the last match to the
   * beginning of the next one. nextStart is the start position of the
   * returned substring, while offset is the place to start looking for a
   * separator.
   */
  int nextStart = offset;
  while (offset != -1) {
    int start = nextStart;
    int end;

    int separatorPosition = separatorStart(offset);

    if (separatorPosition == -1) {
      end = toSplit.length();
      offset = -1;
    } else {
      end = separatorPosition;
      offset = separatorEnd(separatorPosition);
    }

    if (offset == nextStart) {
      /*
       * This occurs when some pattern has an empty match, even if it
       * doesn't match the empty string -- for example, if it requires
       * lookahead or the like. The offset must be increased to look for
       * separators beyond this point, without changing the start position
       * of the next returned substring -- so nextStart stays the same.
       */
      offset++;
      if (offset >= toSplit.length()) {
        offset = -1;
      }
      continue;
    }

    while (start < end && trimmer.matches(toSplit.charAt(start))) {
      start++;
    }
    while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
      end--;
    }

    if (omitEmptyStrings && start == end) {
      // Don't include the (unused) separator in next split string.
      nextStart = offset;
      continue;
    }

    if (limit == 1) {
      // The limit has been reached, return the rest of the string as the
      // final item.  This is tested after empty string removal so that
      // empty strings do not count towards the limit.
      end = toSplit.length();
      offset = -1;
      // Since we may have changed the end, we need to trim it again.
      while (end > start && trimmer.matches(toSplit.charAt(end - 1))) {
        end--;
      }
    } else {
      limit--;
    }

    return toSplit.subSequence(start, end).toString();
  }
  return endOfData();
}

The area of interest is:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset >= toSplit.length()) {
    offset = -1;
  }
  continue;
}

This logic works well unless the empty match sits directly in front of the last character of the String. In that case offset++ makes offset equal to toSplit.length(), the >= check sets offset to -1, and the loop ends without ever returning that last character. The check should use > instead of >=:

if (offset == nextStart) {
  /*
   * This occurs when some pattern has an empty match, even if it
   * doesn't match the empty string -- for example, if it requires
   * lookahead or the like. The offset must be increased to look for
   * separators beyond this point, without changing the start position
   * of the next returned substring -- so nextStart stays the same.
   */
  offset++;
  if (offset > toSplit.length()) {
    offset = -1;
  }
  continue;
}
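The effect of that one-character change can be checked without Guava. The following is a simplified model of the loop, not Guava's actual code. It assumes no trimming, no limit and no omitEmptyStrings, and it uses Matcher.find(int), Matcher.start() and Matcher.end() as stand-ins for separatorStart and separatorEnd. The class name SplitLoopSketch and the fixed flag are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitLoopSketch {
    // Simplified model of the splitting loop shown above.
    // fixed == false uses the original ">=" check, fixed == true uses ">".
    static List<String> split(String regex, String toSplit, boolean fixed) {
        Matcher matcher = Pattern.compile(regex).matcher(toSplit);
        List<String> parts = new ArrayList<>();
        int offset = 0;    // where to start looking for the next separator
        int nextStart = 0; // where the next returned substring begins
        while (offset != -1) {
            int start = nextStart;
            int end;
            if (matcher.find(offset)) {
                end = matcher.start();
                offset = matcher.end();
            } else {
                end = toSplit.length();
                offset = -1;
            }
            if (offset == nextStart) {
                // Empty match at the current start: look one character further.
                offset++;
                boolean stop = fixed
                        ? offset > toSplit.length()
                        : offset >= toSplit.length();
                if (stop) {
                    offset = -1;
                }
                continue;
            }
            parts.add(toSplit.substring(start, end));
            nextStart = offset;
        }
        return parts;
    }

    public static void main(String[] args) {
        String regex = "(?=\\d)|\\W";
        System.out.println(split(regex, "abc8", false));  // [abc]
        System.out.println(split(regex, "abc8", true));   // [abc, 8]
        System.out.println(split(regex, "abc82", false)); // [abc, 8]
        System.out.println(split(regex, "abc82", true));  // [abc, 8, 2]
    }
}
```

With the >= check the model reproduces both outputs from the top of this answer. With > it returns the trailing character as well, which matches what String.split gives for the same inputs.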
answered Sep 25 '22 15:09 by Jeffrey