Why are most string manipulations in Java based on regexp?

Tags:

In Java there are a bunch of methods that all have to do with manipulating Strings. The simplest example is the String.split("something") method.

Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s). Which makes then all very powerful building blocks.

Now there are two effects you'll see in many of those methods:

They recompile the expression each time the method is invoked. As such they impose a performance impact.
I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: It's usually called with a single char (usually a ' ', a ';' or a '&') to split by.

So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if it was known to be a single char. Both are significantly faster than the "standard" split method.

So I was wondering: why was the Java API chosen the way it is now? What was the good reason to go for this instead of having a something like split(char) and split(String) and a splitRegex(String) ??

Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.

Short summary: It makes a big difference!

I did 10000000 iterations for each test case, always using the input

"aap,noot,mies,wim,zus,jet,teun"

and always using ',' or "," as the split argument.

This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):

fastSplit STRING Test  1 : 11405 milliseconds: Split in several pieces Test  2 :  3018 milliseconds: Split in 2 pieces Test  3 :  4396 milliseconds: Split in 3 pieces  homegrown fast splitter based on char Test  4 :  9076 milliseconds: Split in several pieces Test  5 :  2024 milliseconds: Split in 2 pieces Test  6 :  2924 milliseconds: Split in 3 pieces  homegrown splitter based on char that always splits in 2 pieces Test  7 :  1230 milliseconds: Split in 2 pieces  String.split(regex) Test  8 : 32913 milliseconds: Split in several pieces Test  9 : 30072 milliseconds: Split in 2 pieces Test 10 : 31278 milliseconds: Split in 3 pieces  String.split(regex) using precompiled Pattern Test 11 : 26138 milliseconds: Split in several pieces  Test 12 : 23612 milliseconds: Split in 2 pieces Test 13 : 24654 milliseconds: Split in 3 pieces  StringTokenizer Test 14 : 27616 milliseconds: Split in several pieces Test 15 : 28121 milliseconds: Split in 2 pieces Test 16 : 27739 milliseconds: Split in 3 pieces

As you can see it makes a big difference if you have a lot of "fixed char" splits to do.

To give you guys some insight; I'm currently in the Apache logfiles and Hadoop arena with the data of a big website. So to me this stuff really matters :)

Something I haven't factored in here is the garbage collector. As far as I can tell compiling a regular expression into a Pattern/Matcher/.. will allocate a lot of objects, that need to be collected some time. So perhaps in the long run the differences between these versions is even bigger .... or smaller.

My conclusions so far:

Only optimize this if you have a LOT of strings to split.
If you use the regex methods always precompile if you repeatedly use the same pattern.
Forget the (obsolete) StringTokenizer
If you want to split on a single char then use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).

P.S. I'm giving you all my homegrown split by char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them .. yet. Have fun.

private static String[]         stringSplitChar(final String input,                         final char separator) {     int pieces = 0;      // First we count how many pieces we will need to store ( = separators + 1 )     int position = 0;     do {         pieces++;         position = input.indexOf(separator, position + 1);     } while (position != -1);      // Then we allocate memory     final String[] result = new String[pieces];      // And start cutting and copying the pieces.     int previousposition = 0;     int currentposition = input.indexOf(separator);     int piece = 0;     final int lastpiece = pieces - 1;     while (piece < lastpiece) {         result[piece++] = input.substring(previousposition, currentposition);         previousposition = currentposition + 1;         currentposition = input.indexOf(separator, previousposition);     }     result[piece] = input.substring(previousposition);      return result; }  private static String[]         stringSplitChar(final String input,                         final char separator,                         final int maxpieces) {     if (maxpieces <= 0) {         return stringSplitChar(input, separator);     }     int pieces = maxpieces;      // Then we allocate memory     final String[] result = new String[pieces];      // And start cutting and copying the pieces.     int previousposition = 0;     int currentposition = input.indexOf(separator);     int piece = 0;     final int lastpiece = pieces - 1;     while (currentposition != -1 && piece < lastpiece) {         result[piece++] = input.substring(previousposition, currentposition);         previousposition = currentposition + 1;         currentposition = input.indexOf(separator, previousposition);     }     result[piece] = input.substring(previousposition);      // All remaining array elements are uninitialized and assumed to be null     return result; }  private static String[]         stringChop(final String input,                    final char separator) {     String[] result;     // Find the separator.     final int separatorIndex = input.indexOf(separator);     if (separatorIndex == -1) {         result = new String[1];         result[0] = input;     }     else {         result = new String[2];         result[0] = input.substring(0, separatorIndex);         result[1] = input.substring(separatorIndex + 1);     }     return result; }

461

asked Jul 29 '10 12:07

Niels Basjes

2 Answers

Note that the regex need not be recompiled each time. From the Javadoc:

An invocation of this method of the form str.split(regex, n) yields the same result as the expression

Pattern.compile(regex).split(str, n)

That is, if you are worried about performance, you may precompile the pattern and then reuse it:

Pattern p = Pattern.compile(regex); ... String[] tokens1 = p.split(str1);  String[] tokens2 = p.split(str2);  ...

instead of

String[] tokens1 = str1.split(regex); String[] tokens2 = str2.split(regex); ...

I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.

My feeling (which I can't back with any statistical evidence) is that most of the cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO rare are the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense.

It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.

answered Sep 30 '22 14:09

Péter Török

I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.

Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.

It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no builtin method to do a plain string replace or split. You have to use a regex replace with a Pattern.quoted string. And you only even get that from Java 5 onwards. Hopeless.

@Tim Pietzcker:

Are there other languages that do the same?

JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:

s.split(findstr).join(replacestr)

Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.

(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)

answered Sep 30 '22 15:09

bobince

Related questions
                            
                                Limiting the max size of a HashMap in Java
                            
                                How can I wait for 10 second without locking application UI in android [duplicate]
                            
                                When is it desired to not implement toString() in Java?
                            
                                Native crash of RenderThread with signal 6 (SIGABRT)
                            
                                Is there a way to split strings with String.split() and include the delimiters? [duplicate]
                            
                                How to download spring framework zip file [duplicate]
                            
                                Cascading examples failed to compile?
                            
                                USB bulkTransfer between Android tablet and camera
                            
                                PDF find out if text is underlined or a table cell
                            
                                Are there good reasons for a public constructor of an abstract class
                            
                                Immutability and reordering
                            
                                How do I use Comet with Spring MVC?
                            
                                Ordering unit tests in Eclipse's JUnit view
                            
                                Using Java with Microsoft Visual Studio 2012
                            
                                Spring Security custom authentication filter using Java Config
                            
                                How to choose between JAX-RS and JAX-WS web services implementation?
                            
                                Closing URLConnection and InputStream correctly?
                            
                                Maven compile: package does not exist
                            
                                How are java interfaces implemented internally? (vtables?)
                            
                                Gradle compileJava task warning: [options] bootstrap class path not set in conjunction with -source 1.6

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Why are most string manipulations in Java based on regexp?

Tags:

java

string

regex

language-design

Niels Basjes

People also ask

2 Answers

Péter Török

bobince

Recent Activity

Donate For Us