Some example wall-clock times for splitting a large number of strings:
.split("[^a-zA-Z]"); // .44 seconds
.split("[^a-zA-Z]+"); // .47 seconds
.split("\\b+"); // 2 seconds
Is there an explanation for the dramatic increase? I can imagine the [^a-zA-Z] pattern being carried out by the processor as a set of four compare operations, with all four evaluated only when the test succeeds. But what about \b? Does anybody have anything to weigh in on that?
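First, it makes no sense to split on one or more zero-width assertions! A boundary has no width to repeat, so \b+ cannot match anywhere that a single \b does not. Java’s regex is not very clever (and I’m being charitable) about sane optimizations, so do not count on it to simplify the pattern for you.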
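To make the first point concrete, here is a minimal sketch comparing what each pattern from the question produces. The class name SplitBoundaryDemo, the helper splitOn, and the sample string are made up for illustration. It only compares results, not timings, and assumes a Java version where String.split accepts \b+ (which the question's timings suggest).

```java
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
class SplitBoundaryDemo {

    // Thin wrapper around String.split so the patterns can be compared side by side.
    static String[] splitOn(String regex, String input) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String text = "hello, world";

        // Splitting on a word boundary keeps the separators as array elements,
        // because the match itself consumes no characters.
        System.out.println(Arrays.toString(splitOn("\\b", text)));

        // Quantifying the zero-width assertion should not change the result.
        System.out.println(Arrays.toString(splitOn("\\b+", text)));

        // Splitting on runs of non-letters discards the separators.
        System.out.println(Arrays.toString(splitOn("[^a-zA-Z]+", text)));
    }
}
```

Note that the \b variants also return the text between the words as separate elements, so they create more substrings than the character-class patterns. That extra work may account for part of the timing difference, though I have not measured it.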
Second, never use \b in Java: it is messed up and out of sync with \w, so the two can disagree about which characters belong to a word.
For a more complete explanation of this, especially how to make it work with Unicode, see this answer.
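As a rough illustration of the mismatch, the sketch below checks how \w and \b treat a single non-ASCII letter, with and without Pattern.UNICODE_CHARACTER_CLASS (available since Java 7). The class and method names are hypothetical, and using that flag is my assumption about one way to get consistent behaviour, not necessarily what the linked answer recommends. The default result for \b depends on the Java version (as far as I know, newer JDKs changed the default to agree with \w), so no expected output is shown for that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class WordClassDemo {

    // True if the whole string is a single \w character under the given flags.
    static boolean isWordChar(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // Counts the positions in s where \b matches under the given flags.
    static int countBoundaries(String s, int flags) {
        Matcher m = Pattern.compile("\\b", flags).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // the letter e with an acute accent

        // By default \w is ASCII-only, so this letter is not a word character.
        System.out.println("default \\w: " + isWordChar(accented, 0));

        // Version-dependent: a non-zero count here means \b disagrees with \w.
        System.out.println("default \\b count: " + countBoundaries(accented, 0));

        // With the Unicode flag, \w and \b both treat the letter as a word character.
        int u = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println("unicode \\w: " + isWordChar(accented, u));
        System.out.println("unicode \\b count: " + countBoundaries(accented, u));
    }
}
```

If the default \b count is non-zero while the default \w check is false, the runtime is showing exactly the inconsistency described above.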