
Regex speed in Java

Some example wall-clock times for splitting a large number of strings:

.split("[^a-zA-Z]"); // .44 seconds
.split("[^a-zA-Z]+"); // .47 seconds
.split("\\b+"); // 2 seconds

Any explanations for the dramatic increase? I can imagine the [^a-zA-Z] pattern being executed by the processor as a set of four compare operations, with all four happening only in the true case. But what about \b? Does anybody have anything to weigh in on that?
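For anyone who wants to reproduce numbers like these, here is a minimal timing sketch. The class name `SplitTiming`, the sample sentence, and the input size are assumptions made for illustration, not the original test setup, and a single timed pass like this is only a rough measurement (a harness such as JMH would be more reliable).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitTiming {
    // Sink for the token count so the JIT cannot discard the split work.
    static volatile long sink;

    // Splits every input with the given regex and returns the elapsed
    // wall-clock time in nanoseconds.
    static long timeSplit(String regex, String[] inputs) {
        Pattern pattern = Pattern.compile(regex); // compile once, outside the timed loop
        long tokens = 0;
        long start = System.nanoTime();
        for (String s : inputs) {
            tokens += pattern.split(s).length;
        }
        long elapsed = System.nanoTime() - start;
        sink = tokens;
        return elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical workload: the same sentence repeated many times.
        String[] inputs = new String[100_000];
        Arrays.fill(inputs, "The quick brown fox, jumps over the lazy dog.");

        for (String regex : new String[] {"[^a-zA-Z]", "[^a-zA-Z]+", "\\b+"}) {
            timeSplit(regex, inputs); // warm-up pass, result ignored
            long nanos = timeSplit(regex, inputs);
            System.out.printf("%-12s %d ms%n", regex, nanos / 1_000_000);
        }
    }
}
```

Absolute times will differ by machine and JVM version; only the relative ordering of the three patterns is worth comparing.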

Jeff Ferland asked Jun 16 '26 09:06



1 Answer

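First, it makes no sense to split on one or more zero-width assertions! \b matches a position rather than any characters, so repeating it with + cannot consume anything more. Java’s regex is not very clever (and I’m being charitable) about sane optimizations, so it does not simplify the pattern for you.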
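To see why the patterns are not interchangeable in the first place, here is a small sketch comparing what each split returns. The class and method names are made up for illustration. Splitting on [^a-zA-Z]+ consumes the delimiters, while splitting on \b consumes nothing, so the spaces and punctuation between words come back as tokens of their own.

```java
import java.util.Arrays;

public class BoundarySplitDemo {
    // Splits on runs of non-letters: the delimiters are consumed and dropped.
    static String[] splitOnNonLetters(String s) {
        return s.split("[^a-zA-Z]+");
    }

    // Splits at word boundaries: a boundary is a position, not a character,
    // so the text between words is kept as separate tokens.
    static String[] splitOnBoundaries(String s) {
        return s.split("\\b");
    }

    public static void main(String[] args) {
        String text = "ab, cd";
        System.out.println(Arrays.toString(splitOnNonLetters(text)));
        System.out.println(Arrays.toString(splitOnBoundaries(text)));
    }
}
```

If only the words are wanted, the character-class version does the job directly; the \b version leaves extra tokens that have to be filtered out afterwards.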

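Second, never use \b in Java: it is messed up and out of sync with \w. By default \w matches only the ASCII set [a-zA-Z_0-9], while \b has historically used a different, partly Unicode-aware notion of a word character, so the two can disagree about where a word begins and ends.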

For a more complete explanation of this, especially how to make it work with Unicode, see this answer.
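As a rough illustration of the Unicode issue, and not necessarily the approach the linked answer takes, one way to avoid \b altogether is to split on \W+ and turn on `Pattern.UNICODE_CHARACTER_CLASS` (available since Java 7), which makes \w and \W follow Unicode definitions. The class and method names below are hypothetical, and the accented letters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UnicodeWordSplit {
    // Default mode: \W treats anything outside [a-zA-Z_0-9] as a delimiter,
    // so accented letters break words apart.
    private static final Pattern ASCII_NON_WORD = Pattern.compile("\\W+");

    // With UNICODE_CHARACTER_CLASS, \W follows the Unicode definition of a
    // word character, so accented letters stay inside words.
    private static final Pattern UNICODE_NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] splitAscii(String s) {
        return ASCII_NON_WORD.split(s);
    }

    static String[] splitUnicode(String s) {
        return UNICODE_NON_WORD.split(s);
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // "naive cafe" with i-diaeresis and e-acute
        System.out.println(Arrays.toString(splitAscii(text)));
        System.out.println(Arrays.toString(splitUnicode(text)));
    }
}
```

The flag makes matching somewhat slower, so it is worth enabling only when the input really contains non-ASCII text.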

tchrist answered Jun 19 '26 00:06



