I am trying to use regular expressions to determine what format the user has applied when entering input in a textbox.
The regular expressions are as follows:
(\\s?[" + alphabet + "]{9,9})+
To determine whether the input is one or more strings of length 9 in a given alphabet, possibly separated by whitespace.
(>[\\w\\s]+\\n[" + alphabet + "\\s]+)+
To check if the input is in FASTA format
The regular expressions run terribly slowly when matching with inputString.matches(regexString). Why is this?
I figured this may be due to Java storing all potential matches (which I don't need at this point), but adding ?: in every parenthesis breaks the regex. How should this be done?
Thank you,
Martin
Edit 1: I was unable to reproduce this issue - it only happens on one computer. This could suggest something wrong with that particular VM setup.
We need something more robust, and so we will be implementing this differently. I have picked Joel's answer as the right one, since I believe that some special case in Pattern may be the cause.
String.matches() compiles the regex every time you call it. Instead, look at the Pattern/Matcher classes, which allow you to cache precompiled regexes.
Another thing is to use non-capturing regex groups if you don't need the result of the matching.
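To illustrate both suggestions, here is a minimal sketch that compiles the two expressions from the question once and reuses them, with the groups rewritten as non-capturing (the ?: goes directly after the opening parenthesis). The class name InputFormatDetector and its method names are made up for this example, and it assumes the alphabet consists only of plain letters or digits, so it can be dropped into a character class without escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the class and method names are illustrative only.
final class InputFormatDetector {
    private final Pattern fixedLength;
    private final Pattern fasta;

    InputFormatDetector(String alphabet) {
        // Assumption: alphabet holds only letters/digits, so it is safe inside [...].
        // Compiled once here instead of on every String.matches() call.
        // (?: ... ) marks a group as non-capturing; {9} is equivalent to {9,9}.
        fixedLength = Pattern.compile("(?:\\s?[" + alphabet + "]{9})+");
        fasta = Pattern.compile("(?:>[\\w\\s]+\\n[" + alphabet + "\\s]+)+");
    }

    // Matcher.matches() requires the whole input to match, like String.matches().
    boolean isFixedLength(String input) {
        return fixedLength.matcher(input).matches();
    }

    boolean isFasta(String input) {
        return fasta.matcher(input).matches();
    }
}
```

Create one detector per alphabet, for example new InputFormatDetector("ACGT"), and call isFixedLength or isFasta for each input. This removes the repeated compilation cost, but it does not change how the patterns behave during matching, so it may not help if the slowness comes from the matching itself.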
This might not explain your particular problem, but I once dug into the JDK's regex implementation and was surprised at how unsophisticated it is. It doesn't really build a state machine that advances on each input character. I assume they have their reasons.
In your case, it is easy to write a parser yourself, by hand. People fear doing that: it seems "dumb" to code these tiny steps manually, and people think established libraries must be doing some splendid tricks to outperform home-grown solutions. That's not true. In many cases our needs are rather simple, and it is simpler and faster to do it yourself.
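As a rough sketch of what such a hand-written check could look like for the first format (blocks of 9 characters from the alphabet, optionally separated by whitespace), consider the single pass below. The class name FixedLengthScanner and the method isFixedLengthBlocks are invented for this example. It is also slightly more lenient than the regex in the question: it accepts runs of several whitespace characters and trailing whitespace, and it uses Character.isWhitespace as an approximation of \s.

```java
// Hypothetical hand-written alternative to the first regex; names are illustrative.
final class FixedLengthScanner {

    // Returns true if the input consists of one or more blocks of exactly
    // blockLength alphabet characters, optionally separated by whitespace.
    static boolean isFixedLengthBlocks(String input, String alphabet, int blockLength) {
        int run = 0;              // alphabet characters seen since the last whitespace
        boolean sawBlock = false; // at least one complete block was seen

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace may only appear on a block boundary.
                if (run % blockLength != 0) {
                    return false;
                }
                run = 0;
            } else if (alphabet.indexOf(c) >= 0) {
                run++;
                if (run >= blockLength) {
                    sawBlock = true;
                }
            } else {
                return false; // character outside the alphabet
            }
        }
        // The input must not end in the middle of a block.
        return sawBlock && run % blockLength == 0;
    }
}
```

The scan looks at each character exactly once and never backtracks, so its running time grows linearly with the input length.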