I understand that Java regular expressions need to be compiled in order to do any type of regular expression pattern matching on strings, but I don't understand why they need to be compiled.
What the is the more efficient representation that a regex string is compiled to? And how is that representation more efficient than a String?
In general, Regular Expression engines use a set of instructions to know how to walk through the target text and match portions of it. The high level (human-readable) pattern that we as developers write is like your source code in Java (or really any other language). The computer does not run your source code, it compiles that into instructions the computer can understand. Similarly, your RegEx pattern is compiled into a set of instructions that the RegEx engine (regardless of programming language) can process.
I personally find the Regular-Expressions.info site very helpful for lots of explanations, although their explanation of how the engine works internally is a bit light. This answer on SO is decent, with some other links.
If you want a more in depth answer, I would look at this page which talks about the nature of Regular Expression engines, which is that they are finite state machines.
Regular expression engines are implemented as finite state machines (FSM). The pattern you supply is compiled into a data structure that represents this state machine.
When you match a string against this pattern, the regex engine takes each character and decides the state transition within the FSM. If there are no valid state transitions for an input character the match fails.
One of the states in the FSM is a terminating/end state. If the regex engine gets there it reports success.
To answer your "how is that more efficient than a string" question, it can't be a string... you have to get the low-level instructions for the engine. A String type isn't a set of instructions!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With