I have been using Regular Expressions for a couple years now and feel comfortable with them, but I was wondering if there are any limitations when using them. I know about the limitations related to recursion (discussed here http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx ). Are there any limitations related to memory? I assume you can capture a string as large as you can fit in memory (or that the VM will allow you to).
Are there any other limitations with regexes that I should know about?
Thanks in advance,
Chris
Regexes can only parse regular grammars; for anything context-free or higher you need a stack (i.e. a real parser).
That is their only real limitation. Performance depends on the particular implementation, but it is generally slow, even precompiled, compared to a state machine.
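To make the "you need a stack" point concrete, here is a minimal Python sketch (Python is my choice, the thread doesn't name a language). The function names and the TWO_LEVELS pattern are made up for illustration. Balanced parentheses are the classic non-regular language: a plain regex can only handle the nesting depths someone wrote into it, while a counter (a degenerate stack, enough for one bracket type) handles any depth.

```python
import re


def is_balanced(text: str) -> bool:
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0  # stands in for the stack; one bracket type only needs a counter
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' showed up before its '('
                return False
    return depth == 0


# Hypothetical pattern: accepts balanced parentheses nested at most two deep.
# Every extra level of nesting would have to be spelled out by hand.
TWO_LEVELS = re.compile(r"(?:[^()]|\((?:[^()]|\([^()]*\))*\))*")


def regex_says_balanced(text: str) -> bool:
    """Return True if the fixed-depth pattern accepts the whole string."""
    return TWO_LEVELS.fullmatch(text) is not None


print(regex_says_balanced("((a))"), is_balanced("((a))"))      # True True
print(regex_says_balanced("(((a)))"), is_balanced("(((a)))"))  # False True
```

The third level of nesting is where the pattern gives up, even though the input is perfectly balanced. Some engines add recursion or balancing extensions to work around this, but at that point you are no longer dealing with regular expressions in the formal sense.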
Ginormous regexes can be quite slow and memory hungry. I know, because I have created one. It can tokenize what shouldn't be tokenized by a regex. :-) if you want a link... Now... I haven't ever benchmarked "small" regexes so I don't know their speed. They surely are compact to write.
Ah, I was forgetting: regexes are The Evil. Their main problem is that they are like a hammer: when you have one, you try to make every problem look like a nail. So their main problem is in the user (the programmer).
First "big" limitation: JavaScript has traditionally implemented only a subset of them, with little Unicode support (the u flag and \p{...} property escapes were added only in later ECMAScript editions). Normally the language you use server-side has a more complete implementation, so you end up limited by JS. Even quite complete implementations like the .NET one have big limits: no support for surrogate pairs and no support for "composed" characters (characters that use combining marks). But, as always, the problem is in the programmer. How many programmers who "know Unicode" really know its intricacies: the various sets of digits, the diacritics?
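A small Python sketch of the combining-mark problem, as an illustration only (the answer above is about .NET and JavaScript; the names below are made up). Python strings are sequences of code points, so the surrogate-pair issue does not show up here, but the "composed character" issue does: the same visible text can be one code point or two, and the regex sees the difference.

```python
import re
import unicodedata

precomposed = "caf\u00e9"   # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Intended meaning: "caf" followed by exactly one character.
pattern = re.compile(r"caf.")


def matches(text: str) -> bool:
    """Return True if the whole string is 'caf' plus one code point."""
    return pattern.fullmatch(text) is not None


def matches_normalized(text: str) -> bool:
    """Normalize to NFC first so base letter and accent are composed."""
    return matches(unicodedata.normalize("NFC", text))


print(matches(precomposed))            # True
print(matches(decomposed))             # False: '.' consumed 'e', the accent is left over
print(matches_normalized(decomposed))  # True
```

Normalizing input before matching is one common workaround. It only helps when a precomposed form exists, so treat it as a mitigation rather than a fix.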
Second "big" limitation: maintainability. They are complex and unreadable when they are written. But months later? They get worse! And if you have to train a new programmer, now he has to learn one more language: regex.
Third "big" limitation: they hide too much. You see \d\s\d. What does it mean? A digit, a space and a digit? Surely. But both \d and \s in .NET regexes "hide" a microworld. \d "matches" any Unicode decimal digit, not only the European 0-9 (and there are many, many others in Unicode). \s "matches" so many esoteric spaces whose names I don't even know... I don't even want to think about it. They are like icebergs: only 1/8 is out of the water, while 7/8 is hidden. But it's that 7/8 that will probably kill you.
Limitations
Bottom line: it is a tool. Use it like any other tool. Don't overuse it. Don't let it be the only tool in your toolkit.