
Limitations of Regular Expressions? [closed]

Tags:

regex

I have been using regular expressions for a couple of years now and feel comfortable with them, but I was wondering if there are any limitations when using them. I know about the limitations related to recursion (discussed here: http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx ). Are there any limitations related to memory? I assume you can capture a string as large as you can fit in memory (or as large as the VM will allow).

Are there any other limitations with regexes that I should know about?

Thanks in advance,

Chris

ChrisJF asked Oct 21 '11 19:10


3 Answers

Regexes can only parse regular grammars; for anything context-free or higher you need a stack (i.e. a real parser).

That is their only real limitation. Performance depends on the particular implementation, but it is generally slow, even when precompiled, compared to a state machine.
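
To make the stack point concrete, here is a minimal Python sketch (not part of the original answer). `is_balanced` and `TWO_LEVELS` are hypothetical names chosen for illustration. The regex handles nesting only up to a fixed depth that has to be written out by hand, while the stack-based check handles any depth.

```python
import re

# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text: str) -> bool:
    """Return True if all brackets in text are properly nested."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # leftover openers mean the text is unbalanced


# A classic regex can only spell out a fixed nesting depth.
# This one accepts parentheses nested at most two levels deep.
TWO_LEVELS = re.compile(r"^(?:[^()]|\((?:[^()]|\([^()]*\))*\))*$")

if __name__ == "__main__":
    for sample in ["((a))", "(((a)))", "(()"]:
        print(sample, is_balanced(sample), bool(TWO_LEVELS.match(sample)))
```

Supporting a third level in the regex means nesting the pattern inside itself once more, and so on for every extra level, whereas the stack version needs no change.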

Garrett Hall answered Oct 08 '22 05:10


Ginormous regexes can be quite slow and memory hungry. I know, because I have created one. It can tokenize what shouldn't be tokenized by a regex. :-) if you want a link... Now... I haven't ever benchmarked "small" regexes so I don't know their speed. They surely are compact to write.

Ah, I was forgetting: regexes are The Evil. Their main problem is that they are like a hammer: when you have one, you try to make every problem look like a nail. So their main problem is in the user (the programmer).

First "big" limitation: JavaScript implements only a subset of them, with no Unicode support (at least at the time of writing). Normally the language you use server-side has a more complete implementation, so you end up limited by JS. Even quite complete implementations like the .NET one have big limits: no support for surrogate pairs and no support for "composed" characters (characters that use combining marks). But, as always, the problem is in the programmer. How many programmers who know Unicode also know its intricacies: the various sets of digits, the diacritics?
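
The "composed" character problem is not specific to .NET. Below is a minimal sketch in Python (an assumption: the answer talks about .NET and JavaScript, and Python's `re` module is used here only to illustrate the same class of issue). The names `composed`, `decomposed` and `SINGLE_CHAR` are made up for the example.

```python
import re
import unicodedata

composed = "\u00e9"      # 'é' stored as one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# '.' matches one code point, not one user-perceived character.
SINGLE_CHAR = re.compile(r"^.$")

matches_composed = SINGLE_CHAR.match(composed) is not None
matches_decomposed = SINGLE_CHAR.match(decomposed) is not None

# Normalizing to NFC before matching folds the pair into one code point.
normalized = unicodedata.normalize("NFC", decomposed)
matches_normalized = SINGLE_CHAR.match(normalized) is not None

if __name__ == "__main__":
    print(matches_composed, matches_decomposed, matches_normalized)
```

Both strings render as the same letter on screen, yet only one of them is "a single character" as far as the pattern is concerned. Normalizing the input first is one possible workaround, and it only helps for sequences that have a precomposed form.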

Second "big" limitation: maintainability. They are complex and unreadable when they are written. But months later? They get worse! And if you have to train a new programmer, now he has to learn one more language: regex.

Third "big" limitation: they hide too much. You see \d\s\d. What does it mean? A digit, a space and a digit? Surely. But both \d and \s in the .NET regexes "hide" a microworld. \d also "matches" any non-European digit (and there are many, many of them in Unicode). \s "matches" so many esoteric spaces whose names I don't even know... I don't even want to think about it. They are like icebergs: only 1/8 is out of the water, while 7/8 is hidden. But it's that 7/8 that will probably kill you.
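
A small sketch of the hidden part of the iceberg, written in Python rather than .NET (an assumption made for illustration: Python 3's `re` is also Unicode-aware by default for `str` patterns and behaves similarly here). The constant and pattern names are hypothetical.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # the digit three in Arabic-Indic script
NO_BREAK_SPACE = "\u00a0"      # a space that is not the ASCII space

# "digit, space, digit" built from characters few people would expect.
sample = f"{ARABIC_INDIC_THREE}{NO_BREAK_SPACE}7"

unicode_pattern = re.compile(r"\d\s\d")          # Unicode-aware (default)
ascii_pattern = re.compile(r"\d\s\d", re.ASCII)  # \d is [0-9], \s is ASCII whitespace
explicit_pattern = re.compile(r"[0-9] [0-9]")    # says exactly what it accepts

if __name__ == "__main__":
    for name, pattern in [
        ("unicode", unicode_pattern),
        ("ascii", ascii_pattern),
        ("explicit", explicit_pattern),
    ]:
        print(name, pattern.fullmatch(sample) is not None)
```

If only ASCII digits and plain spaces are wanted, spelling the classes out or using an ASCII-only flag (`re.ASCII` in Python; .NET offers `RegexOptions.ECMAScript`) makes that intent visible in the pattern.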

xanatos answered Oct 08 '22 04:10


Limitations

  1. They cannot solve everything. (Anyone on SO will tell you what happens when you try to parse HTML with regex.)
  2. They should not be used for everything, because of readability and performance issues. Use them where appropriate: not for simple tasks, like finding a substring of a string, and also not for overly complex tasks.
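
As an illustration of the "simple task" point, here is a hedged Python sketch with hypothetical helper names. A plain substring test needs no regex at all, and the regex version is only correct if the search text is escaped first.

```python
import re


def contains_plain(haystack: str, needle: str) -> bool:
    """Substring test with the built-in operator: no pattern syntax involved."""
    return needle in haystack


def contains_regex(haystack: str, needle: str) -> bool:
    """Same test via a regex; the needle must be escaped to be taken literally."""
    return re.search(re.escape(needle), haystack) is not None


if __name__ == "__main__":
    text = "price: $5.00"
    print(contains_plain(text, "$5.00"))         # literal search
    print(contains_regex(text, "$5.00"))         # escaped, so also literal
    print(re.search("$5.00", text) is not None)  # unescaped: '$' means end of string
```

The unescaped call silently searches for something else, which is the kind of bug the simpler tool avoids.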

Bottom line: it is a tool. Use it like any other tool. Don't overuse it. Don't let it be the only tool in your toolkit.

manojlds answered Oct 08 '22 06:10