
Are regexes really maintainable?

Any code I've seen that uses Regexes tends to use them as a black box:

  1. Put in string
  2. Magic Regex
  3. Get out string

This doesn't seem like a particularly good idea for production code, as even a small change can often result in a completely different regex.

Apart from cases where the standard is permanent and unchanging, are regexes the way to do things, or is it better to try different methods?

Rich Bradshaw asked Sep 29 '08 21:09



5 Answers

If regexes are long and impenetrable, making them hard to maintain, then they should be commented.

A lot of regex implementations allow you to pad regexes with whitespace and comments.
See https://www.regular-expressions.info/freespacing.html#parenscomment
and Coding Horror: Regular Expressions: Now You Have Two Problems
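
As a rough sketch of what free-spacing mode looks like in practice, here is a Python version using the standard `re.VERBOSE` flag. The date pattern and the name `ISO_DATE` are made up for illustration and do not come from the linked articles.

```python
import re

# Hypothetical example pattern: an ISO-style date such as 2008-09-29.
# Under re.VERBOSE, unescaped whitespace and "#" comments in the pattern are ignored.
ISO_DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month (digits only, range is not checked)
    -
    (?P<day>\d{2})     # two-digit day (digits only, range is not checked)
    """,
    re.VERBOSE,
)

match = ISO_DATE.fullmatch("2008-09-29")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))
```

The named groups double as documentation: the calling code asks for `"year"` instead of group 1.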

"Any code I've seen that uses Regexes tends to use them as a black box:"

If by black box you mean abstraction, that's what all programming is: trying to abstract away the difficult part (parsing strings) so that you can concentrate on the problem domain (what kind of strings do I want to match).

"even a small change can often result in a completely different regex."

That's true of any code. As long as you are testing your regex to make sure it matches the strings you expect, ideally with unit tests, you should be confident in changing it.
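
To make that concrete, here is a minimal sketch of such tests in Python with the standard `unittest` module. The pattern `HEX_COLOUR` and the sample strings are hypothetical; the point is only that both the strings you expect to match and the ones you expect to reject are written down next to the regex.

```python
import re
import unittest

# Hypothetical pattern under test: a simplified hex colour such as #1a2B3c.
HEX_COLOUR = re.compile(r"#[0-9a-fA-F]{6}")


class HexColourTest(unittest.TestCase):
    def test_accepts_expected_strings(self):
        for text in ("#000000", "#1a2B3c", "#FFFFFF"):
            self.assertIsNotNone(HEX_COLOUR.fullmatch(text), text)

    def test_rejects_unexpected_strings(self):
        for text in ("", "000000", "#12345", "#1234567", "#GGGGGG"):
            self.assertIsNone(HEX_COLOUR.fullmatch(text), text)
```

Saved in a test module, this can be run with `python -m unittest`. If the regex later has to change, the failing cases show exactly which inputs behave differently.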

Edit: please also read Jeff's comment to this answer about production code.

Sam Hasler answered Oct 05 '22 19:10


Obligatory.

It really comes down to the regex. If it's this huge monolithic expression, then yes, it's a maintainability problem. If you can express them succinctly (perhaps by breaking them up), or if you have good comments and tools to help you understand them, then they can be a powerful tool.
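
One way to break a regex up, sketched here in Python, is to name the pieces and compose them. The log format and all names below (`DATE`, `TIME`, `LEVEL`, `LOG_LINE`, `parse_log_line`) are invented for the example.

```python
import re

# Hypothetical log format, e.g. "2008-09-29 21:09:00 ERROR disk full".
# Each piece is small enough to read and test on its own.
DATE = r"\d{4}-\d{2}-\d{2}"
TIME = r"\d{2}:\d{2}:\d{2}"
LEVEL = r"DEBUG|INFO|WARNING|ERROR"

# The full pattern is assembled from the named pieces.
LOG_LINE = re.compile(
    rf"(?P<date>{DATE}) (?P<time>{TIME}) (?P<level>{LEVEL}) (?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields of a log line as a dict, or None if it does not match."""
    match = LOG_LINE.fullmatch(line)
    return match.groupdict() if match else None


print(parse_log_line("2008-09-29 21:09:00 ERROR disk full"))
```

A change to, say, the time format then touches one short, named line instead of the middle of a monolithic expression.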

Joel Coehoorn answered Oct 05 '22 19:10


I don't know which language you're using, but Perl - for example - supports the x flag, so spaces are ignored in regexes unless escaped, so you can break it into several lines and comment everything inline:

$foo =~ m{
    (some-thing)             # matches something
    \s*                      # matches any amount of whitespace
    (match\ another\ thing)  # matches something else (literal spaces must be escaped under /x)
}x;

This helps make long regexes more readable.

jkramer answered Oct 05 '22 21:10


It only seems like magic if you don't understand the regex. Any number of small changes in production code can cause major problems, so in my opinion that is not a good reason to avoid regexes. Thorough testing should point out any problems.

DMKing answered Oct 05 '22 19:10


Small changes to any code in any language can result in completely different results. Some of them even prevent compilation.

Substitute regex with "C" or "C#" or "Java" or "Python" or "Perl" or "SQL" or "Ruby" or "awk" or ... anything, really, and you get the same question.

Regex is just another language, Huffman coded to be efficient at string matching. Just like Java, Perl, PHP, or especially SQL, each language has strengths and weaknesses, and you need to know the language you're writing in when you're writing it (or maintaining it) to have any hope of being productive.

Edit: Mike, regexes are Huffman coded in that common things to do are shorter than rarer things. A literal match of text is generally a single character (the one you want to match). Special characters exist, and the common ones are short. Special constructs, such as (?:), are longer. These are not the same things that would be common in general-purpose languages like Perl, C++, etc., so the Huffman coding was targeted at this specialisation.
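
A small Python illustration of that point, with made-up sample strings: the more common the operation, the fewer characters it costs in the pattern.

```python
import re

# Most common case: a literal character matches itself, one character each.
assert re.search(r"cat", "concatenate")

# Frequently needed classes and quantifiers cost one or two characters.
assert re.fullmatch(r"\d+", "2008")

# Rarer constructs are longer, such as the non-capturing group (?:...).
match = re.fullmatch(r"(?:ab)+(c)", "ababc")
assert match and match.groups() == ("c",)
```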

Tanktalus answered Oct 05 '22 19:10