
When is it best to use Regular Expressions over basic string splitting / substring'ing?

The choice between plain string parsing and regular expressions comes up regularly for me, any time I need part of a string, information about a string, and so on.

The reason this comes up now is that we're evaluating a SOAP header's action, after it has been parsed into something manageable via the OperationContext object for WCF, and then making decisions based on it. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
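To make the two options concrete, here is a minimal sketch of both approaches, written in Java to match the Java-style snippets in the answer further down (the C# equivalents would use `LastIndexOf`, `Substring` and `System.Text.RegularExpressions.Regex`). The action format `<namespace>/<Contract>/<Operation>`, the sample URI, and the names `SoapActionDemo`, `operationBySubstring` and `operationByRegex` are all assumptions made for illustration; the question does not say what the real action strings look like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SoapActionDemo {
    // Hypothetical action shape: "<namespace>/<Contract>/<Operation>".
    private static final Pattern ACTION = Pattern.compile("^.+/(\\w+)/(\\w+)$");

    // Substring approach: take whatever follows the last '/'.
    // It never rejects anything; input without a '/' is returned unchanged.
    static String operationBySubstring(String action) {
        int slash = action.lastIndexOf('/');
        return slash < 0 ? action : action.substring(slash + 1);
    }

    // Regex approach: checks the overall shape while extracting.
    // Returns null when the action does not fit the assumed format.
    static String operationByRegex(String action) {
        Matcher m = ACTION.matcher(action);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String action = "http://example.org/IOrderService/SubmitOrder";
        System.out.println(operationBySubstring(action));
        System.out.println(operationByRegex(action));
    }
}
```

The substring version is shorter and probably enough when the input is already trusted. The regex version earns its keep mainly if the shape of the action also needs to be validated.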

So I have to ask: what's the typical threshold people use when deciding to use RegEx over typical string parsing? Note that I'm not very strong in Regular Expressions, so I shy away from them unless absolutely vital, to avoid introducing more complication than I need.

If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.


EDIT: It seems that, as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to provide clues as to what I was doing, not to mislead people.

I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted accordingly.

Steven Raybell asked Dec 10 '08 22:12





1 Answer

My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.

One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.

For example, consider a tiny "expression language" for evaluating parenthesized arithmetic expressions. Examples of "programs" in this language would look like this:

    1 + 2
    5 * (10 - 6)
    ((1 + 1) / (2 + 2)) / 3

A grammar is easy to write, and looks something like this:

    DIGIT := ["0"-"9"]
    NUMBER := (DIGIT)+
    OPERATOR := ("+" | "-" | "*" | "/" )
    EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    GROUP := "(" EXPRESSION ")"

With that grammar, you can build a recursive descent parser in a jiffy.
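As a rough illustration of how directly the grammar maps to code, here is a minimal recursive descent sketch with one method per rule. The class name `ExprParser`, the choice to skip spaces between tokens, and the use of `double` for results are assumptions, not part of the grammar above. Note also that the grammar as written encodes no operator precedence or left associativity, so this sketch applies operators right to left: `10 - 6 - 2` comes out as `10 - (6 - 2)`.

```java
class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) { this.src = src; }

    // Entry point: parse the whole input or throw IllegalArgumentException.
    static double evaluate(String input) {
        ExprParser p = new ExprParser(input);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) throw p.error("unexpected trailing input");
        return value;
    }

    // EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    private double expression() {
        skipSpaces();
        double left = peek() == '(' ? group() : number();
        skipSpaces();
        char op = peek();
        if (op == '+' || op == '-' || op == '*' || op == '/') {
            pos++;                        // consume the operator
            double right = expression();  // recursion mirrors the grammar rule
            switch (op) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                default:  return left / right;
            }
        }
        return left;
    }

    // GROUP := "(" EXPRESSION ")"
    private double group() {
        pos++;                            // consume "("
        double value = expression();
        skipSpaces();
        if (peek() != ')') throw error("expected ')'");
        pos++;                            // consume ")"
        return value;
    }

    // NUMBER := (DIGIT)+
    private double number() {
        int start = pos;
        while (peek() >= '0' && peek() <= '9') pos++;
        if (start == pos) throw error("expected a number");
        return Double.parseDouble(src.substring(start, pos));
    }

    // Returns the current character, or '\0' at end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void skipSpaces() {
        while (peek() == ' ') pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }
}
```

Calling `ExprParser.evaluate("5 * (10 - 6)")` walks `expression`, then `group`, then `expression` again, which is exactly the nesting a plain regular expression has no natural way to follow.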

An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.

Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for context-free grammars and recursive descent parsers.


Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.

I interpreted it as "when should you use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"

Given that interpretation, my answer is: never.


Okay.... one more edit.

I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)

I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:

    if (str.equals("DooWahDiddy")) // No problemo.
    if (str.contains("destroy the earth")) // Okay.
    if (str.indexOf(";") < str.length() / 2) // Not bad.

Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.

    if (str.startsWith("I") && str.endsWith("Widget") &&
        (!str.contains("Monkey") || !str.contains("Pox")))  // Madness.
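For comparison, here is one way the same rule might be written as a single regular expression, shown in Java next to the predicate version so the two can be checked against each other. The class name `WidgetCheck`, the method names, and the exact pattern are illustrative assumptions; other equivalent patterns are possible.

```java
import java.util.regex.Pattern;

class WidgetCheck {
    // (?s)                          let '.' match line breaks, like contains() does
    // (?!(?=.*Monkey)(?=.*Pox))     fail if the input contains both "Monkey" and "Pox"
    // I.*Widget                     must start with "I" and end with "Widget"
    private static final Pattern RULE =
            Pattern.compile("(?s)(?!(?=.*Monkey)(?=.*Pox))I.*Widget");

    // The multi-predicate version of the rule.
    static boolean byPredicates(String str) {
        return str.startsWith("I") && str.endsWith("Widget")
                && (!str.contains("Monkey") || !str.contains("Pox"));
    }

    // The same rule as one pattern; matches() requires the whole input to match.
    static boolean byRegex(String str) {
        return RULE.matcher(str).matches();
    }
}
```

Whether the pattern is easier to read than the predicates is a matter of taste, but it does keep the whole rule in one declarative place that can be tested and changed on its own.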

Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).

Here's a great reference:

http://www.regular-expressions.info/

PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.

7 revs answered Oct 12 '22 16:10
