Using alternation or character class for single character matching?

Tags: regex, perl

(Note: The title doesn't seem too clear. If someone can rephrase it, I'm all for it!)

I have this regex, (.*_e\.txt), which matches some filenames, and I need it to accept some other single-character suffixes in addition to the e. Should I use a character class or an alternation for this? (Or does it really matter?)

That is, which of the following two seems "better", and why:

a) (.*(e|f|x)\.txt), or

b) (.*[efx]\.txt)

Martin Ba asked Jan 18 '11 13:01

1 Answer

Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.
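As a quick check that the two forms accept the same filenames, here is a minimal sketch using Python's re module (the question is tagged perl, but both patterns use only syntax that the two engines share). The file names are made up for illustration.

```python
import re

# Hypothetical file names, chosen only for illustration.
names = ["report_e.txt", "report_f.txt", "report_x.txt",
         "report_g.txt", "report_e.csv"]

alternation = re.compile(r"(.*(e|f|x)\.txt)")
char_class = re.compile(r"(.*[efx]\.txt)")

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [name for name in candidates if pattern.fullmatch(name)]

alt_hits = matching(alternation, names)
cls_hits = matching(char_class, names)

print(alt_hits == cls_hits)                    # both accept the same names
print(alternation.groups, char_class.groups)   # number of capturing groups
```

One difference that is easy to miss: the parentheses in (e|f|x) create a second capturing group, so any later groups are renumbered. Writing (?:e|f|x) avoids that, but [efx] is still shorter.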

I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.
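If you would rather measure than guess, a rough timing sketch along these lines could help. It uses Python's re and timeit, so it says nothing definite about Perl's engine; the subject string and repeat count are arbitrary assumptions, and the numbers will vary by engine, version, and input.

```python
import re
import timeit

alternation = re.compile(r"(.*(e|f|x)\.txt)")
char_class = re.compile(r"(.*[efx]\.txt)")

# Hypothetical subject string, used only for this measurement.
subject = "some_long_file_name_x.txt"

def time_pattern(pattern, text, repeats=20000):
    """Return the total seconds taken to run pattern.match(text) `repeats` times."""
    return timeit.timeit(lambda: pattern.match(text), number=repeats)

alt_seconds = time_pattern(alternation, subject)
cls_seconds = time_pattern(char_class, subject)

print(f"alternation:     {alt_seconds:.4f}s")
print(f"character class: {cls_seconds:.4f}s")
```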

My reasoning (without ever having written a regex engine, so this is pure conjecture):

The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"

(a|b|c), however, tells the regex engine to:

  • remember the current position in the string for backtracking, if necessary
  • check if it's possible to match a. If so, success. If not:
  • check if it's possible to match b. If so, success. If not:
  • check if it's possible to match c. If so, success. If not:
  • give up.
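The conjecture above can be turned into a toy model that counts how many checks each form makes at a single position. This is only a sketch of the reasoning in this answer, written in Python; the function names are made up, and real engines may compile or optimize both forms differently.

```python
def match_class(chars, text, pos):
    """Character class: one membership test against the next character."""
    steps = 1
    ok = pos < len(text) and text[pos] in chars
    return ok, steps

def match_alternation(branches, text, pos):
    """Alternation: remember the position, then try each branch in order."""
    saved = pos              # remembered so each branch restarts here
    steps = 0
    for branch in branches:
        steps += 1           # one attempt per branch
        if text.startswith(branch, saved):
            return True, steps
    return False, steps      # every branch failed: give up

for ch in "abcd":
    print(ch, match_class("abc", ch, 0), match_alternation(["a", "b", "c"], ch, 0))
```

In this model the class always costs one check, while the alternation costs between one and three depending on which branch matches, if any.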
Tim Pietzcker answered Sep 24 '22 17:09