Difference between [abc] and (a|b|c) [duplicate]

Tags: regex, pcre

For PCRE Regular Expressions, what is the difference between [abc] and (a|b|c)?

user1032531 asked Oct 26 '25 10:10


1 Answer

The patterns in your question match the same text. They differ in implementation: they correspond to different automata and have different side effects, namely that (a|b|c) captures the matched substring in a group while [abc] does not.
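As a minimal sketch of that difference, the snippet below uses Python's re module as a stand-in for PCRE (an assumption; the two engines are not identical, but they should agree on patterns this simple). The sample string is made up for illustration.

```python
import re

# Python's re is used here in place of PCRE; assumed equivalent for these patterns.
char_class = re.compile(r"[abc]")
alternation = re.compile(r"(a|b|c)")

sample = "a1b2c3d4"  # hypothetical sample text

# Both patterns find the same substrings at the same positions.
class_spans = [m.span() for m in char_class.finditer(sample)]
alt_spans = [m.span() for m in alternation.finditer(sample)]
print(class_spans == alt_spans)  # True

# Only the parenthesized form defines a capture group.
print(char_class.groups, alternation.groups)  # 0 1

# A non-capturing group keeps the alternation but drops the capture.
non_capturing = re.compile(r"(?:a|b|c)")
print(non_capturing.groups)  # 0
```

If the capture is not needed, (?:a|b|c) is the closer counterpart to [abc].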

In a comment below, Garrett Albright points out a subtle distinction. Whereas (.|\n) matches any character, [.\n] matches only a literal dot or a newline. Although the dot is no longer special inside a character class, other characters such as -, ^, and ], along with sequences such as [:lower:], take on special meanings there. Care is necessary to preserve the intended semantics when moving from one context to the other, and sometimes it isn’t possible: outside a character class, \1 is a backreference to the first capture group (Perl also accepts it as an archaic way of writing $1 in a replacement), but inside a character class, \1 always matches the character SOH (code 1).
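The two pitfalls in this paragraph can be checked with a short sketch. It again uses Python's re module rather than PCRE itself, on the assumption that both treat the dot and \1 the same way inside and outside a character class; the test strings are illustrative only.

```python
import re

# Outside a class the dot is a wildcard; inside a class it is a literal dot.
any_char = re.compile(r"(.|\n)")
dot_or_newline = re.compile(r"[.\n]")

print(any_char.fullmatch("x") is not None)         # True
print(dot_or_newline.fullmatch("x") is not None)   # False
print(dot_or_newline.fullmatch(".") is not None)   # True
print(dot_or_newline.fullmatch("\n") is not None)  # True

# Outside a class, \1 refers back to whatever group 1 matched.
backref = re.compile(r"(a)\1")
print(backref.fullmatch("aa") is not None)         # True

# Inside a class, \1 is read as an octal escape: the SOH character (code 1).
soh_class = re.compile(r"(a)[\1]")
print(soh_class.fullmatch("a\x01") is not None)    # True
print(soh_class.fullmatch("aa") is not None)       # False
```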

Character classes ([...]) are optimized for matching one character out of a set, whereas alternatives (x|y) allow more general choices among subpatterns of varying lengths. You will tend to see better performance if you keep these design principles in mind. Regex implementations transform source code such as /[abc]/ into finite-state automata, usually NFAs. What we think of as regex engines are more or less bookkeepers that assist execution of those target state machines. A sufficiently smart regex compiler would generate the same machine code for equivalent regexes, but this is difficult and expensive in the general case because of the lurking exponential complexity.
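A rough sketch of the two use cases, again in Python's re module rather than PCRE (an assumption). The word list, the subject string, and the helper name seconds are hypothetical. No timing result is claimed: numbers vary by engine and version, and an engine may rewrite a single-character alternation into a class internally.

```python
import re
import timeit

# Alternatives may have different lengths, which a character class cannot express.
words = re.compile(r"cat|horse|ox")
found = words.findall("an ox, a cat and a horse")
print(found)  # ['ox', 'cat', 'horse']

# A class built from the same letters matches single characters only.
letters = re.compile(r"[cathorsex]")
single = letters.findall("ox")
print(single)  # ['o', 'x']

# Rough timing of the two single-character forms on one subject string.
subject = "x" * 1000 + "c"  # hypothetical subject: the match is at the very end


def seconds(pattern):
    """Time 2000 searches of `subject` with the compiled pattern."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.search(subject), number=2000)


class_time = seconds(r"[abc]")
alt_time = seconds(r"(?:a|b|c)")
print(f"[abc]: {class_time:.4f}s  (?:a|b|c): {alt_time:.4f}s")
```

Measuring on your own engine and data is the only reliable way to compare the two forms.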

For an accessible introduction to the theory behind regexes, read “How Regexes Work” by Mark Dominus. For deeper study, consider An Introduction to Formal Languages and Automata by Peter Linz.

Greg Bacon answered Oct 29 '25 02:10