I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so?
What features of modern regular expressions are not regular? Examples would be helpful.
Regular expressions are used to denote regular languages. They can represent regular languages and operations on them succinctly. The set of regular expressions over an alphabet is defined recursively as below. Any element of that set is a regular expression.
Despite being hard to read, hard to validate, hard to document and notoriously hard to master, regexes are still widely used today. Supported by all modern programming languages, text processing programs and advanced text editors, regexes are now used in more than a third of both Python and JavaScript projects.
Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code.
The first thing that comes to mind is backreferences:
(\w*)\s\1
(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello
matches, hello world
doesn't.
This construct is not regular (ie: can't be generated by a regular grammar).
Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:
\((a*|(?R))*\)
This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With