
Why does C++11 support 6 different regular expression grammars?

Tags: c++, regex, std, c++11

It appears that C++11 supports a whopping six different regular expression grammars:

  • ECMA-262 (ECMAScript) regular expressions (slightly modified?)
  • Basic POSIX regular expressions
  • Extended POSIX regular expressions
  • awk regular expressions
  • grep regular expressions
  • egrep regular expressions

Why was it decided to include so many options instead of settling on a single grammar? Why these particular 6?

rkjnsn asked Mar 17 '12 01:03




2 Answers

The standardization process is all about pragmatism. There are benefits to including a RE grammar in the standard, as long as it's correctly specified, but no benefit to dropping one.

Exclusion would make it easier for a library implementer to apply a "100% C++11 compliant" badge, but who really cares? Nobody should be making that claim anyway, and only ignorant PHBs would be looking for it. Libraries always have bugs which prevent reaching 100%, and a good library has an excess of features.

Note that all the included grammars are specified by already existing international standards, so little effort was needed on the part of the C++ committee: just §28.13, which is a couple of pages long.

If they leave out a standardized grammar, then different Standard Library implementers will add it under different names, resulting in incompatibility. This is unlikely to happen for a grammar which is merely defined by a popular library, where the library implementer will be responsible for the C++ interface, not Standard Library vendors.

Potatoswatter answered Sep 26 '22 22:09


This is covered by the TR1 proposal. I will attempt to summarize.

The authors considered it prudent to build on an existing standard rather than to strike out on their own.

Two existing standards that they could build upon were identified: POSIX REs and ECMAScript REs. Perl REs were left out because they aren’t standardized. (Which reasonable people could disagree with.) Also, ECMAScript REs were seen as a simpler subset of Perl REs that covers the most useful (or perhaps most used) features.

Of the two, POSIX REs’ “leftmost longest” implementation did not play well with important features, like non-greedy repeats, and was at odds with how most RE engines work these days.

On the other hand, ECMAScript REs lacked the localization support of POSIX REs. So, they extended ECMAScript REs to include POSIX-RE-style localization support.

POSIX RE support was included as an option since its behavior is different enough from ECMAScript REs to justify it being a standard option. The POSIX standard comes with two grammars: basic and extended. The awk, grep, and egrep REs are all just trivial variations of the basic or extended POSIX grammars rather than truly separate grammars.

So: Two standards, three grammars, six variations.

Robert Fisher answered Sep 24 '22 22:09