
Is that a feature or a bug of clang c++11 std::regex_match?

I noticed that a regular expression containing two alternatives joined by | does not match a sample string if the first alternative is a prefix of the second (tested on clang 3.5 and clang 3.8):

std::regex_match("ab", std::regex("(ab|a)")) == true

but

std::regex_match("ab", std::regex("(a|ab)")) == false

I think true is logically correct in both cases.

Clang & OSX:

$ cat > test.cpp
#include <string>
#include <regex>
#include <iostream>

int main() {
    std::cout << std::regex_match("ab", std::regex("(a|ab)")) << std::endl;
    std::cout << std::regex_match("ab", std::regex("(ab|a)")) << std::endl;

    return 0;
}
^C

$ clang++ -v
Apple LLVM version 8.1.0 (clang-802.0.41)
Target: x86_64-apple-darwin16.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

$ clang++ ./test.cpp -o test

$ ./test 
0
1

Clang & FreeBSD:

$ cat > test.cpp
#include <string>
#include <regex>
#include <iostream>

int main() {
    std::cout << std::regex_match("ab", std::regex("(a|ab)")) << std::endl;
    std::cout << std::regex_match("ab", std::regex("(ab|a)")) << std::endl;

    return 0;
}
^C
$ clang++ -v
FreeBSD clang version 3.8.0 (tags/RELEASE_380/final 262564) (based on LLVM 3.8.0)
Target: x86_64-unknown-freebsd11.0
Thread model: posix
InstalledDir: /usr/bin
$ clang++ ./test.cpp -o test
$ ./test
0
1

Linux & GCC:

$ cat > test.cpp
#include <string>
#include <regex>
#include <iostream>

int main() {
    std::cout << std::regex_match("ab", std::regex("(a|ab)")) << std::endl;
    std::cout << std::regex_match("ab", std::regex("(ab|a)")) << std::endl;

    return 0;
}
^C

$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.1-2ubuntu1~16.04' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.1 20160904 (Ubuntu 5.4.1-2ubuntu1~16.04) 

$ g++ -std=gnu++11 ./test.cpp -o test

$ ./test 
1
1

Max Fomichev asked Apr 18 '17 15:04


1 Answer

ECMAScript (the default regex syntax) attempts to match the alternatives in order, stopping at the first success, which means that in ordinary searching (a la regex_search) the regex a|ab never matches the whole ab; it always matches just the a part.
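To see the ordered-alternation behavior in isolation, here is a minimal sketch. The helper name first_match is made up for illustration and is not part of <regex>; it simply returns whatever std::regex_search reports as the whole match.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of <regex>): returns the text matched by
// std::regex_search, or an empty string if nothing matches.
std::string first_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);  // ECMAScript grammar is the default
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str();  // the whole match, i.e. m[0]
    }
    return std::string();
}
```

With this helper, first_match("a|ab", "ab") gives "a", because the first alternative succeeds and the second is never tried, while first_match("ab|a", "ab") gives "ab".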

The standard was ambiguous about what regex_match should do in this case, which led implementations to diverge. See LWG issue 2273 for the competing interpretations. The standard was eventually amended (see the resolution of that issue) so that regex_match considers only potential matches that cover the entire input sequence, as the example added to the standard shows:

std::regex re("Get|GetValue");
std::cmatch m;
regex_search("GetValue", m, re);    // returns true, and m[0] contains "Get"
regex_match ("GetValue", m, re);    // returns true, and m[0] contains "GetValue"

The original <regex> implementation in libc++ used the other interpretation, however, and it simply wasn't updated to match the resolution until recently. Clang 4.0 now prints 1 1.
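If you are stuck on a libc++ that predates the fix, one possible workaround is to anchor the pattern yourself and call regex_search, so that a shorter alternative fails at the end anchor and the matcher backtracks into the next one. This is only a sketch: full_match is a hypothetical helper name, it assumes the pattern is valid ECMAScript syntax and the input contains no line breaks, and I have not verified it against every libc++ release.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` can match all of `text`.
// Assumption: `pattern` is ECMAScript syntax and `text` has no line breaks.
bool full_match(const std::string& pattern, const std::string& text) {
    // The non-capturing group keeps the anchors outside the alternation,
    // so "a|ab" becomes "^(?:a|ab)$" rather than "^a|ab$".
    const std::regex re("^(?:" + pattern + ")$");
    return std::regex_search(text, re);
}
```

Here full_match("a|ab", "ab") and full_match("ab|a", "ab") should both be true, since the a alternative cannot satisfy the $ anchor after one character and the matcher falls back to ab.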

T.C. answered Oct 20 '22 10:10