 

Do C++11 regular expressions work with UTF-8 strings?

If I want to use C++11's regular expressions with Unicode strings, will they work with char* as UTF-8, or do I have to convert them to a wchar_t* string?

Mark asked Jun 28 '12 23:06



1 Answer

You will need to test this with your own compiler and standard library, but in theory it is supported if your system has a UTF-8 locale. The following test returned true for me with Clang on OS X.

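#include <locale>
#include <regex>
#include <string>

bool test_unicode()
{
    // Remember the current global locale so it can be restored afterwards.
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    // The regex picks up the global locale when it is constructed.
    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    bool result = std::regex_match(std::string("abcdéfg"), pattern);

    std::locale::global(old);

    return result;
}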

NOTE: This was compiled in a file that was UTF-8 encoded.


Just to be safe, I also tried a string with the é written as explicit hex byte escapes. That worked too.

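#include <locale>
#include <regex>
#include <string>

bool test_unicode2()
{
    std::locale old;
    std::locale::global(std::locale("en_US.UTF-8"));

    std::regex pattern("[[:alpha:]]+", std::regex_constants::extended);
    // The literal is split after \xA9 so that the following 'f' is not read as part of the hex escape.
    bool result = std::regex_match(std::string("abcd\xC3\xA9""fg"), pattern);

    std::locale::global(old);

    return result;
}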
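Both tests above assume that the en_US.UTF-8 locale is installed. std::locale throws std::runtime_error when the named locale is not available, so on another system the test may fail before the regex is ever tried. Below is a minimal sketch of a more defensive variant. The names AlphaCheck and alpha_match_in_locale are hypothetical and do not appear in the tests above. It also imbues the locale into the regex instead of changing the global locale, which is a swapped-in technique that I expect to behave the same way but have not verified on every standard library.

```cpp
#include <locale>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical names for illustration only.
enum class AlphaCheck { LocaleUnavailable, NoMatch, Match };

// Reports whether `text` is matched in full by [[:alpha:]]+ under the named
// locale, or that the locale could not be constructed on this system.
AlphaCheck alpha_match_in_locale(const char* locale_name, const std::string& text)
{
    std::locale loc;
    try {
        // Throws std::runtime_error if the locale name is not available.
        loc = std::locale(locale_name);
    } catch (const std::runtime_error&) {
        return AlphaCheck::LocaleUnavailable;
    }

    std::regex pattern;
    // imbue() resets the regex, so the pattern is assigned afterwards.
    pattern.imbue(loc);
    pattern.assign("[[:alpha:]]+", std::regex_constants::extended);

    return std::regex_match(text, pattern) ? AlphaCheck::Match : AlphaCheck::NoMatch;
}
```

Calling alpha_match_in_locale("en_US.UTF-8", "abcd\xC3\xA9""fg") corresponds to test_unicode2(). Whether it reports a match still depends on the compiler, the standard library, and the installed locales, so treat the result as something to check on your own system.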

Update: test_unicode() still works for me.

$ file regex-test.cpp
regex-test.cpp: UTF-8 Unicode c program text
$ g++ --version
Configured with: --prefix=/Applications/Xcode-8.2.1.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode-8.2.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Jeffery Thomas answered Sep 21 '22 06:09
