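#include <boost/regex.hpp>
#include <string>
#include <vector>
#include <iostream>

int main(int argc, char* argv[]) {
    // Expect the text to search and the pattern as the two arguments.
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <text> <pattern>" << std::endl;
        return 1;
    }
    std::string text = argv[1];
    std::string pattern = argv[2];
    boost::regex regex(pattern);
    boost::smatch match;
    // Prints 1 if the pattern is found anywhere in the text, 0 otherwise.
    std::cout << boost::regex_search(text, match, regex) << std::endl;
}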
If I run the program on the input hello¿ ¿
(containing a non-ASCII character encoded as UTF-8), it prints 0,
i.e. not found, but if I run it on the input hel√ √ (again containing a non-ASCII character), it prints 1, i.e. found.
My question: what is the expected behavior of boost::regex
(i.e. the ASCII version) when run on UTF-8 characters?
Edit: Thanks for all the comments. I am still interested in why exactly 1 is output, since both the text and the regex contain non-ASCII characters. My guess is that the bytes are interpreted as ASCII and thus they match.
UTF-8 is a variable-width encoding of Unicode built from 8-bit code units. It is a superset of ASCII: every ASCII character, printable or not, is encoded as the same single byte, and every other Unicode code point is encoded as a sequence of two or more bytes, so UTF-8 is not limited to 256 characters.
For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes (the original design allowed up to 6; RFC 3629 restricts it to 4), though most Western European characters require only 2 bytes.
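To make the byte counts concrete, here is a minimal sketch that reads the sequence length off the lead byte and counts code points in a string. The function names utf8_sequence_length and count_code_points are made up for illustration, and the code assumes well-formed UTF-8 (it does not reject overlong or otherwise invalid sequences).

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`.
// Returns 0 for a continuation byte (10xxxxxx) or a byte that cannot start a sequence.
// Assumption: input is well-formed; overlong encodings are not detected.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count code points by counting every byte that is not a continuation byte.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the strings from the question, "hello¿" is 7 bytes but 6 code points (¿ is the two bytes C2 BF), and "hel√" is 6 bytes but 4 code points (√ is the three bytes E2 88 9A).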
There's no difference between ASCII and UTF-8 when storing digits. A tighter packing would be to use 4 bits per digit (BCD). If you want to go below that, you need to take advantage of the fact that long sequences of base-10 values can be represented as base-2 (binary) values.
The regular expression represents all printable ASCII characters. ASCII code is the numerical representation of the characters, and the ASCII table extends from NUL (null) to DEL. The printable characters extend from code 32 (space) to code 126 (tilde, ~).
Using regular expressions on ASCII strings means searching for a pattern in a sequence of bytes.
Using regular expressions on UTF-8 strings means matching against multi-byte sequences, where each sequence represents one Unicode code point.
The regular expression is therefore applied to a Unicode string whose encoding uses a variable number of bytes per character.
UTF-8 strings contain sequences of 1 to 4 bytes, each representing one Unicode "character".
In UTF-8, only the 7-bit ASCII characters are 1 byte "wide".
So using an ASCII regular expression engine on a UTF-8 encoded string ignores the multi-byte sequences and matches the pattern byte by byte. The results of using such an engine on a UTF-8 encoded string are unreliable.
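A small sketch of what "byte by byte" means in practice. It uses std::regex over std::string rather than boost::regex so that it needs no extra library; I am assuming the narrow-character boost::regex behaves the same way in this respect. The helper names are made up for illustration.

```cpp
#include <regex>
#include <string>

// UTF-8 bytes spelled out so the example does not depend on the source file encoding.
const std::string e_acute = "\xC3\xA9";             // "é", 2 bytes
const std::string hel_sqrt = "hel\xE2\x88\x9A";     // "hel√", 6 bytes
const std::string sqrt_sign = "\xE2\x88\x9A";       // "√", 3 bytes

// True if the whole text matches the pattern; every char (byte) counts as one character.
bool matches_whole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// True if the pattern occurs anywhere in the text, compared byte by byte.
bool contains_match(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

With these helpers, matches_whole(e_acute, ".") is false while matches_whole(e_acute, "..") is true, because the engine sees two one-byte "characters" where a reader sees one. contains_match(hel_sqrt, sqrt_sign) is true, since the three pattern bytes appear verbatim in the text. A literal match like that happens to work; anything that depends on what counts as one character (".", character classes, quantifiers applied to a non-ASCII character) does not.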
Please take a look at http://utfcpp.sourceforge.net.
To get regular expressions working on UTF-8 encoded strings, you have to …
std::codecvt_utf8
in combination with temporarily setting the global locale to get the regular expression working, or …
The regex_search function returns a boolean, which is true on a match.
In your case the ASCII regular expression pattern matches part of the UTF-8 encoded string, which is processed as if it were an ASCII string, just as you assumed!
If you have English text in a UTF-8 encoded string, an ASCII regular expression engine can be used safely. Once the text leaves the 7-bit ASCII range, the results of an ASCII regular expression engine become unreliable.
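One possible way to get per-character matching is to decode the UTF-8 bytes into code points first and run a wide-character regex over the result. The sketch below uses a hand-written decoder instead of std::codecvt_utf8 (which is deprecated since C++17) and std::wregex instead of Boost. It assumes wchar_t is at least 32 bits wide, which holds on Linux and macOS but not on Windows, and the names decode_utf8 and matches_code_points are made up for illustration.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Decode UTF-8 into one wchar_t per code point.
// Assumption: wchar_t holds a full code point (32 bits). Overlong encodings
// and surrogate values are not rejected in this sketch.
std::wstring decode_utf8(const std::string& bytes) {
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// True if the decoded text as a whole matches the wide pattern,
// so "." consumes one code point rather than one byte.
bool matches_code_points(const std::string& utf8_text, const std::wstring& pattern) {
    return std::regex_match(decode_utf8(utf8_text), std::wregex(pattern));
}
```

Here matches_code_points("hel\xE2\x88\x9A", L"hel.") is true, because √ is now a single element of the string. For serious Unicode work (case folding, character properties, normalization) a regex engine with real Unicode support is still the better choice; this only fixes the "one character is several bytes" problem.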