
Running an ASCII regex over non-ASCII characters encoded as UTF-8

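#include <boost/regex.hpp>

#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    // Expect two arguments: the text to search and the pattern.
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <text> <pattern>" << std::endl;
        return 1;
    }

    std::string text = argv[1];
    std::string pattern = argv[2];

    // boost::regex is the narrow-char engine: it works on single chars (bytes).
    boost::regex regex(pattern);
    boost::smatch match;

    // Prints 1 if the pattern is found anywhere in the text, 0 otherwise.
    std::cout << boost::regex_search(text, match, regex) << std::endl;
}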

If I run the program with the text hello¿ and the pattern ¿ (a non-ASCII character in UTF-8 encoding), it prints 0, i.e. not found. But if I run it with the text hel√ and the pattern √ (again non-ASCII), it prints 1, i.e. found.

My question: What is the expected behavior of boost::regex (i.e. the ASCII version) when run over UTF-8 encoded characters?


Edit: Thanks for all the comments. I am still interested in why exactly 1 is output, since both the text and the regex contain non-ASCII characters. My guess is that the bytes are interpreted as ASCII and thus they match.

user695652 asked Jun 02 '16 18:06




1 Answer

  1. Using regular expressions on ASCII strings means searching for a pattern in a sequence of bytes, where each byte is one character.
    Using regular expressions on UTF-8 strings means matching against "multi-byte" sequences, where each sequence represents one Unicode code point.

    Thus the regular expression is applied to a Unicode string whose encoding uses a variable number of bytes per character.

    UTF-8 strings consist of sequences of 1 to 4 bytes, each representing one Unicode "character". In UTF-8, only the 7-bit ASCII characters are 1 byte "wide".

    So using an ASCII regular expression engine on a UTF-8 encoded string ignores the multi-byte sequences and matches the pattern byte by byte. The results of using an ASCII regular expression engine on a UTF-8 encoded string are therefore not reliable.

    Please take a look at http://utfcpp.sourceforge.net.

    To get regular expressions working on UTF-8 encoded strings, you have to …

    • have UTF-8 string iterators that are usable with the regular expression engine, or
    • use std::codecvt_utf8 in combination with temporarily setting the global locale, or
    • convert the UTF-8 encoded string into a UTF-16 encoded string (based on std::wstring) for use with a Unicode regular expression engine.
  2. The regex_search function returns a boolean, which is true on a match.
    In your case the ASCII regular expression pattern matches a part of the UTF-8 encoded string, which is wrongly treated as an ASCII string, as you assumed!
    If a UTF-8 encoded string contains only English text, i.e. only 7-bit ASCII characters, an ASCII regular expression engine can be used safely. Once the text leaves the 7-bit ASCII range, the result of the ASCII regular expression engine becomes unreliable.
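The difference between byte-wise and code-point-wise matching can be made visible with a small sketch. It uses std::regex and std::wregex from the standard library instead of boost::regex, only so that it builds without extra libraries. The names decode_utf8, dot_matches_bytewise and dot_matches_decoded are hypothetical helpers invented for this illustration, and the decoder assumes well-formed UTF-8 and a wchar_t wide enough for the code points used (32 bits on typical Linux and macOS builds). It decodes to one wchar_t per code point rather than to UTF-16, which is a simplification of the conversion approach listed above.

```cpp
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical helper: decode UTF-8 into one wchar_t per code point.
// Assumes well-formed input; performs no validation.
std::wstring decode_utf8(const std::string& bytes) {
    std::wstring out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Narrow engine: "." consumes exactly one char, i.e. one byte.
bool dot_matches_bytewise(const std::string& utf8) {
    static const std::regex one_unit(".");
    return std::regex_match(utf8, one_unit);
}

// Wide engine on decoded text: "." consumes one whole code point.
bool dot_matches_decoded(const std::string& utf8) {
    static const std::wregex one_unit(L".");
    return std::regex_match(decode_utf8(utf8), one_unit);
}
```

Called with the two bytes of ¿ ("\xC2\xBF"), dot_matches_bytewise returns false because the narrow engine sees two characters, while dot_matches_decoded returns true because the decoded string holds a single code point.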

Martin Lemburg answered Sep 19 '22 16:09
