
How to use boost::spirit to parse UTF-8?

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>

void parse_simple_string()
{
    namespace qi = boost::spirit::qi;    
    namespace encoding  = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;

    typedef std::wstring::const_iterator iterator_type;

    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t>  (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}

I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but the code above still fails to parse the Chinese words "你好".

The expected result is

12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP 你好

but the actual result is

12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP

The OS is Windows 7 64-bit, and my editor saves the source file as UTF-8.

asked Dec 03 '12 08:12 by StereoMatching




2 Answers

If your input is UTF-8, you can use the Unicode iterators from Boost.Regex.

For instance, use boost::u8_to_u32_iterator:

A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.


#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

    auto &&utf8_text=u8"你好,世界!";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));

    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}

Output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
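
The trailing &#0; in this output comes from the string literal's terminating NUL: utf8_text is bound to a char array, so end(utf8_text) points past the terminator and the NUL is parsed as one more character. A minimal sketch of that effect, using only the standard library (the names literal, array_length and text_length are illustrative, not from the answer):

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <cstring>
#include <iterator>

// UTF-8 bytes of "你好", written as escapes so the example does not depend
// on how the source file is encoded.
const char literal[] = "\xE4\xBD\xA0\xE5\xA5\xBD";

// Number of elements between std::begin and std::end of the array.
// This count includes the terminating NUL.
std::size_t array_length()
{
    return static_cast<std::size_t>(std::end(literal) - std::begin(literal));
}

// Number of bytes before the terminating NUL.
std::size_t text_length()
{
    return std::strlen(literal);
}
```

Here assert(array_length() == text_length() + 1) holds, so an iterator range built from begin/end of the array covers one byte more than the text itself. Building the end iterator from the text length instead would leave the NUL out of the parsed range.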
answered Sep 27 '22 01:09 by Evgeny Panasyuk


Although Evgeny Panasyuk's answer is correct, u8_to_u32_iterator can read past the end of the buffer if the input is not NUL-terminated. Consider the following example:

File foobar.cpp

#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>

int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};

    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};

    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}

When compiled with the command clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp and then run, Clang's AddressSanitizer reports a stack-buffer-overflow error. The error occurs because the last byte in the buffer is the leading byte of a 4-byte UTF-8 sequence, so the iterator keeps reading bytes past the end of the buffer.

If the last byte is NUL, as in const char contents[] = "Hello\xF1";, the iterator detects an encoding error when it reads the NUL character and stops reading further. The result is an uncaught exception instead of undefined behavior.

In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator, or you risk undefined behavior.
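
To illustrate the bounds check that is missing in the failing case, here is a minimal standard-library-only sketch. The helper decode_utf8_checked is hypothetical and not part of Boost. It compares the sequence length announced by the leading byte with the bytes actually left in the buffer before reading any continuation byte. It is a sketch only: it does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): decodes UTF-8 bytes into code
// points and never reads beyond bytes.size().
std::vector<std::uint32_t> decode_utf8_checked(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // total length of the sequence in bytes
        std::uint32_t cp = 0;  // code point being assembled

        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        // The bounds check: refuse a sequence that would run off the end.
        if (len > bytes.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this helper, decode_utf8_checked("\xE4\xBD\xA0\xE5\xA5\xBD") yields the code points 20320 and 22909 for "你好", while decode_utf8_checked("Hello\xF1") throws std::runtime_error because the 4-byte sequence announced by 0xF1 is cut off. A similar length check could be run on a buffer before handing it to an iterator that does not know where the buffer ends.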

answered Sep 24 '22 01:09 by mibu