How to use boost::spirit to parse UTF-8?

Tags: c++, unicode, utf-8, boost, boost-spirit

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>

void parse_simple_string()
{
    namespace qi = boost::spirit::qi;    
    namespace encoding  = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;

    typedef std::wstring::const_iterator iterator_type;

    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t>  (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}

I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but I still fail to parse the word "你好".

The expected results are:

12,3
ab,cd
G,G\"GG
kkk
10,\"0
99987
PPP
你好

but the actual results are:

12,3
ab,cd
G,G\"GG
kkk
10,\"0
99987
PPP

The Chinese word "你好" is missing from the output.

The OS is Windows 7 64-bit, and my editor saves the source file as UTF-8.


asked Dec 03 '12 08:12

StereoMatching

2 Answers

If you have UTF-8 at input, then you may try to use Unicode Iterators from Boost.Regex.

For instance, use boost::u8_to_u32_iterator:

A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.

live demo

#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

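    // Note: utf8_text is an array that includes the terminating NUL, so
    // end(utf8_text) points past it and the NUL is parsed as a code point too.
    // (In C++20, u8"" literals are char8_t arrays, so this needs C++11/14/17.)
    auto &&utf8_text=u8"你好，世界！";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));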

    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}

Output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
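To make it concrete what "looking like a sequence of UTF32 characters" means, here is a minimal sketch of the UTF-8 to code point conversion using only the standard library. It is not Boost's implementation: decode_utf8 is a hypothetical helper written for illustration, and it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string into
// UTF-32 code points. Simplified: overlong forms and surrogates are not rejected.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;   // total bytes in this sequence
        std::uint32_t cp;  // payload bits taken from the lead byte
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        // Check bounds before touching any continuation byte.
        if (len > bytes.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six payload bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Called on the six bytes "\xE4\xBD\xA0\xE5\xA5\xBD" (the UTF-8 encoding of "你好"), this yields the two code points 20320 and 22909, the same values that begin the output above.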


answered Sep 27 '22 01:09

Evgeny Panasyuk

Although Evgeny Panasyuk's answer is correct, using u8_to_u32_iterator can cause a buffer overflow if the input is not NUL-terminated. Consider the following example:

File foobar.cpp:

#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>

int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};

    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};

    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}

When this is compiled with clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp and then run, Clang's AddressSanitizer reports a stack-buffer-overflow error. The error occurs because the last byte in the buffer, \xF1, is the leading byte of a 4-byte UTF-8 sequence, so the iterator keeps reading the continuation bytes it expects after it and runs past the end of the buffer.

If the buffer ends with a NUL instead, as in const char contents[] = "Hello\xF1";, the iterator detects an encoding error when it reads the NUL character and stops reading further, so we get an uncaught exception instead of undefined behavior.

In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator, or you risk undefined behavior.
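When NUL-terminating the input is not an option, another approach is to check the tail of the buffer before building the iterators. The sketch below uses only the standard library. Both function names are hypothetical, and it only detects a sequence cut off at the end of the range, not other malformed input.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Boost): number of bytes a UTF-8 sequence
// should have, judging by its lead byte; 0 for a byte that cannot start one.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper: returns true when [first, last) stops in the middle of
// a multi-byte UTF-8 sequence, i.e. the last lead byte promises more bytes
// than the range actually contains.
bool ends_with_truncated_utf8(const char* first, const char* last)
{
    if (first == last)
        return false;

    // Walk back over trailing continuation bytes (10xxxxxx) to the last
    // lead byte, looking at no more than four bytes in total.
    const char* pos = last;
    std::size_t seen = 0;
    do {
        --pos;
        ++seen;
    } while (pos != first && seen < 4 &&
             (static_cast<unsigned char>(*pos) & 0xC0) == 0x80);

    const std::size_t expected =
        utf8_sequence_length(static_cast<unsigned char>(*pos));
    return expected > seen;
}
```

For the contents array from foobar.cpp, ends_with_truncated_utf8(contents, contents + sizeof(contents)) returns true, so the caller can reject the buffer before any iterator reads past it.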

answered Sep 24 '22 01:09

mibu
