#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>
void parse_simple_string()
{
    namespace qi = boost::spirit::qi;
    namespace encoding = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;
    typedef std::wstring::const_iterator iterator_type;
    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";
    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);
    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t> (std::wcout, L"\n"));
    for (auto const &data : result) std::wcout << data << std::endl;
}
I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but I still fail to parse the word "你好".
The expected result is
12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP 你好
but the actual result is 12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP
The Chinese word "你好" is not parsed.
The OS is Windows 7 64-bit, and my editor saves the file as UTF-8.
UTF-8 actually works quite well in std::string. Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII. Due to the way code points are encoded, looking for a code point cannot accidentally match the middle of another code point: str.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
UTF-8 is a way of encoding Unicode so that an ASCII text file encodes to itself. No wasted space, beyond the initial bit of every byte ASCII doesn't use. And if your file is mostly ASCII text with a few non-ASCII characters sprinkled in, the non-ASCII characters just make your file a little longer.
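To make the self-synchronizing property concrete, here is a minimal sketch using only the standard library. The helper names `find_utf8` and `count_code_points` are made up for illustration, and both assume the strings already hold valid UTF-8; nothing here validates the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the first occurrence of `needle`
// in `haystack`, or std::string::npos. A plain byte-wise search is enough
// because a complete UTF-8 sequence cannot match in the middle of another.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    assert(!needle.empty());
    return haystack.find(needle);
}

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

Note that `find_utf8` returns a byte offset, not a code point index, so it can be passed straight to `substr` but should not be confused with a character position.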
If your input is UTF-8, you can try the Unicode iterators from Boost.Regex.
For instance, use boost::u8_to_u32_iterator:
A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>
int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;
    auto &&utf8_text=u8"你好,世界!";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));
    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}
Output is:
&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
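A side note on the last element of `result`: it is the code point 0, because `begin`/`end` on a string literal's array also cover the terminating NUL. The following is only a sketch of how one might exclude it; `span_with_terminator` and `end_without_terminator` are hypothetical helper names, not part of Boost or the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical helper: number of chars that std::begin/std::end span over a
// character array. For a string literal this includes the terminating NUL.
template <std::size_t N>
std::size_t span_with_terminator(const char (&text)[N]) {
    return static_cast<std::size_t>(std::end(text) - std::begin(text));
}

// Hypothetical helper: an end pointer that stops before the terminating NUL.
// Assumes the array really is NUL-terminated (true for string literals).
template <std::size_t N>
const char* end_without_terminator(const char (&text)[N]) {
    assert(text[N - 1] == '\0');
    return std::end(text) - 1;
}
```

Building `tend` from a pointer like the one returned by `end_without_terminator` would presumably drop the trailing 0 from the decoded output, though the NUL still sits in memory right after the range.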
Although Evgeny Panasyuk's answer is correct, using u8_to_u32_iterator may be unsafe: it can read past the end of the buffer if the input string is not NUL-terminated. Consider the following example:
File foobar.cpp
#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>
int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};
    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};
    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}
When this is compiled with the command clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp and then run, Clang's AddressSanitizer reports a stack-buffer-overflow error. The error occurs because the last byte in the buffer is the lead byte of a 4-byte UTF-8 sequence, so the iterator keeps reading bytes past the end of the buffer.
If the last byte is NUL, as in const char contents[] = "Hello\xF1";, the iterator detects an encoding error when it reads the NUL character and stops reading further. The result is an uncaught exception instead of undefined behavior.
In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator, or you risk undefined behavior.
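If the input cannot be guaranteed to be NUL-terminated, another option is to decode with an explicit end pointer. The following is a hand-written sketch, not Boost's implementation or API: `decode_utf8_checked` is a hypothetical helper that never reads at or past `last` and throws on a truncated sequence. It is deliberately simplified and does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Boost): decodes the bytes in [first, last)
// as UTF-8 and returns the code points. Throws std::runtime_error on an
// invalid lead byte, an invalid continuation byte, or a truncated sequence.
std::vector<std::uint32_t> decode_utf8_checked(const char* first, const char* last) {
    assert(first <= last);
    std::vector<std::uint32_t> out;
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first++);
        std::uint32_t cp = 0;
        int trailing = 0;
        if (lead < 0x80)                { cp = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        // Bounds check before touching any continuation byte.
        if (last - first < trailing)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (int i = 0; i < trailing; ++i) {
            const unsigned char c = static_cast<unsigned char>(*first++);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Called as `decode_utf8_checked(contents, contents + sizeof(contents))` on the buffer from foobar.cpp, this sketch throws when it reaches the dangling `'\xF1'` instead of reading beyond the array.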