How to match unicode characters with boost::spirit?

Tags:

c++

parsing

boost

boost-spirit

How can I match UTF-8 encoded Unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н

When I try this simple boost::spirit program, it does not match the Unicode characters correctly:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::cin.unsetf(std::ios::skipws);
  boost::spirit::istream_iterator begin(std::cin);
  boost::spirit::istream_iterator end;

  std::vector<char> letters;
  bool result = qi::phrase_parse(
      begin, end,  // input     
      +qi::char_,  // match every character
      qi::space,   // skip whitespace 
      letters);    // result    

  BOOST_FOREACH(char letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

It behaves like this:

$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> 
<B2> <D0> <BE> <D0> <BB> <D0> <BD>

UPDATE:

Okay, I worked on this a bit more, and the following code is sort of working. It first wraps the input in an iterator that yields 32-bit Unicode code points (as recommended here):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::string str = "На берегу пустынных волн";
  boost::u8_to_u32_iterator<std::string::const_iterator>
      begin(str.begin()), end(str.end());
  typedef boost::uint32_t uchar; // a unicode code point
  std::vector<uchar> letters;
  bool result = qi::phrase_parse(
      begin, end,             // input
      +qi::standard_wide::char_,  // match every character
      qi::standard_wide::space,   // skip whitespace (same encoding as char_)
      letters);               // result
  BOOST_FOREACH(uchar letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

The code prints the Unicode code points:

$ ./a.out 
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085

which seems to be correct, according to the official Unicode table.

Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?

asked May 06 '12 21:05

Frank

1 Answer

I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g., the sexpr parser sample in the scheme demo:

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach¹, which specifically showcases:

wchar support
utree attributes (still experimental)
s-expressions

There is an online article about S-expressions and variant.

¹ If it is indeed that demo, here is the video from that presentation and the slides (pdf), as found here (odp).
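
As for the follow-up about printing the actual characters: a terminal that expects UTF-8 needs each code point re-encoded as UTF-8 bytes before it is written to std::cout. Below is a minimal sketch in standard C++ only. The names append_utf8 and to_utf8_spaced are hypothetical helpers made up for this illustration, not Boost or Spirit functions, and std::uint32_t from <cstdint> assumes a C++11 compiler (boost::uint32_t would serve the same purpose in older code).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): append the UTF-8 encoding of one
// Unicode code point to `out`.
void append_utf8(std::uint32_t cp, std::string& out) {
  // Surrogates and values above U+10FFFF are not valid code points;
  // substitute U+FFFD (the replacement character) for them.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

  if (cp < 0x80) {            // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: encode a sequence of code points as UTF-8, with a
// single space between consecutive characters.
std::string to_utf8_spaced(const std::vector<std::uint32_t>& cps) {
  std::string out;
  for (std::size_t i = 0; i < cps.size(); ++i) {
    if (i != 0) out += ' ';
    append_utf8(cps[i], out);
  }
  return out;
}
```

With the letters vector from the question, std::cout << to_utf8_spaced(letters) << std::endl; would write the characters themselves, provided the vector's element type matches and the terminal is set to UTF-8. The header boost/regex/pending/unicode_iterator.hpp also provides boost::u32_to_u8_iterator, the counterpart of u8_to_u32_iterator, which may be a better fit if you prefer to stay within Boost.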

answered Sep 22 '22 06:09

sehe
