
How to match unicode characters with boost::spirit?

How can I match UTF-8 encoded Unicode characters using boost::spirit?

For example, I want to recognize all characters in this string:

$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н

When I try this simple boost::spirit program, it does not match the Unicode characters correctly; it splits each character into its individual UTF-8 bytes:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::cin.unsetf(std::ios::skipws);
  boost::spirit::istream_iterator begin(std::cin);
  boost::spirit::istream_iterator end;

  std::vector<char> letters;
  bool result = qi::phrase_parse(
      begin, end,  // input     
      +qi::char_,  // match every character
      qi::space,   // skip whitespace 
      letters);    // result    

  BOOST_FOREACH(char letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

It behaves like this:

$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> 
<B2> <D0> <BE> <D0> <BB> <D0> <BD> 

UPDATE:

Okay, I worked on this a bit more, and the following code is sort of working. It first wraps the UTF-8 input in an iterator that yields 32-bit Unicode code points (as recommended here):

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>
namespace qi = boost::spirit::qi;

int main() {
  std::string str = "На берегу пустынных волн";
  boost::u8_to_u32_iterator<std::string::const_iterator>
      begin(str.begin()), end(str.end());
  typedef boost::uint32_t uchar; // a unicode code point
  std::vector<uchar> letters;
  bool result = qi::phrase_parse(
      begin, end,             // input
      +qi::standard_wide::char_,  // match every character
      qi::space,              // skip whitespace
      letters);               // result
  BOOST_FOREACH(uchar letter, letters) {
    std::cout << letter << " ";
  }
  std::cout << std::endl;
}

The code prints the Unicode code points:

$ ./a.out 
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085 

which seems to be correct, according to the official Unicode table.

Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?

Frank asked May 06 '12 21:05



1 Answer

I haven't got much experience with it, but apparently Spirit (SVN trunk version) supports Unicode.

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, e.g. the sexpr parser sample which is in the scheme demo.

BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach¹, which specifically showcases:

  • wchar support
  • utree attributes (still experimental)
  • s-expressions

There is an online article about S-expressions and variant.
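
On the follow-up question of printing the characters rather than the numbers: one option is to re-encode each code point as UTF-8 before writing it to `std::cout`. The sketch below does this by hand with no Boost dependency; `append_utf8` and `join_utf8` are names invented for this example, and the code assumes the input contains only valid code points (it does not reject surrogates or values above U+10FFFF). The same header used in the question also provides `boost::u32_to_u8_iterator`, which may be a better fit if you are already using Boost.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append the UTF-8 encoding of a single code point to `out`.
// Assumes `cp` is a valid Unicode scalar value (no validation is done).
void append_utf8(std::uint32_t cp, std::string& out) {
  if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Encode every code point and follow each one with a space,
// mirroring the output format used in the question.
std::string join_utf8(const std::vector<std::uint32_t>& code_points) {
  std::string out;
  for (std::size_t i = 0; i < code_points.size(); ++i) {
    append_utf8(code_points[i], out);
    out += ' ';
  }
  return out;
}
```

With the `letters` vector from the question copied into a `std::vector<std::uint32_t>`, `std::cout << join_utf8(letters) << std::endl;` should then show the letters themselves, provided the terminal expects UTF-8.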


¹ In case it is indeed based on that presentation, here are the video and the slides (pdf), as found here (odp).

sehe answered Sep 22 '22 06:09