Boost.Locale and isprint

I'm looking for a way to display a UTF-8 string with its nonprintable/invalid characters escaped. In the days of ASCII, I used isprint to decide whether a character should be printed as is or escaped. With UTF-8, iterating is more difficult, but Boost.Locale does this well. However, I didn't find anything in it to decide whether a character is printable, or even valid.

In the following source, the string "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff " contains a few bad guys: some are not printable (\b for instance) and others are plain invalid sequences (\xff\xff\xff). What test should I perform to decide whether a character is printable or not?

// Based on an example of Boost.Locale.
#include <boost/locale.hpp>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <string>

int main()
{
  using namespace boost::locale;
  using namespace std;

  generator gen;
  std::locale loc = gen("");
  locale::global(loc); 
  cout.imbue(loc);

  string text = "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff ";

  cout << text << endl;

  boundary::ssegment_index index(boundary::character, text.begin(), text.end());

  for (auto p: index)
    {
      cout << '['  << p << '|';
      for (uint8_t c: p)
        cout << std::hex << std::setw(2) << std::setfill('0') << int(c);
      cout << "] ";
    }
  cout << '\n';
}

When run, it gives

[H|48] [e|65] [l|6c] [l|6c] [o|6f] [ |20] [あ|e38182] [に|e381ab] [ま|e381be]
[ |20] [➦|e29ea6] [ |20] [👙|f09f9199] [ |20] [𝕫|f09d95ab]
[⊆|e28a86] [𝕢|f09d95a2] [ |20] [|02] [|01] |08] [ |20] [??? |ffffff20]

How should I decide that [|01] is not printable, and neither is [??? |ffffff20], but [o|6f] is, and so is [👙|f09f9199]? Roughly, the test should allow me to decide whether to print the left member of the [|]-pair, or the right one when not isprint.

Thanks

akim asked Oct 31 '14 14:10


1 Answer

Unicode defines properties for each code point, including a general category, and a technical report lists regex classifications (alpha, graph, etc.). The Unicode print classification includes tabs, whereas std::isprint (using the C locale) does not. print also includes letters, marks, numbers, punctuation, symbols, spaces, and formatting code points. The formatting code points do not include CR or LF, but do include code points that affect the appearance of neighboring characters. I believe this is exactly what you wanted (with the exception of the tab); the specification was designed carefully to support these character properties.

Most classification functions, like std::isprint, take a single scalar value at a time, so UTF-32 is the obvious encoding choice. Unfortunately, there is no guarantee that your system supports a UTF-32 locale, nor is it guaranteed that wchar_t has the 21 bits needed to hold every Unicode code point. Therefore, I would consider using boost::spirit::char_encoding::unicode for classification if you can. It has an internal table of all of the Unicode categories, and implements the classifications listed in the regex technical report. It looks like it uses an older Unicode 5.2 database, but the C++ code used to generate the tables is provided, and can be applied to the newer files.

The multi-byte UTF8 sequence will still need to be converted to individual codepoints (UTF32), and you specifically mentioned the ability to skip past invalid UTF8 sequences. Since I am a C++ programmer, I decided to unnecessarily spam your screen, and implement a constexpr UTF8->UTF32 function:

#include <cstdint>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <type_traits>
#include <boost/range/iterator_range.hpp>
#include <boost/spirit/home/support/char_encoding/unicode.hpp>

namespace {
struct multi_byte_info {
  std::uint8_t id_mask;
  std::uint8_t id_matcher;
  std::uint8_t data_mask;
};

constexpr const std::uint8_t multi_byte_id_mask = 0xC0;
constexpr const std::uint8_t multi_byte_id_matcher = 0x80;
constexpr const std::uint8_t multi_byte_data_mask = 0x3F;
constexpr const std::uint8_t multi_byte_bits = 6;
constexpr const multi_byte_info multi_byte_infos[] = {
    // skip 1 byte info
    {0xE0, 0xC0, 0x1F},
    {0xF0, 0xE0, 0x0F},
    {0xF8, 0xF0, 0x07}};
constexpr const unsigned max_length =
    (sizeof(multi_byte_infos) / sizeof(multi_byte_info));

constexpr const std::uint32_t overlong[] = {0x80, 0x800, 0x10000};
constexpr const std::uint32_t max_code_point = 0x10FFFF;
}

enum class extraction : std::uint8_t { success, failure };

struct extraction_attempt {
  std::uint32_t code_point;
  std::uint8_t bytes_processed;
  extraction status;
};

template <typename Iterator>
constexpr extraction_attempt next_code_point(Iterator position,
                                             const Iterator &end) {
  static_assert(
      std::is_same<typename std::iterator_traits<Iterator>::iterator_category,
                   std::random_access_iterator_tag>{},
      "bad iterator type");

  extraction_attempt result{0, 0, extraction::failure};

  if (end - position) {
    result.code_point = std::uint8_t(*position);
    ++position;
    ++result.bytes_processed;

    if (0x7F < result.code_point) {
      unsigned expected_length = 1;

      for (const auto info : multi_byte_infos) {
        if ((result.code_point & info.id_mask) == info.id_matcher) {
          result.code_point &= info.data_mask;
          break;
        }
        ++expected_length;
      }

      if (max_length < expected_length || (end - position) < expected_length) {
        return result;
      }

      for (unsigned byte = 0; byte < expected_length; ++byte) {
        const std::uint8_t next_byte = *(position + byte);
        if ((next_byte & multi_byte_id_mask) != multi_byte_id_matcher) {
          return result;
        }

        result.code_point <<= multi_byte_bits;
        result.code_point |= (next_byte & multi_byte_data_mask);
        ++result.bytes_processed;
      }

      if (max_code_point < result.code_point) {
        return result;
      }

      if (overlong[expected_length - 1] > result.code_point) {
        return result;
      }
    } // end multi-byte processing

    result.status = extraction::success;
  }

  return result;
}

template <typename Range>
constexpr extraction_attempt next_code_point(const Range &range) {
  return next_code_point(std::begin(range), std::end(range));
}

template <typename T>
boost::iterator_range<T>
next_character_bytes(const boost::iterator_range<T> &range,
                     const extraction_attempt result) {
  return boost::make_iterator_range(range.begin(),
                                    range.begin() + result.bytes_processed);
}

template <std::size_t Length>
constexpr bool test(const char (&range)[Length],
                    const extraction expected_status,
                    const std::uint32_t expected_code_point,
                    const std::uint8_t expected_bytes_processed) {
  const extraction_attempt result =
      next_code_point(std::begin(range), std::end(range) - 1);
  switch (expected_status) {
  case extraction::success:
    return result.status == extraction::success &&
           result.bytes_processed == expected_bytes_processed &&
           result.code_point == expected_code_point;
  case extraction::failure:
    return result.status == extraction::failure &&
           result.bytes_processed == expected_bytes_processed;
  default:
    return false;
  }
}

int main() {
  static_assert(test("F", extraction::success, 'F', 1), "");
  static_assert(test("\0", extraction::success, 0, 1), "");
  static_assert(test("\x7F", extraction::success, 0x7F, 1), "");
  static_assert(test("\xFF\xFF", extraction::failure, 0, 1), "");

  static_assert(test("\xDF", extraction::failure, 0, 1), "");
  static_assert(test("\xDF\xFF", extraction::failure, 0, 1), "");
  static_assert(test("\xC1\xBF", extraction::failure, 0, 2), "");
  static_assert(test("\xC2\x80", extraction::success, 0x80, 2), "");
  static_assert(test("\xDF\xBF", extraction::success, 0x07FF, 2), "");

  static_assert(test("\xEF\xBF", extraction::failure, 0, 1), "");
  static_assert(test("\xEF\xBF\xFF", extraction::failure, 0, 2), "");
  static_assert(test("\xE0\x9F\xBF", extraction::failure, 0, 3), "");
  static_assert(test("\xE0\xA0\x80", extraction::success, 0x800, 3), "");
  static_assert(test("\xEF\xBF\xBF", extraction::success, 0xFFFF, 3), "");

  static_assert(test("\xF7\xBF\xBF", extraction::failure, 0, 1), "");
  static_assert(test("\xF7\xBF\xBF\xFF", extraction::failure, 0, 3), "");
  static_assert(test("\xF0\x8F\xBF\xBF", extraction::failure, 0, 4), "");
  static_assert(test("\xF0\x90\x80\x80", extraction::success, 0x10000, 4), "");
  static_assert(test("\xF4\x8F\xBF\xBF", extraction::success, 0x10FFFF, 4), "");
  static_assert(test("\xF7\xBF\xBF\xBF", extraction::failure, 0, 4), "");

  static_assert(test("𝕫", extraction::success, 0x1D56B, 4), "");

  constexpr const static char text[] =
      "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff ";

  std::cout << text << std::endl;

  auto data = boost::make_iterator_range(text);
  while (!data.empty()) {
    const extraction_attempt result = next_code_point(data);
    switch (result.status) {
    case extraction::success:
      if (boost::spirit::char_encoding::unicode::isprint(result.code_point)) {
        std::cout << next_character_bytes(data, result);
        break;
      }
      // Valid but not printable: fall through and escape the bytes.

    default:
    case extraction::failure:
      std::cout << "[";
      std::cout << std::hex << std::setfill('0');
      for (const auto byte : next_character_bytes(data, result)) {
        // std::setw only applies to the next insertion, so set it per byte.
        std::cout << std::setw(2) << int(std::uint8_t(byte));
      }
      std::cout << "]";
      break;
    }

    data.advance_begin(result.bytes_processed);
  }

  return 0;
}

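Output (the final [00] is the string literal's terminating NUL, which is part of the array being iterated):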

Hello あにま ➦ 👙 𝕫⊆𝕢  ��� 
Hello あにま ➦ 👙 𝕫⊆𝕢 [02][01][08] [ff][ff][ff] [00]

If my UTF8->UTF32 implementation scares you, or if you need support for the user's locale, there are alternatives, each with drawbacks:

  • std::mbrtoc32
    • Impressive because it is the most obvious choice, and yet is not implemented in libstdc++ or libc++ (maybe trunk builds?)
    • Is not reentrant (the current locale can be changed elsewhere suddenly)
  • iterators provided by boost.
    • Throws on invalid sequences making it unusable (can't progress past bad sequences).
  • boost::locale::conv and C++11 std::codecvt
    • Designed to convert ranges of encodings.
    • Need to either output UTF32 to the console (change locale), or convert a character at-a-time to match the source byte(s) with the UTF32 value.
  • UTF8-CPP utf8::next (and non-throwing utf8::internal::validate_next).
    • IMO both inconsistently update the iterator position. If the function fails some sanity checks, the iterator is left at the last byte of a valid UTF-8 sequence representing a bad code point. The documentation says:

it: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.

which doesn't indicate the side effects on exceptions (there definitely are some).

Lee Clagett answered Nov 18 '22 08:11