
Character classification

A simple question again: given a std::string, determine which of its characters are digits, symbols, whitespace, etc. with respect to the user's language and regional settings (locale).

I managed to split the string into a set of characters using the boost locale boundary analysis tool:

#include <boost/locale.hpp>
#include <string>

int main() {
    // u8 literals are arrays of char up to C++17 (char8_t from C++20 on)
    std::string text = u8"生きるか死ぬか";

    boost::locale::boundary::segment_index<std::string::const_iterator> characters(
        boost::locale::boundary::character,
        text.begin(), text.end(),
        boost::locale::generator()("ja_JP.UTF-8"));

    for (const auto& ch : characters) {
        // each 'ch' is a single character of the Japanese text
    }
}

However, I don't see any way to go further and determine whether ch is a digit, a symbol, or anything else. There are boost string classification algorithms, but these don't seem to work with whatever dereferencing a segment_index iterator yields.

Nor can I apply std::isalpha with a std::locale, because I'm unsure whether it is possible to convert the boost segment into a char or wchar_t.

Is there any neat way to classify the characters?

Ixanezis asked Jun 30 '14 07:06


1 Answer

There are a number of functions and objects supporting this in <locale>, but the example text you give looks like UTF-8, which is a multibyte encoding. The functions in <locale> classify one char (or wchar_t) at a time, so they don't work with multibyte encodings.
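To illustrate the limitation, here is a minimal sketch using only the standard library. The helper name count_digits_bytewise is made up for this example, and it uses the classic "C" locale so that it doesn't depend on which locales are installed. It only shows that std::isdigit from <locale> looks at one byte at a time.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: counts the chars that std::isdigit (the <locale>
// overload) reports as digits. Each call sees a single char, i.e. a single
// UTF-8 byte, never a whole multibyte character.
std::size_t count_digits_bytewise(const std::string& utf8) {
    const std::locale& loc = std::locale::classic();
    std::size_t digits = 0;
    for (char c : utf8) {
        if (std::isdigit(c, loc)) {
            ++digits;
        }
    }
    return digits;
}
```

For the four-byte string "7\xEF\xBC\x97" (an ASCII 7 followed by U+FF17 FULLWIDTH DIGIT SEVEN) this returns 1: the ASCII digit is found, but the three bytes of the fullwidth digit are inspected separately and none of them is a digit on its own.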

I'd suggest you get the ICU library and use it. Among other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros or functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).
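To make "one UTF-32 code point at a time" concrete, here is a rough hand-written sketch of the decoding step, using only the standard library. The function name decode_utf8 is an assumption for illustration, and the code assumes well-formed UTF-8 (it does not reject overlong, truncated, or otherwise invalid sequences). With ICU you would not write this yourself; as far as I know the U8_NEXT macro from unicode/utf8.h does the same job with proper error handling.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Assumes well-formed input; no validation is performed.
std::vector<char32_t> decode_utf8(const std::string& utf8) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Each resulting char32_t is the kind of value you would hand (cast to UChar32) to ICU's classification functions from unicode/uchar.h, such as u_isdigit, u_isalpha or u_isspace. For example, "7\xEF\xBC\x97" decodes to the two code points U+0037 and U+FF17.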

James Kanze answered Oct 11 '22 02:10
