The simple question again: given a std::string, determine which of its characters are digits, symbols, whitespace, etc., with respect to the user's language and regional settings (locale).
I managed to split the string into a set of characters using the boost locale boundary analysis tool:
std::string text = u8"生きるか死ぬか";
boost::locale::boundary::segment_index<std::string::const_iterator> characters(
boost::locale::boundary::character,
text.begin(), text.end(),
boost::locale::generator()("ja_JP.UTF-8"));
for (const auto& ch : characters) {
// each 'ch' is a single character in japanese language
}
However, I do not see any way to then determine whether ch is a digit, a symbol, or anything else.
There are the Boost string classification algorithms, but these don't seem to work with whatever *segment_index::iterator is.
Nor can I apply std::isalpha(std::locale), because I'm unsure whether it is possible to convert the Boost segment into a char or wchar_t.
Is there any neat way to classify symbols?
There are a number of functions and objects supporting this in <locale>, but... The example text you give looks like UTF-8, which is a multibyte encoding, and the functions in <locale> don't work with multibyte encodings.
I'd suggest you get the ICU library and use it. Amongst other things, it allows testing for all of the properties defined in the Unicode Character Database. It also has macros or functions for iterating over a string (or at least an array of char), extracting one UTF-32 code point at a time (which is what you'd want to test).
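To make the "one UTF-32 code point at a time" step concrete, here is a minimal hand-rolled sketch using only the standard library. It is not ICU code: decode_utf8 is a hypothetical helper name, and the decoder is deliberately simplified (it does not reject overlong encodings or surrogate code points). In real code ICU's own iteration macros, such as U8_NEXT from unicode/utf8.h, should do this job for you.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Simplified sketch: malformed or truncated sequences become U+FFFD,
// but overlong encodings and surrogates are NOT rejected.
std::vector<char32_t> decode_utf8(const std::string& text) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= text.size() ||
                (static_cast<unsigned char>(text[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(text[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

Each resulting code point is what you would then hand to ICU's classification functions in unicode/uchar.h, for example u_isdigit, u_isalpha, u_isUWhiteSpace or u_ispunct, all of which take a UChar32. Note that these test single code points, whereas the Boost character boundary segments in the question can span several code points each.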