I want to iterate over each character of a Unicode string, treating each surrogate pair and combining character sequence as a single unit (one grapheme).
The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.
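To make the 6 versus 4 distinction concrete, here is a rough, portable C++ sketch that groups each base code point with the marks that follow it. It is not real grapheme segmentation: that follows the rules in UAX #29 and needs Unicode property data. The names `is_sample_combining_mark` and `split_sample_clusters` are made up for illustration, and the mark table is hard-coded for this one sample string.

```cpp
#include <string>
#include <vector>

// ASSUMPTION: a hard-coded stand-in for real Unicode property data.
// Only the two marks that occur in the sample text are listed.
bool is_sample_combining_mark(char32_t cp)
{
    return cp == 0x094D   // DEVANAGARI SIGN VIRAMA
        || cp == 0x0947;  // DEVANAGARI VOWEL SIGN E
}

// Splits text into clusters: each base code point plus the marks that follow it.
std::vector<std::u32string> split_sample_clusters(const std::u32string& text)
{
    std::vector<std::u32string> clusters;
    for (char32_t cp : text)
    {
        // A base code point (or the very first code point) starts a new cluster.
        if (clusters.empty() || !is_sample_combining_mark(cp))
            clusters.emplace_back();
        // Marks are appended to the cluster that is currently open.
        clusters.back().push_back(cp);
    }
    return clusters;
}
```

Called with `U"\u0928\u092E\u0938\u094D\u0924\u0947"`, this should give four clusters from six code points, with U+094D attached to U+0938 and U+0947 attached to U+0924. Anything beyond this sample needs a proper library.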
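In .NET this works with StringInfo.GetTextElementEnumerator:
using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        const string s = "नमस्ते";
        Console.WriteLine(s.Length); // Outputs "6" (UTF-16 code units)
        var count = 0;
        var e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext()) count++; // one step per text element
        Console.WriteLine(count); // Outputs "4"
    }
}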
So there we have it in .NET. We also have Win32's CharNextW():
#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";
    std::cout << std::wstring(s).length() << std::endl; // Gives "6"
    int l = 0;
    while(CharNextW(s) != s)
    {
        s = CharNextW(s);
        ++l;
    }
    std::cout << l << std::endl; // Gives "4"
    return 0;
}
Both ways I know of are specific to Microsoft. Are there portable ways to do it?
With ICU, UnicodeString(s).length() still gives 6. An answer that points to the relevant function or module in ICU would be acceptable.
@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code to demonstrate its use (since examples are surprisingly rare):
#include <unicode/unistr.h>
#include <unicode/schriter.h>
#include <unicode/brkiter.h>
#include <iostream>
#include <cassert>
#include <memory>

using namespace icu; // needed with recent ICU releases, which no longer import the namespace by default

int main()
{
    // UTF-16 literal (char16_t constructor, ICU 59 and later);
    // a wchar_t literal only works where wchar_t is 16 bits wide
    const UnicodeString str(u"नमस्ते");
    {
        // StringCharacterIterator walks UTF-16 code units, not graphemes
        StringCharacterIterator iter(str);
        int count = 0;
        while(iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }
    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);
        int count = 0;
        while(iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }
    return 0;
}
You should be able to use the ICU BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).
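Glib's ustring class gives you UTF-8 strings, if using UTF-8 is OK for you. It is designed to be similar to std::string. Since UTF-8 is native on Linux, your task is quite easy. Note, however, that Glib::ustring counts Unicode code points rather than graphemes, so size() reports 6 for this text, not 4: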
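#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    Glib::ustring s = "नमस्ते"; // UTF-8 literal; ustring has no wchar_t constructor
    std::cout << s.size() << std::endl; // counts code points, not bytes
}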
You can also iterate over the string's characters as usual with Glib::ustring::iterator.
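For what it's worth, counting characters in a UTF-8 string (which, as far as I know, is what Glib::ustring does) amounts to counting code points, and that can be sketched with the standard library alone. The function name `count_utf8_code_points` is made up for this example, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// ASSUMPTION: the input is valid UTF-8; no validation is performed.
std::size_t count_utf8_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80)
            ++count; // lead byte or ASCII byte: a new code point starts here
    }
    return count;
}
```

For the sample text, which takes 18 bytes in UTF-8, this returns 6. That is the code point count, so the combining marks are still counted separately and a grapheme-aware iterator such as ICU's BreakIterator is still needed to get 4.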