
Unicode string normalization in C/C++

I am wondering how to normalize Unicode strings (UTF-8 or UTF-16) in C/C++. In .NET there is the String.Normalize method.

I have used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt both provide string normalization, but I would prefer a lightweight solution.

Is there any "lightweight" solution for this?

Ghassen Hamrouni asked Feb 03 '11 10:02



2 Answers

As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization.

Avi answered Oct 04 '22 08:10


On Windows, there is the NormalizeString() function (unfortunately available only on Vista and later, as far as I can see on MSDN):

http://msdn.microsoft.com/en-us/library/windows/desktop/dd319093%28v=vs.85%29.aspx

It's the simplest way to go that I have found so far. I guess it's quite lightweight too.

int NormalizeString(
    _In_      NORM_FORM NormForm,
    _In_      LPCWSTR   lpSrcString,
    _In_      int       cwSrcLength,
    _Out_opt_ LPWSTR    lpDstString,
    _In_      int       cwDstLength
);
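A rough sketch of how the call is typically used, assuming a Windows build linked against Normaliz.lib. The pattern is to call once with a NULL destination to get a size estimate, then retry with a larger buffer if the function reports ERROR_INSUFFICIENT_BUFFER. The helper name normalize_nfc_win and the retry limit are my own choices for illustration.

```c
#ifdef _WIN32 /* Windows-only: NormalizeString() lives in Normaliz.dll */
#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper: returns a newly allocated NFC copy of a
 * NUL-terminated UTF-16 string, or NULL on failure. Caller calls free(). */
wchar_t *normalize_nfc_win(const wchar_t *src)
{
    /* With a NULL destination the function returns an estimated buffer
     * size in WCHARs; passing -1 as the length includes the terminator. */
    int cap = NormalizeString(NormalizationC, src, -1, NULL, 0);

    /* The estimate may be too small, so retry a bounded number of times. */
    for (int attempt = 0; attempt < 10 && cap > 0; attempt++) {
        wchar_t *dst = malloc((size_t)cap * sizeof *dst);
        if (dst == NULL)
            return NULL;

        int n = NormalizeString(NormalizationC, src, -1, dst, cap);
        if (n > 0)
            return dst; /* success: dst holds the normalized string */

        DWORD err = GetLastError(); /* read before free() can disturb it */
        free(dst);
        if (err != ERROR_INSUFFICIENT_BUFFER)
            return NULL; /* real error, e.g. invalid Unicode in src */

        cap = -n; /* on this error, the negated return is the new estimate */
    }
    return NULL;
}
#endif
```

NormalizationC can be swapped for NormalizationD, NormalizationKC or NormalizationKD to get the other forms. Note that the function works on UTF-16 only, so UTF-8 input would need converting first (for example with MultiByteToWideChar).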
NoOne answered Oct 04 '22 09:10