How to remove invalid code points from a string?

Tags: c#, unicode

I have a routine that needs to be supplied with normalized strings. However, the data that's coming in isn't necessarily clean, and String.Normalize() raises ArgumentException if the string contains invalid code points.

What I'd like to do is just replace those code points with a throwaway character such as '?'. But to do that I need an efficient way to search through the string to find them in the first place. What is a good way to do that?

The following code works, but it's basically using try/catch as a crude if-statement so performance is terrible. I'm just sharing it to illustrate the behavior I'm looking for:

private static string ReplaceInvalidCodePoints(string aString, string replacement)
{
    var builder = new StringBuilder(aString.Length);
    var enumerator = StringInfo.GetTextElementEnumerator(aString);

    while (enumerator.MoveNext())
    {
        string nextElement;
        try { nextElement = enumerator.GetTextElement().Normalize(); }
        catch (ArgumentException) { nextElement = replacement; }
        builder.Append(nextElement);
    }

    return builder.ToString();
}

(edit:) I'm thinking of converting the text to UTF-32 so that I can quickly iterate over it and check whether each dword corresponds to a valid code point. Is there a function that will do that? If not, is there a list of invalid ranges floating around out there?

Sean U, asked Jan 07 '12 03:01


2 Answers

It seems like the only way to do it is 'manually', as you've done. Here's a version that gives the same results as yours but is a bit faster (about 4 times faster on a string of all chars up to char.MaxValue, with less improvement on a string covering everything up to U+10FFFF) and doesn't require unsafe code. I've also simplified and commented my IsCharacter method to explain each selection:

static string ReplaceNonCharacters(string aString, char replacement)
{
    var sb = new StringBuilder(aString.Length);
    for (var i = 0; i < aString.Length; i++)
    {
        if (char.IsSurrogatePair(aString, i))
        {
            int c = char.ConvertToUtf32(aString, i);
            i++;
            if (IsCharacter(c))
                sb.Append(char.ConvertFromUtf32(c));
            else
                sb.Append(replacement);
        }
        else
        {
            char c = aString[i];
            if (IsCharacter(c))
                sb.Append(c);
            else
                sb.Append(replacement);
        }
    }
    return sb.ToString();
}

static bool IsCharacter(int point)
{
    return point < 0xFDD0 || // everything below here is fine
        point > 0xFDEF &&    // exclude the 0xFDD0...0xFDEF non-characters
        (point & 0xFFFE) != 0xFFFE; // exclude all other non-characters (low 16 bits 0xFFFE or 0xFFFF)
}
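
One case the loop above does not cover is an unpaired surrogate: a lone high or low surrogate is below 0xFDD0, so IsCharacter accepts it, yet as far as I know Normalize() rejects strings containing one. The following is a minimal sketch of a variant that also replaces unpaired surrogates, assuming that is the behavior you want. The class name CodePointCleaner and the method name ReplaceInvalid are made up for this illustration and do not come from the answer.

```csharp
using System;
using System.Text;

public static class CodePointCleaner
{
    // Same test as IsCharacter above: rejects U+FDD0..U+FDEF and any
    // code point whose low 16 bits are 0xFFFE or 0xFFFF.
    public static bool IsCharacter(int point)
    {
        return point < 0xFDD0 ||
            (point > 0xFDEF && (point & 0xFFFE) != 0xFFFE);
    }

    // Hypothetical variant of ReplaceNonCharacters that additionally
    // treats an unpaired surrogate as invalid.
    public static string ReplaceInvalid(string input, char replacement)
    {
        var sb = new StringBuilder(input.Length);
        for (var i = 0; i < input.Length; i++)
        {
            if (char.IsSurrogatePair(input, i))
            {
                int point = char.ConvertToUtf32(input, i);
                i++; // skip the low surrogate of the pair
                if (IsCharacter(point))
                    sb.Append(char.ConvertFromUtf32(point));
                else
                    sb.Append(replacement);
            }
            else
            {
                char c = input[i];
                // A surrogate that reaches this branch has no partner.
                if (!char.IsSurrogate(c) && IsCharacter(c))
                    sb.Append(c);
                else
                    sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

Calling CodePointCleaner.ReplaceInvalid(text, '?') before text.Normalize() is the intended use. A supplementary-plane noncharacter occupies two chars in the input but is replaced by a single replacement char.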
porges, answered Nov 10 '22 15:11


I went ahead with the solution hinted at in the edit.

I couldn't find an easy-to-use list of valid ranges in the Unicode space; even the official Unicode character database was going to take more parsing than I really wanted to deal with. So instead I wrote a quick script to loop over every number in the range [0x0, 0x10FFFF], convert it to a string using Encoding.UTF32.GetString(BitConverter.GetBytes(code)), and try calling .Normalize() on the result. If an exception is raised, then that value is not a valid code point.
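
A scan of that kind could look roughly like the sketch below. It is a reconstruction from the description above, not the original script; the names NormalizeProbe, NormalizeAccepts and FindRejectedRanges are hypothetical. It assumes a little-endian machine, so that the bytes from BitConverter.GetBytes match the byte order Encoding.UTF32 expects. Which code points Normalize() rejects may differ between .NET runtimes and operating systems, so the output is not guaranteed to match the ranges reported here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NormalizeProbe
{
    // True if Normalize() accepts the one-code-point string built from
    // `code` via Encoding.UTF32.GetString(BitConverter.GetBytes(code)).
    public static bool NormalizeAccepts(int code)
    {
        string s = Encoding.UTF32.GetString(BitConverter.GetBytes(code));
        try
        {
            s.Normalize();
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    // Scans [first, last] and collapses the rejected code points into
    // inclusive (First, Last) ranges.
    public static List<(int First, int Last)> FindRejectedRanges(int first, int last)
    {
        var ranges = new List<(int First, int Last)>();
        int start = -1; // -1 means "not currently inside a rejected run"
        for (int code = first; code <= last; code++)
        {
            bool ok = NormalizeAccepts(code);
            if (!ok && start < 0)
                start = code;
            if (ok && start >= 0)
            {
                ranges.Add((start, code - 1));
                start = -1;
            }
        }
        if (start >= 0)
            ranges.Add((start, last));
        return ranges;
    }
}
```

Running NormalizeProbe.FindRejectedRanges(0x0, 0x10FFFF) and printing each pair in hexadecimal gives the gaps between the valid ranges. The full scan throws and catches one exception per rejected code point, which is slow but only needs to be done once.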

From those results, I created the following function:

bool IsValidCodePoint(UInt32 point)
{
    return (point >= 0x0 && point <= 0xfdcf)
        || (point >= 0xfdf0 && point <= 0xfffd)
        || (point >= 0x10000 && point <= 0x1fffd)
        || (point >= 0x20000 && point <= 0x2fffd)
        || (point >= 0x30000 && point <= 0x3fffd)
        || (point >= 0x40000 && point <= 0x4fffd)
        || (point >= 0x50000 && point <= 0x5fffd)
        || (point >= 0x60000 && point <= 0x6fffd)
        || (point >= 0x70000 && point <= 0x7fffd)
        || (point >= 0x80000 && point <= 0x8fffd)
        || (point >= 0x90000 && point <= 0x9fffd)
        || (point >= 0xa0000 && point <= 0xafffd)
        || (point >= 0xb0000 && point <= 0xbfffd)
        || (point >= 0xc0000 && point <= 0xcfffd)
        || (point >= 0xd0000 && point <= 0xdfffd)
        || (point >= 0xe0000 && point <= 0xefffd)
        || (point >= 0xf0000 && point <= 0xffffd)
        || (point >= 0x100000 && point <= 0x10fffd);
}

Note that this function isn't necessarily great for general-purpose cleanup, depending on your needs. It does not exclude unassigned or reserved code points, only the ones specifically designated as noncharacters: 0xfdd0 through 0xfdef, plus the last two code points of every plane (which is why values such as 0xfffff are rejected as well). However, these seem to be the only code points that will cause IsNormalized() and Normalize() to raise an exception, so it's fine for my purposes.

After that, it's just a matter of converting the string to UTF-32 and combing through it. Since Encoding.GetBytes() returns a byte array and IsValidCodePoint() expects a UInt32, I used an unsafe block and some casting to bridge the gap:

unsafe string ReplaceInvalidCodePoints(string aString, char replacement)
{
    if (char.IsHighSurrogate(replacement) || char.IsLowSurrogate(replacement))
        throw new ArgumentException("Replacement cannot be a surrogate", "replacement");

    byte[] utf32String = Encoding.UTF32.GetBytes(aString);

    fixed (byte* d = utf32String)
    fixed (byte* s = Encoding.UTF32.GetBytes(new[] { replacement }))
    {
        var data = (UInt32*)d;
        var substitute = *(UInt32*)s;

        for(var p = data; p < data + ((utf32String.Length) / sizeof(UInt32)); p++)
        {
            if (!(IsValidCodePoint(*p))) *p = substitute;
        }
    }

    return Encoding.UTF32.GetString(utf32String);
}

Performance is good, comparatively speaking: several orders of magnitude faster than the sample posted in the question. Leaving the data in UTF-16 would presumably have been faster and more memory-efficient, but at the cost of a lot of extra code for dealing with surrogates. And of course having replacement be a char means the replacement character must be in the BMP.
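
If unsafe code is not an option, the same UTF-32 pass can be sketched with Buffer.BlockCopy, which copies the encoded bytes into a uint[] and back. This is an illustrative alternative rather than the code used above; the class name SafeCodePointCleaner is hypothetical, and its IsValidCodePoint is a compact form of the range check. Like the unsafe version, it assumes a little-endian machine, because Encoding.UTF32 produces little-endian bytes.

```csharp
using System;
using System.Text;

public static class SafeCodePointCleaner
{
    // Rejects U+FDD0..U+FDEF, code points whose low 16 bits are 0xFFFE or
    // 0xFFFF, and anything above U+10FFFF.
    public static bool IsValidCodePoint(uint point)
    {
        return point < 0xFDD0
            || (point >= 0xFDF0
                && (point & 0xFFFE) != 0xFFFE
                && point <= 0x10FFFF);
    }

    public static string ReplaceInvalidCodePoints(string input, char replacement)
    {
        if (char.IsSurrogate(replacement))
            throw new ArgumentException("Replacement cannot be a surrogate", nameof(replacement));

        // One 4-byte unit per code point; GetBytes does not emit a BOM.
        byte[] bytes = Encoding.UTF32.GetBytes(input);
        var points = new uint[bytes.Length / sizeof(uint)];
        Buffer.BlockCopy(bytes, 0, points, 0, bytes.Length);

        for (int i = 0; i < points.Length; i++)
        {
            if (!IsValidCodePoint(points[i]))
                points[i] = replacement; // char widens implicitly to uint
        }

        Buffer.BlockCopy(points, 0, bytes, 0, bytes.Length);
        return Encoding.UTF32.GetString(bytes);
    }
}
```

The trade-off is two extra copies of the buffer in exchange for dropping the unsafe block and the pointer arithmetic.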

edit: Here's a much more concise version of IsValidCodePoint():

private static bool IsValidCodePoint(UInt32 point)
{
    return point < 0xfdd0
        || (point >= 0xfdf0
            && ((point & 0xfffe) != 0xfffe) // low 16 bits are neither 0xfffe nor 0xffff
            && point <= 0x10ffff
        );
}
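
One way to gain confidence that the concise check agrees with the range table is to compare the two over every code point. The sketch below does that, with the table expressed per plane instead of as a list of literal ranges; the names CodePointCheck, IsValidByRanges, IsValidConcise and FirstMismatch are hypothetical.

```csharp
using System;

public static class CodePointCheck
{
    // Range form: plane 0 allows 0x0000..0xFDCF and 0xFDF0..0xFFFD,
    // every other plane allows offsets 0x0000..0xFFFD.
    public static bool IsValidByRanges(uint point)
    {
        if (point > 0x10FFFF)
            return false;
        uint low = point & 0xFFFF;
        if (point <= 0xFFFF)
            return low <= 0xFDCF || (low >= 0xFDF0 && low <= 0xFFFD);
        return low <= 0xFFFD;
    }

    // Bit-mask form, equivalent to the concise IsValidCodePoint above.
    public static bool IsValidConcise(uint point)
    {
        return point < 0xFDD0
            || (point >= 0xFDF0
                && (point & 0xFFFE) != 0xFFFE
                && point <= 0x10FFFF);
    }

    // Returns the first value in [0, 0x110000] on which the two forms
    // disagree, or -1 if they agree everywhere (one value past U+10FFFF
    // is included to exercise the upper bound).
    public static long FirstMismatch()
    {
        for (uint p = 0; p <= 0x110000; p++)
        {
            if (IsValidByRanges(p) != IsValidConcise(p))
                return p;
        }
        return -1;
    }
}
```

A return value of -1 from CodePointCheck.FirstMismatch() means the two forms accept exactly the same code points.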
Sean U, answered Nov 10 '22 13:11