Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to shorten UTF8 string based on byte length

A recent project called for importing data into an Oracle database. The program that will do this is a C# .Net 3.5 app and I'm using the Oracle.DataAccess connection library to handle the actual inserting.

I ran into a problem where I'd receive this error message when inserting a particular field:

ORA-12899 Value too large for column X

I used Field.Substring(0, MaxLength); but still got the error (though not for every record).

Finally I saw what should have been obvious, my string was in ANSI and the field was UTF8. Its length is defined in bytes, not characters.

This gets me to my question. What is the best way to trim my string to fix the MaxLength?

My substring code works by character length. Is there simple C# function that can trim a UT8 string intelligently by byte length (ie not hack off half a character) ?

like image 791
Michael La Voie Avatar asked Aug 03 '09 23:08

Michael La Voie


1 Answers

Here are two possible solution - a LINQ one-liner processing the input left to right and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters and is hard to give a general suggestion. The decision between LINQ and traditional code I probably a matter of taste (or maybe speed).

If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoreticaly imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.

public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}
like image 76
Daniel Brückner Avatar answered Nov 03 '22 18:11

Daniel Brückner