ASP.NET - Unable to translate Unicode character XXX at index YYY to specified code page

Tags: c#, .net, asp.net, iis

I'm running an ASP.NET 4 website and I get the following error when loading data from the database into a GridView:

Unable to translate Unicode character \uD83D at index 49 to specified code page.

I've found out that this happens when a data row contains: Text Text Text 😊😊

As I understand it, this text cannot be translated into a valid UTF-8 response.

  1. Is that really the reason?

  2. Is there a way to clean the text before loading it into the GridView to prevent such errors?


UPDATE:

I've made some progress: I only get this error when I call the Substring method on the string. (I use Substring to show part of the text as a preview to the user.)

For example, in an ASP.NET Web Form I do this:

String txt = "test 💔";

// The same string can also be built from code points:
// String txt = char.ConvertFromUtf32(116) + char.ConvertFromUtf32(101) + char.ConvertFromUtf32(115) + char.ConvertFromUtf32(116) + char.ConvertFromUtf32(32) + char.ConvertFromUtf32(128148);

// This works: txt is shown in the Web Form label.
Label1.Text = txt; 

//length is equal to 7.
Label2.Text = txt.Length.ToString();

//causes exception - Unable to translate Unicode character \uD83D at index 5 to specified code page.
Label3.Text = txt.Substring(0, 6);

I know that .NET strings are UTF-16, which represents some characters as surrogate pairs (two char values per character).
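To see where the pair sits in the example string, a small sketch like the following can list its UTF-16 code units. The class and method names (SurrogateInspector, Describe) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class SurrogateInspector
{
    // Hypothetical helper: describes each UTF-16 code unit (char) of a string.
    public static List<string> Describe(string text)
    {
        var lines = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            string kind = char.IsHighSurrogate(ch) ? "high surrogate"
                        : char.IsLowSurrogate(ch) ? "low surrogate"
                        : "ordinary char";
            lines.Add(string.Format("{0}: U+{1:X4} ({2})", i, (int)ch, kind));
        }
        return lines;
    }
}
```

For the string above, index 5 should be reported as U+D83D (high surrogate) and index 6 as U+DC94 (low surrogate), which matches the \uD83D in the error message.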

When I use Substring, I accidentally cut through the middle of a surrogate pair, which causes the exception. I found out that I can use the StringInfo class instead:

var si = new System.Globalization.StringInfo(txt);
var l = si.LengthInTextElements; // length is equal to 6.
Label3.Text = si.SubstringByTextElements(0, 5); //no exception!
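For the preview use case, the StringInfo approach could be wrapped in a small helper. This is only a sketch: PreviewText and Truncate are hypothetical names, and it assumes the preview length is counted in text elements rather than in char values.

```csharp
using System;
using System.Globalization;

public static class PreviewText
{
    // Hypothetical helper: returns at most maxElements text elements,
    // so a surrogate pair is never split in half.
    public static string Truncate(string text, int maxElements)
    {
        if (string.IsNullOrEmpty(text) || maxElements <= 0)
            return string.Empty;

        var info = new StringInfo(text);

        // SubstringByTextElements throws when the requested range runs past
        // the end of the string, so clamp the count first.
        int count = Math.Min(maxElements, info.LengthInTextElements);
        return info.SubstringByTextElements(0, count);
    }
}
```

Usage would then look like Label3.Text = PreviewText.Truncate(txt, 6); which also avoids an exception when the text is shorter than the preview length.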

Another alternative is to simply strip out all surrogate characters:

Label3.Text = ValidateUtf8(txt).Substring(0, 3); //no exception!

// Requires: using System.Text;
public static string ValidateUtf8(string txt)
{
    StringBuilder sbOutput = new StringBuilder();
    char ch;

    for (int i = 0; i < txt.Length; i++)
    {
        ch = txt[i];
        // Keep tab, LF, CR and everything in the BMP except surrogates
        // (0xD800-0xDFFF), 0xFFFE/0xFFFF and other control characters.
        if ((ch >= 0x0020 && ch <= 0xD7FF) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            ch == 0x0009 ||
            ch == 0x000A ||
            ch == 0x000D)
        {
            sbOutput.Append(ch);
        }
    }
    return sbOutput.ToString();
}
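Note that ValidateUtf8 drops every surrogate, so well-formed emoji disappear too. A possible middle ground, sketched below under the assumption that only broken pairs need to go, is to keep complete pairs and drop only unpaired surrogates. SurrogateCleaner and RemoveLoneSurrogates are hypothetical names; unlike ValidateUtf8, this sketch does not filter control characters.

```csharp
using System.Text;

public static class SurrogateCleaner
{
    // Hypothetical helper: keeps well-formed surrogate pairs and drops
    // only surrogates that have lost their partner.
    public static string RemoveLoneSurrogates(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        var output = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (char.IsHighSurrogate(ch) &&
                i + 1 < text.Length &&
                char.IsLowSurrogate(text[i + 1]))
            {
                // Well-formed pair: copy both halves and skip the low one.
                output.Append(ch).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(ch))
            {
                output.Append(ch);
            }
            // Otherwise ch is an unpaired surrogate and is dropped.
        }
        return output.ToString();
    }
}
```

With this, Label3.Text = SurrogateCleaner.RemoveLoneSurrogates(txt.Substring(0, 6)); should show "test " without throwing, while the untouched txt keeps its emoji.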

Is this really a surrogate pair problem?

Which characters use surrogate pairs? Is there a list?

Should I keep support for surrogate pairs? Should I use the StringInfo class, or just delete the invalid chars?

Thanks!

RuSh asked Mar 19 '12 17:03


1 Answer

You could try encoding the text to UTF-8 first (in the GridView's RowDataBound event or somewhere similar). The following code round-trips the text through UTF-8 and removes any characters that cannot be encoded.

private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback()
);

var utf8Text = Utf8Encoder.GetString(Utf8Encoder.GetBytes(text));
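
As a sketch of how this could be packaged for reuse, the encoder and the round trip can live in one static helper. Utf8Cleaner and Clean are hypothetical names, not part of the snippet above.

```csharp
using System.Text;

public static class Utf8Cleaner
{
    // Same encoder as above: unencodable input (such as a lone surrogate)
    // is replaced with an empty string instead of raising an exception.
    private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
        "UTF-8",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Hypothetical helper: round-trips text through UTF-8 to strip
    // anything that cannot be encoded.
    public static string Clean(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        return Utf8Encoder.GetString(Utf8Encoder.GetBytes(text));
    }
}
```

Applied to the question's example, Utf8Cleaner.Clean(txt.Substring(0, 6)) should return "test ", because the lone high surrogate is dropped, while a string with complete pairs should come back unchanged.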

LaserJesus answered Oct 18 '22 21:10