I'm working on an ASP.NET 4 website and I'm getting the following error when trying to load data from the database into a GridView:
Unable to translate Unicode character \uD83D at index 49 to specified code page.
I've found out that this happens when a data row contains: Text Text Text 😊😊
As I understand it, this text cannot be translated into a valid UTF-8 response.
Is that really the reason?
Is there a way to clean the text before loading it into the GridView to prevent such errors?
UPDATE:
I've made some progress: I only get this error when I call the Substring method on the string. (I'm using Substring to show part of the text as a preview to the user.)
For example, in an ASP.NET Web Form I do this:
String txt = "test 💔";
// The same string can also be built from code points:
txt = char.ConvertFromUtf32(116) + char.ConvertFromUtf32(101) + char.ConvertFromUtf32(115) + char.ConvertFromUtf32(116) + char.ConvertFromUtf32(32) + char.ConvertFromUtf32(128148);
// This works: txt is shown in the Web Form label.
Label1.Text = txt;
// Length is 7: the emoji takes two UTF-16 code units.
Label2.Text = txt.Length.ToString();
// Throws: Unable to translate Unicode character \uD83D at index 5 to specified code page.
Label3.Text = txt.Substring(0, 6);
I know that .NET strings are UTF-16, which represents characters outside the Basic Multilingual Plane as surrogate pairs.
When I use Substring, I accidentally cut a surrogate pair in half, and the unpaired surrogate that is left causes the exception. I found out that I can use the StringInfo class instead:
var si = new System.Globalization.StringInfo(txt);
var l = si.LengthInTextElements; // length is equal to 6.
Label3.Text = si.SubstringByTextElements(0, 5); //no exception!
Another alternative is to just delete the surrogate pairs:
Label3.Text = ValidateUtf8(txt).Substring(0, 3); // no exception!

// Requires: using System.Text;
// Keeps tab, LF, CR and BMP characters outside the surrogate range;
// drops every surrogate code unit (0xD800-0xDFFF) and other control characters.
public static string ValidateUtf8(string txt)
{
    StringBuilder sbOutput = new StringBuilder();
    char ch;
    for (int i = 0; i < txt.Length; i++)
    {
        ch = txt[i];
        if ((ch >= 0x0020 && ch <= 0xD7FF) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            ch == 0x0009 ||
            ch == 0x000A ||
            ch == 0x000D)
        {
            sbOutput.Append(ch);
        }
    }
    return sbOutput.ToString();
}
Is this really a problem with surrogate pairs?
Which characters use surrogate pairs? Is there a list?
Should I keep support for surrogate pairs? Should I use the StringInfo class or just delete the invalid characters?
Thanks!
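You could try round-tripping the text through UTF-8 first (in the row bound event or something similar). The following code encodes the text as UTF-8 and removes anything that cannot be encoded, such as an unpaired surrogate left behind by Substring; complete surrogate pairs are encoded normally and survive.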
// Requires: using System.Text;
private static readonly Encoding Utf8Encoder = Encoding.GetEncoding(
    "UTF-8",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback()
);

// text is the string to clean.
var utf8Text = Utf8Encoder.GetString(Utf8Encoder.GetBytes(text));
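If the goal is only a preview, another option is to keep using Substring but back off by one UTF-16 code unit whenever the cut would land between the two halves of a surrogate pair. The following is a minimal sketch, not a definitive implementation: the class name PreviewText and the helper SafeTruncate are hypothetical names made up for illustration, and the only framework calls it relies on are char.IsHighSurrogate and string.Substring.

```csharp
using System;

public static class PreviewText
{
    // Hypothetical helper (not part of .NET): returns at most maxLength
    // UTF-16 code units of text without splitting a surrogate pair.
    public static string SafeTruncate(string text, int maxLength)
    {
        if (text == null) throw new ArgumentNullException("text");
        if (maxLength < 0) throw new ArgumentOutOfRangeException("maxLength");

        // Nothing to cut, so the string is returned unchanged.
        if (text.Length <= maxLength) return text;

        int end = maxLength;

        // If the last code unit we would keep is a high surrogate, its low
        // surrogate lies past the cut; drop the high surrogate as well.
        if (end > 0 && char.IsHighSurrogate(text[end - 1]))
        {
            end--;
        }

        return text.Substring(0, end);
    }
}
```

With the string from the question, PreviewText.SafeTruncate(txt, 6) keeps only "test " instead of ending on \uD83D. The sketch only avoids creating a new unpaired surrogate; it does not repair input that already contains one, and it does not keep multi-code-point text elements (such as combining sequences) together the way StringInfo does.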