 

Accidentally splitting Unicode chars when truncating strings

I'm saving some strings from a third party into my database (postgres). Sometimes these strings are too long and need to be truncated to fit into the column in my table.

Occasionally the cut lands in the middle of a Unicode character, which leaves me with a "broken" string that I cannot save into the database. I get the following error: "Unable to translate Unicode character \uD83D at index XXX to specified code page."

I've created a minimal example to show you what I mean. Here I have a string that contains a Unicode character ("Small blue diamond" 🔹 U+1F539). Depending on where I truncate, it gives me a valid string or not.

var myString = @"This is a string before an emoji:🔹 This is after the emoji.";

// The cut ends in the middle of the emoji.
var brokenString = myString.Substring(0, 34);
// Gives: "This is a string before an emoji:☐"

// The cut ends right after the emoji.
var validString = myString.Substring(0, 35);
// Gives: "This is a string before an emoji:🔹"

Is there a way for me to truncate the string without accidentally breaking any Unicode chars?

Joel asked Sep 29 '17 08:09




1 Answer

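A single Unicode character may be represented by more than one char, and that is the problem you are having with string.Substring. A .NET string is a sequence of UTF-16 code units, and a character outside the Basic Multilingual Plane, such as 🔹 (U+1F539), is stored as a surrogate pair of two chars. Substring counts chars, so it can cut such a pair in half and leave a lone surrogate such as \uD83D at the end of the result.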
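To make that concrete, here is a small sketch that inspects the chars around the cut point in the string from the question. It only uses char.IsHighSurrogate and char.IsLowSurrogate from System; the variable names are purely illustrative, and it assumes C# 9 or later for top-level statements.

```csharp
using System;

var myString = "This is a string before an emoji:🔹 This is after the emoji.";

// The emoji U+1F539 occupies two chars: a high surrogate at index 33
// and a low surrogate at index 34.
Console.WriteLine(char.IsHighSurrogate(myString[33])); // True
Console.WriteLine(char.IsLowSurrogate(myString[34]));  // True

// Substring(0, 34) keeps indexes 0..33, so the result ends with
// a lone high surrogate, which is what the encoder rejects.
var broken = myString.Substring(0, 34);
Console.WriteLine(char.IsHighSurrogate(broken[broken.Length - 1])); // True
```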

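You may wrap your string in a StringInfo object and then use its SubstringByTextElements() method, which takes its start and length in text elements (user-perceived characters) rather than in chars. Note that the limit you pass is then a number of text elements, so the resulting string can contain more chars than that number.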

See a C# demo:

using System;
using System.Globalization;

// Length counts chars (UTF-16 code units);
// LengthInTextElements counts text elements.
Console.WriteLine("🔹".Length); // => 2
Console.WriteLine(new StringInfo("🔹").LengthInTextElements); // => 1

var myString = @"This is a string before an emoji:🔹This is after the emoji.";
var teMyString = new StringInfo(myString);
// The second argument is a number of text elements, not chars.
Console.WriteLine(teMyString.SubstringByTextElements(0, 33));
// => This is a string before an emoji:
Console.WriteLine(teMyString.SubstringByTextElements(0, 34));
// => This is a string before an emoji:🔹
Console.WriteLine(teMyString.SubstringByTextElements(0, 35));
// => This is a string before an emoji:🔹T
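If the limit you need to respect is a number of chars rather than a number of text elements, one possible variation is to cut at the last text-element boundary that still fits. The sketch below is only an illustration under that assumption: TruncateToChars is a hypothetical helper name, not a framework method, and it relies on StringInfo.ParseCombiningCharacters, which returns the char index at which each text element starts. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: returns at most maxChars chars of s,
// never cutting through a text element.
static string TruncateToChars(string s, int maxChars)
{
    if (maxChars < 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
    if (s.Length <= maxChars) return s;

    // Start index (in chars) of every text element in s.
    int[] starts = StringInfo.ParseCombiningCharacters(s);

    // Keep the last boundary that does not exceed maxChars.
    int cut = 0;
    foreach (int start in starts)
    {
        if (start > maxChars) break;
        cut = start;
    }
    return s.Substring(0, cut);
}

var myString = "This is a string before an emoji:🔹This is after the emoji.";
Console.WriteLine(TruncateToChars(myString, 34));
// => This is a string before an emoji:
Console.WriteLine(TruncateToChars(myString, 35));
// => This is a string before an emoji:🔹
```

With a budget of 34 chars the emoji does not fit completely, so it is dropped rather than split. Whether your column limit is actually measured in chars, code points, or bytes depends on the database and column type, so treat the char budget here as an assumption to check.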
Wiktor Stribiżew answered Sep 23 '22 00:09
