I have the following String
of characters.
string s = "\\u0625\\u0647\\u0644";
When I print the above sequence, I get:
\u0625\u0647\u062
How can I get the real printable Unicode characters instead of this \uxxxx representation?
The code point is a unique number for a character or some symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).
You have two options to create Unicode string in Python. Either use decode() , or create a new Unicode string with UTF-8 encoding by unicode(). The unicode() method is unicode(string[, encoding, errors]) , its arguments should be 8-bit strings.
Unicode is an international standard of character encoding which has the capability of representing a majority of written languages all over the globe. Unicode uses hexadecimal to represent a character. Unicode is a 16-bit character encoding system. The lowest value is \u0000 and the highest value is \uFFFF.
Unicode, on the other hand, has tens of thousands of characters. That means that each Unicode character takes more than one byte, so you need to make the distinction between characters and bytes. Standard Python strings are really byte strings, and a Python character is really a byte.
If you really don't control the string, then you need to replace those escape sequences with their values:
Regex.Replace(s, @"\u([0-9A-Fa-f]{4})", m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());
and hope that you don't have \\
escapes in there too.
Asker posted this as an answer to their question:
I have found the answer:
s = System.Text.RegularExpressions.Regex.Unescape(s);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With