I have a string with escaped Unicode characters, \uXXXX
, and I want to convert it to regular Unicode letters. For example:
"\u0048\u0065\u006C\u006C\u006F World"
should become
"Hello World"
I know that when I print the first string it already shows Hello world
. My problem is I read file names from a file, and then I search for them. The files names in the file are escaped with Unicode encoding, and when I search for the files, I can't find them, since it searches for a file with \uXXXX
in its name.
You have two options to create Unicode string in Python. Either use decode() , or create a new Unicode string with UTF-8 encoding by unicode(). The unicode() method is unicode(string[, encoding, errors]) , its arguments should be 8-bit strings.
To convert Python Unicode to string, use the unicodedata. normalize() function. The Unicode standard defines various normalization forms of a Unicode string, based on canonical equivalence and compatibility equivalence.
Unicode, on the other hand, has tens of thousands of characters. That means that each Unicode character takes more than one byte, so you need to make the distinction between characters and bytes. Standard Python strings are really byte strings, and a Python character is really a byte.
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have.
The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils; @Test public void testUnescapeJava() { String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F"; System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava)); } output: StringEscapeUtils.unescapeJava(sJava): Hello
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With