
How to convert a string with Unicode encoding to a string of letters

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:

"\u0048\u0065\u006C\u006C\u006F World" 

should become

"Hello World" 

I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for them. The file names in that file are escaped with Unicode encoding, so when I search for the files I can't find them: the search looks for a file with a literal \uXXXX in its name.

SharonBL asked Jun 21 '12 19:06




1 Answer

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.

    import org.apache.commons.lang.StringEscapeUtils;

    @Test
    public void testUnescapeJava() {
        String sJava = "\\u0048\\u0065\\u006C\\u006C\\u006F";
        System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
    }

Output:

    StringEscapeUtils.unescapeJava(sJava):
    Hello
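If pulling in Commons Lang is not an option, the same \uXXXX decoding can be approximated with the JDK's regex classes. The following is only a minimal sketch: the class name UnicodeUnescaper and its unescape method are made up for illustration, it handles only \uXXXX sequences (not \n, \t, or other Java escapes), and it does not treat a doubled backslash before the u as a literal backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class UnicodeUnescaper {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9A-Fa-f]{4})");

    // Replaces every \uXXXX sequence in the input with the char it encodes.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes are Java source escaping; at runtime the
        // string holds a single backslash before each 'u', as text read
        // from a file would.
        String raw = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
        System.out.println(unescape(raw)); // prints "Hello World"
    }
}
```

For the file-name case in the question, each name read from the file would be passed through a method like this before the lookup, so the search uses the decoded name instead of the escaped text.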
Tony answered Oct 23 '22 11:10