How do you determine if a string contains escaped unicode so you know whether or not to run .decode("unicode-escape")?
For example:
test.py
# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]
def is_escaped_unicode(str):
    # how do I determine if this is escaped unicode?
    pass

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str
Current output:
"A\u0026B"
"Война́ и миръ"
Expected output:
"A&B"
"Война́ и миръ"
How do I define is_escaped_unicode(str) to determine if the string that's passed is actually escaped unicode?
A unicode escape sequence is a backslash followed by the letter 'u' followed by four hexadecimal digits (0-9a-fA-F). It matches a character in the target sequence with the value specified by the four digits. For example, "\u0041" matches the target sequence "A" when the ASCII character encoding is used.
Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1 / 'latin-1' , so...
Unicode Escapes. A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged.
In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.
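To make the two escape forms concrete, here is a small sketch. It assumes Python 3, where every string literal is Unicode and the u prefix is optional; the variable names are only illustrative.

```python
# \u is followed by exactly four hex digits, \U by exactly eight.
ampersand = "\u0026"
same_ampersand = "\U00000026"

# In a raw string the escape is not processed: six ordinary characters remain.
raw = r"\u0026"

# The unicode_escape codec applies the same rule at run time.
decoded = raw.encode("ascii").decode("unicode_escape")

print(ampersand, same_ampersand, len(raw), decoded)
```

The raw string mirrors the situation in the question: the backslash and the hex digits are ordinary characters until something decodes them.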
You cannot.
There is no way to tell if '"A\u0026B"' originally came from some text that was encoded, or if the data are just the bytes '"A\u0026B"', or if we arrived there from some other encoding.
How do ... you know whether or not to run .decode("unicode-escape")
You have to know if someone earlier has called text.encode('unicode-escape'). The bytes themselves cannot tell you.
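A short sketch of why the bytes carry no record of their origin. It assumes Python 3, and the sample word and variable names are made up for illustration: two different histories end in byte-for-byte identical data.

```python
# Origin 1: real text that someone ran through the unicode-escape codec.
text = "Война"
from_encoding = text.encode("unicode_escape")

# Origin 2: data that simply consists of backslashes, the letter u and hex digits.
from_literal = r"\u0412\u043e\u0439\u043d\u0430".encode("ascii")

# Nothing in the bytes distinguishes the two cases.
print(from_encoding == from_literal)
```

Only the code that produced the data knows which of the two happened, which is why that knowledge has to travel with the data.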
You can certainly guess, by looking for \u or \U escape sequences, or by wrapping the decoding in try/except and seeing what happens, but I don't recommend going down this route.
If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.
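If you decide to guess anyway, the heuristic mentioned above could be sketched roughly like this. It assumes Python 3 and UTF-8 as the fallback encoding; ESCAPE_RE, looks_escaped and maybe_unescape are hypothetical names, not part of any library.

```python
import re

# Matches \uXXXX (four hex digits) or \UXXXXXXXX (eight hex digits).
ESCAPE_RE = re.compile(r"\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}")


def looks_escaped(data: bytes) -> bool:
    """Guess: pure ASCII data containing at least one \\u or \\U sequence."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return ESCAPE_RE.search(text) is not None


def maybe_unescape(data: bytes) -> str:
    """Decode as unicode-escape when the guess fires, otherwise as UTF-8."""
    if looks_escaped(data):
        try:
            return data.decode("unicode_escape")
        except UnicodeDecodeError:
            pass  # malformed escape: fall through to the UTF-8 path
    return data.decode("utf-8")


print(maybe_unescape(b'"A\\u0026B"'))
print(maybe_unescape('"Война́ и миръ"'.encode("utf-8")))
```

This is still a guess: ASCII text that legitimately contains a backslash followed by u and four hex digits will be rewritten, which is exactly the risk described above.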
# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'  # byte string with a literal backslash, as in the question
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    # True when every character is in the 7-bit ASCII range
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(str):
    # escaped unicode is pure ASCII, so treat any ASCII-only string as escaped
    return is_ascii(str)

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str
The code above will work for your case.
Explanation:
Every character in str_escaped is in the ASCII range.
str_unicode contains characters outside the ASCII range.
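One caveat worth checking before relying on the ASCII test: any ASCII-only string passes it, including strings that were never escaped. A minimal sketch, assuming Python 3 bytes and a made-up Windows path as input:

```python
def is_ascii(data: bytes) -> bool:
    # Iterating over bytes yields integers in Python 3.
    return all(byte < 128 for byte in data)


# Plain ASCII data that was never escaped, but happens to contain backslashes.
windows_path = rb"C:\temp\new"

passes_check = is_ascii(windows_path)

# Decoding turns \t into a tab and \n into a newline, corrupting the path.
mangled = windows_path.decode("unicode_escape")

print(passes_check, repr(mangled))
```

So the check is safe only when you already know that every ASCII-only input really is escaped text.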
Here's a crude way to do it. Try decoding as unicode-escape, and if that succeeds the resulting string will be shorter than the original string.
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]
def decoder(s):
    y = s.decode('unicode-escape')
    return y if len(y) < len(s) else s.decode('utf8')

for s in arr_all_strings:
    print s, decoder(s)
Output:
"A\u0026B" "A&B"
"Война и миръ" "Война и миръ"
But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't immediately migrate to Python 3, you may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
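For what it's worth, a rough Python 3 port of the decoder above might look like the following. It is only a sketch: it assumes the input arrives as bytes and that anything not escaped is UTF-8, and it keeps the same crude length comparison.

```python
def decoder(data: bytes) -> str:
    # unicode_escape maps each non-escape byte to one character (Latin-1),
    # so the result is shorter than the input only when an escape collapsed.
    candidate = data.decode("unicode_escape")
    return candidate if len(candidate) < len(data) else data.decode("utf-8")


samples = [b'"A\\u0026B"', '"Война́ и миръ"'.encode("utf-8")]
for sample in samples:
    print(decoder(sample))
```

Because Python 3 keeps bytes and str apart, the call site has to decide up front which one it holds, which removes much of the guesswork in the first place. Malformed escapes still raise UnicodeDecodeError here, just as in the Python 2 version.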