How to determine if a string is escaped unicode

How do you determine if a string contains escaped unicode so you know whether or not to run .decode("unicode-escape")?

For example:

test.py

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_escaped_unicode(str):
    #how do I determine if this is escaped unicode?
    pass

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str

Current output:

"A\u0026B"
"Война́ и миръ"

Expected output:

"A&B"
"Война́ и миръ"

How do I define is_escaped_unicode(str) to determine if the string that's passed is actually escaped unicode?

Ben McCormack asked Aug 12 '17 14:08



3 Answers

You cannot.

There is no way to tell if '"A\u0026B"' originally came from some text that was encoded, or if the data are just the bytes '"A\u0026B"', or if we arrived there from some other encoding.

How do ... you know whether or not to run .decode("unicode-escape")

You have to know if someone earlier has called text.encode('unicode-escape'). The bytes themselves cannot tell you.

You can certainly guess, by looking for \u or \U escape sequences, or by wrapping the decode in try/except and seeing what happens, but I don't recommend going down this route.

If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.
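
For completeness, the "guess by looking for \u or \U escape sequences" idea could be sketched roughly like this. This is Python 3 rather than the Python 2 of the question, the names ESCAPE_RE, looks_escaped and decode_escapes are made up for illustration, and it remains a guess with all the caveats above (it also does not try to recombine surrogate pairs):

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def looks_escaped(text):
    # Heuristic only: True if the text contains something shaped like an escape.
    return ESCAPE_RE.search(text) is not None

def decode_escapes(text):
    # Replace only the matched escapes and leave every other character alone.
    # chr() raises ValueError for values above 0x10FFFF.
    def replace(match):
        return chr(int(match.group(0)[2:], 16))
    return ESCAPE_RE.sub(replace, text)

samples = ['"A\\u0026B"', '"Война́ и миръ"']
results = [decode_escapes(s) if looks_escaped(s) else s for s in samples]
```

Here results holds the decoded first sample and the untouched second one. A string that legitimately contains the six characters \u0026 would be rewritten too, which is exactly why this is only a guess.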

wim answered Oct 25 '22 04:10


# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'  # no u prefix: a u'' literal would resolve \u0026 at parse time
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    # True if every byte of the Python 2 byte string is in the ASCII range
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(s):
    # Heuristic: escaped unicode is pure ASCII, so treat any pure-ASCII string as escaped
    return is_ascii(s)

for s in arr_all_strings:
    if is_escaped_unicode(s):
        s = s.decode("unicode-escape")
    print s

The code above will work for your case.

Explanation:

  • Every character in str_escaped is in the ASCII range.

  • str_unicode contains characters outside the ASCII range.
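
One caveat worth illustrating: the ASCII test says nothing about whether a string was ever escaped, so any plain ASCII text containing a backslash is decoded as well. A small Python 3 sketch of that false positive, where the Windows-style path is just an invented sample value:

```python
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

# Plain ASCII text that nobody ever ran through unicode-escape.
path = r'C:\new\table.txt'

# The heuristic accepts it, so it would be decoded...
accepted = is_ascii(path)

# ...and decoding turns \n and \t into a real newline and a real tab.
mangled = path.encode('ascii').decode('unicode_escape')
changed = mangled != path
```

Both accepted and changed end up True here, so the heuristic is only safe when the input is known never to contain literal backslashes.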

Haha TTpro answered Oct 25 '22 04:10


Here's a crude way to do it. Decode as unicode-escape: if the string contained escape sequences, the result will be shorter than the original string, so keep it; otherwise decode as UTF-8 instead.

str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def decoder(s):
    y = s.decode('unicode-escape')
    return y if len(y) < len(s) else s.decode('utf8')

for s in arr_all_strings:
    print s, decoder(s)

Output:

"A\u0026B" "A&B"
"Война и миръ" "Война и миръ"

But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't immediately migrate to Python 3, you may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
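
As a rough idea of what this looks like in Python 3, where bytes and text are separate types: you decode once, at the point where the bytes enter your program, using the encoding you already know they have. The helper name to_text and the sample values below are invented for this sketch:

```python
# Bytes known (by contract, not by inspection) to be unicode-escape encoded.
escaped_bytes = b'"A\\u0026B"'
# Bytes known to be UTF-8 encoded.
utf8_bytes = '"Война́ и миръ"'.encode('utf-8')

def to_text(raw, encoding):
    # Hypothetical helper: decode exactly once, with the encoding the caller knows.
    return raw.decode(encoding)

first = to_text(escaped_bytes, 'unicode_escape')
second = to_text(utf8_bytes, 'utf-8')
```

After this, first and second are both ordinary str objects and nothing downstream has to guess what they used to be.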

PM 2Ring answered Oct 25 '22 05:10