How to determine if a string is escaped unicode

How do you determine if a string contains escaped unicode so you know whether or not to run .decode("unicode-escape")?

For example:

test.py

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_escaped_unicode(str):
    #how do I determine if this is escaped unicode?
    pass

for str in arr_all_strings:
    if is_escaped_unicode(str):
        str = str.decode("unicode-escape")
    print str

Current output:

"A\u0026B"
"Война́ и миръ"

Expected output:

"A&B"
"Война́ и миръ"

How do I define is_escaped_unicode(str) to determine if the string that's passed is actually escaped unicode?

asked Aug 12 '17 14:08

3 Answers

You cannot.

There is no way to tell if '"A\u0026B"' originally came from some text that was encoded, or if the data are just the bytes '"A\u0026B"', or if we arrived there from some other encoding.

How do ... you know whether or not to run .decode("unicode-escape")

You have to know whether someone earlier called text.encode('unicode-escape'). The bytes themselves cannot tell you.

You can certainly guess, by looking for \u or \U escape sequences, or by wrapping the decode in a try/except and seeing what happens, but I don't recommend going down this route.
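For illustration only, the guessing approach described above could look roughly like this. It is a sketch written for Python 3, where byte strings and text are separate types. The names ESCAPE_RE, looks_escaped and decode_guess are hypothetical helpers, and falling back to UTF-8 is itself an assumption about the input.

```python
import re

# Matches \uXXXX or \UXXXXXXXX escape sequences inside a byte string.
ESCAPE_RE = re.compile(rb'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')

def looks_escaped(data: bytes) -> bool:
    """Guess whether data contains unicode-escape sequences."""
    return ESCAPE_RE.search(data) is not None

def decode_guess(data: bytes) -> str:
    """Decode as unicode-escape if it looks escaped, otherwise as UTF-8."""
    if looks_escaped(data):
        try:
            return data.decode('unicode-escape')
        except UnicodeDecodeError:
            pass  # e.g. an escape naming an invalid code point
    return data.decode('utf-8')

print(decode_guess(b'"A\\u0026B"'))
print(decode_guess('"Война́ и миръ"'.encode('utf-8')))
```

The heuristic cannot tell an escaped ampersand apart from text that literally contains the six characters \u0026, which is why guessing is fragile.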

If you encounter a bytestring in your application, and you don't already know what the encoding is, then your problem lies elsewhere and should be fixed elsewhere.
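One way to avoid the guessing, sketched here in Python 3 under the assumption that you control the code producing the bytes, is to carry the codec name alongside the data. Payload, pack and unpack are hypothetical names used only for this example.

```python
from typing import NamedTuple

class Payload(NamedTuple):
    """Bytes together with the codec that produced them."""
    data: bytes
    encoding: str

def pack(text: str, encoding: str) -> Payload:
    # Record the codec at the point where the text is encoded.
    return Payload(text.encode(encoding), encoding)

def unpack(payload: Payload) -> str:
    # No guessing needed: the payload says how it was encoded.
    return payload.data.decode(payload.encoding)

text = '"Война́ и миръ"'
as_escape = pack(text, 'unicode-escape')
as_utf8 = pack(text, 'utf-8')

# The byte representations differ, but both decode back to the same text.
assert as_escape.data != as_utf8.data
assert unpack(as_escape) == unpack(as_utf8) == text
```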

answered Oct 25 '22 04:10

wim

# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'  # byte string; a u'' prefix would expand \u0026 at parse time
str_unicode = '"Война́ и миръ"'

arr_all_strings = [str_escaped, str_unicode]

def is_ascii(s):
    # True if every byte is in the 7-bit ASCII range
    return all(ord(c) < 128 for c in s)

def is_escaped_unicode(s):
    # Heuristic: unicode-escape output is always pure ASCII
    return is_ascii(s)

for s in arr_all_strings:
    if is_escaped_unicode(s):
        s = s.decode("unicode-escape")
    print s

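The code above will work for your case.

Explanation:

Every character in str_escaped is in the ASCII range.
str_unicode contains characters outside the ASCII range.

Note that this check treats every pure-ASCII string as escaped, which is harmless only when the string contains no other backslashes.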
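To show where the ASCII check can go wrong, here is a small Python 3 sketch (an assumption, since the answer targets Python 2). It uses a pure-ASCII byte string that happens to contain a backslash; the variable names are illustrative only.

```python
def is_ascii(data: bytes) -> bool:
    # Iterating over bytes yields integers in Python 3.
    return all(b < 128 for b in data)

plain = b'C:\\new'   # six ASCII bytes: C, :, backslash, n, e, w
assert is_ascii(plain)

# The ASCII check passes, so the heuristic would decode it,
# and the backslash-n pair is turned into a newline character.
mangled = plain.decode('unicode-escape')
assert mangled == 'C:\new'
assert len(mangled) == 5

# ASCII input without backslashes survives unchanged.
assert b'hello'.decode('unicode-escape') == 'hello'
```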

answered Oct 25 '22 04:10

Haha TTpro

Here's a crude way to do it. Try decoding as unicode-escape; if the string contained escape sequences, the resulting string will be shorter than the original string.

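# -*- coding: utf-8 -*-
str_escaped = '"A\u0026B"'
str_unicode = '"Война́ и миръ"'
arr_all_strings = [str_escaped, str_unicode]

def decoder(s):
    # Escape sequences collapse to single characters, so a shorter
    # result means the input contained escape sequences.
    y = s.decode('unicode-escape')
    return y if len(y) < len(s) else s.decode('utf8')

for s in arr_all_strings:
    print s, decoder(s)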

output

"A\u0026B" "A&B"
"Война́ и миръ" "Война́ и миръ"

But seriously, you'll save yourself a lot of pain if you can migrate to Python 3. And if you can't immediately migrate to Python 3, you may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
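As a rough illustration of why Python 3 helps, the sketch below shows that bytes and text are distinct types there: only bytes can be decoded, and mixing the two fails loudly instead of silently. The variable names are illustrative only.

```python
raw = b'"A\\u0026B"'                  # bytes, as read from a file or socket
text = raw.decode('unicode-escape')   # str
assert text == '"A&B"'

# Only bytes objects have .decode(); str objects do not.
assert hasattr(raw, 'decode')
assert not hasattr(text, 'decode')

# Accidentally mixing bytes and text raises TypeError.
try:
    raw + text
except TypeError:
    mixing_fails = True
else:
    mixing_fails = False

assert mixing_fails
```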

answered Oct 25 '22 05:10

PM 2Ring

Tags: python, unicode, python-2.x

Ben McCormack
