
Python: Convert complex dictionary of strings from Unicode to ASCII [duplicate]

Possible Duplicate:
How to get string Objects instead Unicode ones from JSON in Python?

I have a lot of input in the form of multi-level dictionaries parsed from JSON API calls. The strings are all unicode, which means there is a lot of u'stuff like this'. I am using jq to play around with the results and need to convert them to ASCII.

I know I can write a function to convert it, like this:

def convert(input):
    if isinstance(input, dict):
        ret = {}
        for stuff in input:
            ret = convert(stuff)
    elif isinstance(input, list):
        ret = []
        for i in range(len(input))
            ret = convert(input[i])
    elif isinstance(input, str):
        ret = input.encode('ascii')
    elif :
        ret = input
    return ret

Is this even correct? Not sure. That's not what I want to ask you though.

What I'm asking is, this is a typical brute-force solution to the problem. There must be a better way. A more pythonic way. I'm no expert on algorithms, but this one doesn't look particularly fast either.

So is there a better way? Or if not, can this function be improved...?


Post-answer edit

Mark Amery's answer is correct, but I would like to post a modified version of it. His function uses a dict comprehension, which only works on Python 2.7+. I'm on 2.6, so I had to convert it:

def convert(input):
    if isinstance(input, dict):
        return dict((convert(key), convert(value)) for key, value in input.iteritems())
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
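
For reference, the two spellings should build the same dictionary. A small sketch with made-up sample data (the names pairs, from_generator and from_comprehension are illustrative only):

```python
pairs = [(u'a', 1), (u'b', 2)]  # hypothetical sample data

# Also works on Python 2.6: dict() fed by a generator of (key, value) tuples.
from_generator = dict((key, value * 10) for key, value in pairs)

# Needs Python 2.7 or later: dict comprehension syntax.
from_comprehension = {key: value * 10 for key, value in pairs}

# Both expressions produce an equal dictionary.
same_result = from_generator == from_comprehension
```

The generator form is only needed when 2.6 has to be supported; on newer versions the comprehension reads more directly.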

Dreen asked Oct 27 '12 15:10


1 Answer

Recursion seems like the way to go here, but if you're on Python 2.x you want to be checking for unicode, not str (the str type represents a string of bytes, and the unicode type a string of Unicode characters; neither inherits from the other, and it is unicode-type strings that are displayed in the interpreter with a u in front of them).

There's also a little syntax error in your posted code (the trailing elif: should be an else), and you're not returning the same structure in the case where input is either a dictionary or a list. (In the case of a dictionary, you're returning the converted version of the final key; in the case of a list, you're returning the converted version of the final element. Neither is right!)
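
To make those fixes concrete, here is a minimal sketch that keeps the loop-based shape of the question's function but stores each converted item instead of overwriting ret. The name convert_with_loops and the text_type shim are assumptions added for illustration; the shim is only there so the sketch also runs on Python 3, where the converted strings come back as bytes objects.

```python
try:
    text_type = unicode  # Python 2: the type displayed with a u prefix
except NameError:
    text_type = str      # assumption: on Python 3, str is the text type


def convert_with_loops(data):
    """Loop-based conversion that rebuilds the same structure it was given."""
    if isinstance(data, dict):
        ret = {}
        for key, value in data.items():
            # Store every converted pair instead of overwriting ret.
            ret[convert_with_loops(key)] = convert_with_loops(value)
    elif isinstance(data, list):
        ret = []
        for element in data:
            # Append each converted element instead of overwriting ret.
            ret.append(convert_with_loops(element))
    elif isinstance(data, text_type):
        ret = data.encode('ascii')
    else:
        # Numbers, None, booleans and so on pass through unchanged.
        ret = data
    return ret


# Hypothetical sample input resembling parsed JSON.
nested = {u'name': u'spam', u'tags': [u'a', u'b'], u'count': 3}
result = convert_with_loops(nested)
```

The dictionary branch converts keys as well as values, and the list branch returns a list of the same length, so the output mirrors the nesting of the input.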

You can also make your code pretty and Pythonic by using comprehensions.

Here, then, is what I'd recommend:

def convert(input):
    if isinstance(input, dict):
        return {convert(key): convert(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [convert(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

One final thing. I changed encode('ascii') to encode('utf-8'). My reasoning is as follows: any unicode string that contains only characters in the ASCII character set will be represented by the same byte string when encoded in ASCII as when encoded in utf-8, so using utf-8 instead of ASCII cannot break anything and the change will be invisible as long as the unicode strings you're dealing with use only ASCII characters. However, this change extends the scope of the function to be able to handle strings of characters from the entire unicode character set, rather than just ASCII ones, should such a thing ever be necessary.
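
A quick way to check that reasoning, as a sketch with made-up sample strings (the variable names are illustrative): ASCII-only text should give identical bytes under both codecs, while text containing a non-ASCII character can only be encoded with utf-8.

```python
ascii_only = u'stuff like this'
# ASCII-only text yields identical bytes under both codecs.
same_bytes = ascii_only.encode('ascii') == ascii_only.encode('utf-8')

non_ascii = u'caf\xe9'  # hypothetical sample value containing U+00E9
utf8_bytes = non_ascii.encode('utf-8')  # the accented letter becomes two bytes

try:
    non_ascii.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    # The ascii codec cannot represent U+00E9, so encoding raises.
    ascii_failed = True
```

With these inputs, same_bytes ends up True and ascii_failed ends up True, which is the behaviour described above.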

Mark Amery answered Sep 18 '22 21:09