Unpickling data from Python 2 with unicode strings in Python 3

I have data that I pickled in Python 2.7 like this:

#!/usr/bin/env python2
# coding=utf-8

import datetime
import pickle

data = {1: datetime.date(2014, 3, 18),
        'string-key': u'ünicode-string'}

with open('file.pickle', 'wb') as f:
    pickle.dump(data, f)

The only way I found to load this in Python 3.4 is:

data = pickle.load(open('file.pickle', "rb"), encoding='bytes')

Now my unicode strings are fine, but the dict keys are bytes. print(repr(data)) gives:

{1: datetime.date(2014, 3, 18), b'string-key': 'ünicode-string'}

Does anybody have an idea how to avoid either rewriting my code to use data[b'string-key'] or converting all existing files?

asked Apr 03 '14 14:04 by TNT



1 Answer

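This is a workaround rather than a real answer. The script below loads each Python 2 pickle in Python 3.4 with encoding='bytes', decodes the bytes objects to str, and writes the result to a new pickle file (it doesn't work in 3.3):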

#!/usr/bin/env python3

import pickle, glob

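def bytes_to_unicode(ob):
    """Recursively decode bytes keys and values in lists, tuples and dicts."""
    t = type(ob)
    if t in (list, tuple):
        # Decode bytes items, then recurse into nested containers.
        l = [str(i, 'utf-8') if type(i) is bytes else i for i in ob]
        l = [bytes_to_unicode(i) if type(i) in (list, tuple, dict) else i for i in l]
        ro = tuple(l) if t is tuple else l
    elif t is dict:
        # Collect the bytes keys first so the dict is not resized while iterating.
        byte_keys = [i for i in ob if type(i) is bytes]
        for bk in byte_keys:
            v = ob[bk]
            del ob[bk]
            ob[str(bk, 'utf-8')] = v
        # Decode bytes values and recurse into nested containers (in place).
        for k in ob:
            if type(ob[k]) is bytes:
                ob[k] = str(ob[k], 'utf-8')
            elif type(ob[k]) in (list, tuple, dict):
                ob[k] = bytes_to_unicode(ob[k])
        ro = ob
    else:
        # Only reached for a top-level object that is not a list, tuple or dict.
        ro = ob
        print("unprocessed object: {0} {1}".format(t, ob))
    return ro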

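for fn in glob.glob('*.pickle'):
    # encoding='bytes' leaves Python 2 str objects as bytes when loading.
    with open(fn, "rb") as f:
        data = pickle.load(f, encoding='bytes')
    ndata = bytes_to_unicode(data)
    # Write the converted data next to the original, e.g. file.pickle3.
    with open(fn + '3', "wb") as f:
        pickle.dump(ndata, f)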
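
The same idea can probably be written more compactly. The following is only a sketch, not a drop-in replacement: decode_bytes is a hypothetical name, it returns new containers instead of modifying the loaded data in place, and it assumes every bytes object in the data is really UTF-8 text (genuine binary data would be mangled or raise UnicodeDecodeError).

```python
def decode_bytes(obj, encoding="utf-8"):
    """Return a copy of obj with bytes keys, values and items decoded to str.

    Hypothetical helper for illustration; only dict, list and tuple
    containers are traversed, everything else is returned unchanged.
    """
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if type(obj) is dict:
        return {decode_bytes(k, encoding): decode_bytes(v, encoding)
                for k, v in obj.items()}
    if type(obj) in (list, tuple):
        # Rebuild the same container type from the converted items.
        return type(obj)(decode_bytes(i, encoding) for i in obj)
    return obj


sample = {b'string-key': 'text', 1: [b'a', (b'b', 2)]}
print(decode_bytes(sample))
# {'string-key': 'text', 1: ['a', ('b', 2)]}
```

It would be called in place of bytes_to_unicode(data) in the loop above.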

The Python docs say:

The pickle serialization format is guaranteed to be backwards compatible across Python releases.

I didn't find a way to pickle.load Python-2.7 pickled data in Python 3.3 -- not even data that contained only ints and dates.
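
For what it's worth, the bytes keys come from the encoding argument, which controls how Python 2 str objects are decoded on load. The snippet below is a small illustration, not a tested fix for the files in the question: the byte string is a hand-written protocol-0 pickle standing in for Python 2 output (an assumption, and it leaves out the date). Current pickle documentation says encoding='latin1' is required for datetime, date and time instances pickled by Python 2; whether that also works on 3.3 or 3.4 is not confirmed here.

```python
import pickle

# Hand-written protocol-0 pickle standing in for Python 2 output (assumption):
# a dict with one Python 2 str key (STRING opcode "S") and one unicode
# value (UNICODE opcode "V").
PY2_STYLE_PICKLE = b"(dp0\nS'string-key'\np1\nV\xfcnicode-string\np2\ns."

# encoding='bytes' leaves Python 2 str objects as bytes.
as_bytes = pickle.loads(PY2_STYLE_PICKLE, encoding="bytes")

# encoding='latin1' decodes Python 2 str objects to Python 3 str.
as_latin1 = pickle.loads(PY2_STYLE_PICKLE, encoding="latin1")

print(as_bytes)   # {b'string-key': 'ünicode-string'}
print(as_latin1)  # {'string-key': 'ünicode-string'}
```

Note that latin1 maps every byte to a character, so a Python 2 str that held UTF-8 encoded non-ASCII text would come back as mojibake rather than raising an error.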

answered Oct 02 '22 04:10 by TNT