
Python list serialization - fastest method

I need to load (deserialize) a pre-computed list of integers from a file into a Python list in a Python script. The list is large (up to millions of items), and I can choose whatever storage format I like, as long as loading is as fast as possible.

Which is the fastest method, and why?

  1. Using import on a .py file that just contains the list assigned to a variable
  2. Using cPickle's load
  3. Some other method (perhaps numpy?)

Also, how can one benchmark such things reliably?

Addendum: measuring this reliably is difficult, because import is cached, so it can't be executed multiple times in a test. Loading with pickle also gets faster after the first time, probably because of page caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec on the first run and 0.2 sec on subsequent executions of the script.
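One way around the import-caching problem described in the addendum is to run every measurement in a fresh interpreter process. The sketch below is only an illustration under assumptions: the helper names `time_fresh_runs` and `load_cost` are hypothetical, the snippet to time is passed in as a string, and each measurement includes interpreter startup, which is why a no-op baseline is subtracted.

```python
import subprocess
import sys
import time

def time_fresh_runs(snippet, cwd, runs=5):
    """Time `snippet` in a brand-new interpreter on each run.

    Returns a list of wall-clock durations in seconds. Each figure
    includes interpreter startup, so compare it against "pass".
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new process means no module cache is carried over between runs.
        subprocess.run([sys.executable, "-c", snippet], cwd=cwd, check=True)
        durations.append(time.perf_counter() - start)
    return durations

def load_cost(snippet, cwd, runs=5):
    """Estimate load time: best run of `snippet` minus best run of a no-op."""
    baseline = min(time_fresh_runs("pass", cwd, runs))
    loaded = min(time_fresh_runs(snippet, cwd, runs))
    return loaded - baseline
```

For example, `load_cost("import data_list", cwd="some_dir")` would time importing a hypothetical `data_list.py` stored in `some_dir`. Two caveats apply: the OS page cache stays warm between runs, so this measures the "subsequent executions" case rather than a cold first load, and after the first import Python reuses the compiled bytecode it stored in `__pycache__`, so the first import of a large .py file is slower than later ones.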

Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).

And yes, it's important for me that this performs quickly.

Thanks

Eli Bendersky asked Feb 17 '09 13:02



1 Answer

I would guess cPickle will be fastest if you really need the thing in a list.
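For reference, a minimal sketch of the pickle route might look like the following. It assumes Python 3, where cPickle is no longer a separate module and the standard pickle module uses its C implementation automatically when available. The function names `save_list` and `load_list` are hypothetical, and no timings are claimed for this version.

```python
import pickle

def save_list(values, filename):
    """Write a list to disk using the most compact binary pickle protocol."""
    with open(filename, "wb") as f:
        pickle.dump(values, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_list(filename):
    """Read back a pickled object; for a file written above this is a list."""
    with open(filename, "rb") as f:
        return pickle.load(f)
```

Choosing a binary protocol rather than the old text protocol 0 matters for both file size and load speed, which is why `HIGHEST_PROTOCOL` is passed explicitly here.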

If you can use an array instead (a compact sequence type from the standard library's array module), I timed this at a quarter of a second for 1 million integers:

import os
import time
from array import array

def WriteInts(theArray, filename):
    # Dump the raw machine representation of the array to disk.
    with open(filename, "wb") as f:
        theArray.tofile(f)

def ReadInts(filename):
    start = time.perf_counter()
    theArray = array('i')
    # Derive the item count from the file size instead of requesting a
    # huge count and relying on EOFError to stop the read.
    count = os.path.getsize(filename) // theArray.itemsize
    with open(filename, "rb") as f:
        theArray.fromfile(f, count)
    print("Read %d ints in %.3f s" % (len(theArray), time.perf_counter() - start))
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(0, 1000000))
    filename = "a_million_ints.dat"
    WriteInts(a, filename)
    r = ReadInts(filename)
    print("The 5th element is %d" % r[4])
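If the result really has to be a Python list, as the question asks, the array can be converted after loading. The sketch below assumes Python 3 and a file of raw native ints like the one written by the array approach above. The name `read_ints_as_list` is hypothetical, and the extra cost of the conversion has not been measured here.

```python
import os
from array import array

def read_ints_as_list(filename, typecode="i"):
    """Load a raw binary file of native ints and return a plain list."""
    values = array(typecode)
    # The file holds nothing but fixed-size items, so its size gives the count.
    count = os.path.getsize(filename) // values.itemsize
    with open(filename, "rb") as f:
        values.fromfile(f, count)
    # tolist() builds one Python int object per item.
    return values.tolist()
```

Note that such a file stores integers in the machine's native byte order and size, so it is only portable between machines that agree on both.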
Carlos A. Ibarra answered Oct 23 '22 14:10