I need to load (de-serialize) a pre-computed list of integers from a file in a Python script (into a Python list). The list is large (up to millions of items), and I can choose the format I store it in, as long as loading is as fast as possible.
Which is the fastest method, and why?
- import on a .py file that just contains the list assigned to a variable
- cPickle's load
- something else (numpy?)

Also, how can one benchmark such things reliably?
Addendum: measuring this reliably is difficult, because import is cached, so it can't be executed multiple times in a test. Loading with pickle also gets faster after the first time, probably because of page caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec the first time the script is run, and 0.2 sec on subsequent executions.
Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).
And yes, it's important for me that this performs quickly.
Thanks
Note: You can serialize any Python data structure such as dictionaries, tuples, lists, integer numbers, strings, sets, and floating-point numbers.
Use json.dumps(list) to serialize a list into a JSON string (a JSON array).
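As a minimal sketch of the JSON route, the following round-trips a small list of integers through json.dumps() and json.loads(). The variable names are illustrative only.

```python
import json

numbers = list(range(10))

# Serialize the list into a JSON string (a JSON array of numbers).
encoded = json.dumps(numbers)

# Parse the JSON string back into a Python list.
decoded = json.loads(encoded)
```

JSON is text, so for millions of integers it is likely to be larger on disk and slower to parse than a binary format.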
To serialize a Python object such as a dictionary and store the resulting byte stream in a file, use pickle's dump() method.
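A small sketch of that, assuming Python 3 (where the C-accelerated pickler is used automatically, so there is no separate cPickle import). The helper names save_pickle and load_pickle and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    # Pickle files are binary, so open in "wb" mode.
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

settings = {"name": "example", "values": [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), "settings.pkl")
save_pickle(settings, path)
restored = load_pickle(path)
```

Only unpickle files you created yourself: loading a pickle can execute arbitrary code.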
Use the cls keyword argument of json.dump() and json.dumps() to plug in a custom JSON encoder that converts a NumPy array into JSON-formatted data. To serialize a NumPy array to JSON, the encoder has to convert it into a list with the array's tolist() method.
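Roughly, such an encoder could look like the sketch below. ToListEncoder is a hypothetical name, and the check is duck-typed on tolist() rather than tied to NumPy, so the example uses array.array from the standard library as a stand-in and runs without NumPy installed. A numpy.ndarray also has a tolist() method, so it should go through the same path.

```python
import json
from array import array

class ToListEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts array-like objects via tolist()."""

    def default(self, obj):
        # numpy.ndarray and array.array both expose tolist().
        if hasattr(obj, "tolist"):
            return obj.tolist()
        # Defer to the base class, which raises TypeError.
        return super().default(obj)

# array.array stands in for a NumPy array here (an assumption of this sketch).
data = {"values": array("i", [1, 2, 3])}
encoded = json.dumps(data, cls=ToListEncoder)
```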
I would guess cPickle will be fastest if you really need the thing in a list.
If you can use an array (a compact sequence type from the standard-library array module) instead of a list, I timed the following at a quarter of a second for 1 million integers:
import os
from array import array
from datetime import datetime

def WriteInts(theArray, filename):
    # Dump the array's raw machine values to a binary file.
    with open(filename, "wb") as f:
        theArray.tofile(f)

def ReadInts(filename):
    d = datetime.now()
    theArray = array('i')
    # Work out the item count from the file size instead of asking for
    # a huge count and relying on EOFError to stop the read.
    count = os.path.getsize(filename) // theArray.itemsize
    with open(filename, "rb") as f:
        theArray.fromfile(f, count)
    print("Read %d ints in %s" % (len(theArray), datetime.now() - d))
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(0, 1000000))
    filename = "a_million_ints.dat"
    WriteInts(a, filename)
    r = ReadInts(filename)
    print("The 5th element is %d" % (r[4]))
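On the benchmarking part of the question, here is a rough sketch of how the two candidates might be compared, assuming Python 3. The helper names (load_pickle, load_array, best_time) and file names are invented for this example. Each loader is run several times and the fastest run is kept, so the result reflects a warm OS page cache; the slower first-run case from the question's addendum is not captured and would need the cache to be cold (for example, after a reboot).

```python
import os
import pickle
import tempfile
import time
from array import array

def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def load_array(path):
    result = array("i")
    with open(path, "rb") as f:
        result.fromfile(f, os.path.getsize(path) // result.itemsize)
    return result.tolist()  # the question asks for a real list

def best_time(loader, path, repeats=5):
    """Return the fastest of several runs, in seconds (warm OS cache)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        loader(path)
        timings.append(time.perf_counter() - start)
    return min(timings)

numbers = list(range(1000000))
workdir = tempfile.mkdtemp()
pickle_path = os.path.join(workdir, "ints.pkl")
array_path = os.path.join(workdir, "ints.dat")

# Write the same data in both formats.
with open(pickle_path, "wb") as f:
    pickle.dump(numbers, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(array_path, "wb") as f:
    array("i", numbers).tofile(f)

for name, loader, path in [("pickle", load_pickle, pickle_path),
                           ("array", load_array, array_path)]:
    print("%-6s %.4f s" % (name, best_time(loader, path)))
```

Taking the minimum rather than the mean is a common choice because it filters out noise from other processes. The numbers will vary by machine and Python version, so treat them as relative rather than absolute.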