
Can I stream a Python pickle list, tuple, or other iterable data type?

I often work with comma- or tab-separated data files that might look like this:

key1,1,2.02,hello,4
key2,3,4.01,goodbye,6
...

I might read and pre-process this in Python into a list of lists, like this:

[ [ 'key1', 1, 2.02, 'hello', 4 ], [ 'key2', 3, 4.01, 'goodbye', 6 ] ]

Sometimes, I like saving this list of lists as a pickle, since it preserves the different types of my entries. If the pickled file is big, though, it would be great to read this list of lists back in a streaming fashion.

In Python, to load a text file as a stream, I use the following to print out each line:

with open( 'big_text_file.txt' ) as f:
    for line in f:
        print( line )

Can I do something similar for a Python list, i.e.:

import pickle
with open( 'big_pickled_list.pkl', 'rb' ) as p:
    for entry in pickle.load_streaming( p ): # note: pickle.load_streaming doesn't exist
        print( entry )

Is there a pickle function like "load_streaming"?

asked Jul 12 '13 at 20:07 by wwwilliam



2 Answers

This would work.

What it does, however, is unpickle one object from the file and then print the rest of the file's content to stdout.

What you could do is something like:

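import pickle  # Python 3: pickle already uses the C implementation (cPickle in Python 2)

with open('big_pickled_list.pkl', 'rb') as p:  # pickles are binary data
    try:
        while True:
            print(pickle.load(p))  # one pickled object per call
    except EOFError:
        pass  # no more objects in the file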

That would unpickle all objects from the file until reaching EOF.
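
One caveat worth spelling out: the loop only hands back entries one at a time if the file was written with one dump call per entry. A file that holds a single pickled list comes back whole from the first load. Below is a minimal Python 3 sketch of both cases, using the standard pickle module. The helper names dump_each and count_objects, and the temporary file names, are made up for the demo.

```python
import os
import pickle
import tempfile


def dump_each(entries, path):
    """Write every entry as its own pickle, back to back in one file."""
    with open(path, "wb") as f:
        for entry in entries:
            pickle.dump(entry, f)


def count_objects(path):
    """Count how many separate pickles the EOFError loop finds in a file."""
    count = 0
    with open(path, "rb") as f:
        try:
            while True:
                pickle.load(f)
                count += 1
        except EOFError:
            pass
    return count


rows = [["key1", 1, 2.02, "hello", 4], ["key2", 3, 4.01, "goodbye", 6]]
tmp_dir = tempfile.mkdtemp()

# Case 1: the whole list pickled at once, so the loop sees a single object.
whole_path = os.path.join(tmp_dir, "whole_list.pkl")
with open(whole_path, "wb") as f:
    pickle.dump(rows, f)

# Case 2: one pickle per row, so the loop sees each row separately.
per_row_path = os.path.join(tmp_dir, "per_row.pkl")
dump_each(rows, per_row_path)

print(count_objects(whole_path), count_objects(per_row_path))
```

So to stream a big list back, it has to be saved entry by entry in the first place.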


If you want something that works like for line in f:, you can wrap this up easily:

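import pickle

def unpickle_iter(file):
    """Yield pickled objects from a binary file until EOF."""
    try:
        while True:
            yield pickle.load(file)
    except EOFError:
        return  # ends the generator; raising StopIteration here is an error since Python 3.7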

Now you can just do this:

with open('big_pickled_list.pkl', 'rb') as file:
    for item in unpickle_iter(file):
        print(item)  # use item ...
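
To check that the generator really is lazy, here is a small Python 3 sketch that pulls only the first two entries out of many. It assumes an in-memory io.BytesIO buffer as a stand-in for a real file, and it re-declares unpickle_iter with the standard pickle module so the snippet runs on its own.

```python
import io
import itertools
import pickle


def unpickle_iter(file):
    """Yield pickled objects from a binary file-like object until EOF."""
    try:
        while True:
            yield pickle.load(file)
    except EOFError:
        return


# Stand-in for a big file: 1000 rows, each written as its own pickle.
buffer = io.BytesIO()
for i in range(1000):
    pickle.dump(["key%d" % i, i, i * 1.5], buffer)
total_size = buffer.tell()
buffer.seek(0)

# Take only the first two rows; the remaining rows are never unpickled.
first_two = list(itertools.islice(unpickle_iter(buffer), 2))
print(first_two)
print(buffer.tell() < total_size)
```

Because nothing is loaded until the loop asks for it, memory use depends on the size of one entry, not on the size of the file.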
answered Oct 27 '22 at 20:10 by mata


To follow up on a comment I made on the accepted solution, I recommend a loop more like this:

import pickle

with open('big_pickled_list.pkl', 'rb') as p:  # binary mode gives a buffered reader with peek()
    while p.peek(1):  # empty bytes means end of file
        print(pickle.load(p))

This way, a corrupted or truncated object in the file still raises an exception (such as EOFError) instead of being silently treated as the end of the file.

For completeness:

import pickle

def unpickle_iter(file):
    # file must be a buffered binary reader, e.g. from open(path, 'rb')
    while file.peek(1):
        yield pickle.load(file)
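
A short Python 3 sketch of the difference in behaviour. It assumes an in-memory buffer wrapped in io.BufferedReader to get peek() without touching disk, and make_reader is a made-up helper name. With the peek-based loop, data cut off in the middle of an object raises instead of ending quietly. The exact exception type (EOFError or pickle.UnpicklingError) depends on where the data stops, so the sketch catches both.

```python
import io
import pickle


def unpickle_iter(file):
    """Yield pickled objects while the buffered reader still has bytes."""
    while file.peek(1):
        yield pickle.load(file)


def make_reader(data):
    """Hypothetical helper: wrap raw bytes in a reader that supports peek()."""
    return io.BufferedReader(io.BytesIO(data))


rows = [["key1", 1, 2.02, "hello", 4], ["key2", 3, 4.01, "goodbye", 6]]
data = b"".join(pickle.dumps(row) for row in rows)

# Intact data: both rows come back and the loop stops cleanly at EOF.
intact = list(unpickle_iter(make_reader(data)))

# Truncated data: the second object is cut short, so load() raises.
recovered = []
error = None
try:
    for row in unpickle_iter(make_reader(data[:-5])):
        recovered.append(row)
except (EOFError, pickle.UnpicklingError) as exc:
    error = exc

print(intact == rows, recovered == rows[:1], type(error).__name__)
```

The first row is still recovered before the error, so the caller can decide whether to keep the partial result or discard it.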
answered Oct 27 '22 at 20:10 by D. A.