 

How to load one line at a time from a pickle file?

I have a large dataset: 20,000 x 40,000 as a numpy array. I have saved it as a pickle file.

Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.

How can I read only a few randomly-chosen (without replacement) lines from a pickle file?

asked Jun 21 '16 by StatsSorceress

People also ask

How do you load pickled data?

To retrieve pickled data, the steps are quite simple: use the pickle.load() function. Its primary argument is the file object you get by opening the file in read-binary ('rb') mode.
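For instance, a minimal load might look like this (assuming a file named mydata.pkl already exists):

import pickle

# Open in read-binary mode and hand the file object to pickle.load().
with open('mydata.pkl', 'rb') as f:
    obj = pickle.load(f)
print(obj)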

How do I see the contents of a pickle file in Python?

We can use the pandas library to read a pickle file in Python. The pandas module has a read_pickle() method for this. It accepts a filepath_or_buffer argument: the file path, URL, or buffer from which the pickle file will be loaded.
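For example (assuming mydata.pkl contains a pickled DataFrame):

import pandas as pd

# read_pickle accepts a file path, URL, or buffer.
df = pd.read_pickle('mydata.pkl')
print(df.head())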

Is reading pickle faster than CSV?

Pickle: Pickle is Python's native format for object serialization. Its advantage is that it can serialize almost any Python object, not just tabular data. It is typically much faster to read than a CSV file, and its binary representation often makes the file noticeably smaller than the equivalent CSV.
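This is easy to check informally on your own data; a rough sketch (file paths are placeholders):

import time
import numpy
import pandas

df = pandas.DataFrame(numpy.random.rand(100000, 10))
df.to_csv('/tmp/data.csv', index=False)
df.to_pickle('/tmp/data.pkl')

for name, reader, path in [('csv', pandas.read_csv, '/tmp/data.csv'),
                           ('pickle', pandas.read_pickle, '/tmp/data.pkl')]:
    start = time.time()
    reader(path)
    print(name, time.time() - start)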


2 Answers

You can write pickles incrementally to a file, which allows you to load them incrementally as well.

Take the following example. Here we iterate over the items of a list and pickle each one in turn (this uses Python 2's cPickle; in Python 3, use the pickle module instead).

>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()

Now we can do the same process in reverse and load each object as needed. For the purpose of example, let's say that we just want the first item and don't want to iterate over the entire file.

>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1

At this point, the file stream has only advanced as far as the first object. The remaining objects weren't loaded, which is exactly the behavior you want. As proof, you can read the rest of the file and see that the remaining data is still sitting there.

>>> f.read()
'I2\n.I3\n.'
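In Python 3 the module is just pickle, and the same trick extends to random access if you record the file offset of each dump; a minimal sketch (file name and array size are made up):

import pickle
import numpy

a = numpy.arange(20).reshape(4, 5)

offsets = []
with open('rows.pkl', 'wb') as f:
    for row in a:
        offsets.append(f.tell())   # where this row's pickle starts
        pickle.dump(row, f)

with open('rows.pkl', 'rb') as f:
    f.seek(offsets[2])             # jump straight to row 2
    print(pickle.load(f))          # only this object is deserialized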
answered by Alex Smith

Since pickle's internal format is opaque, you need a different storage method. The script below uses the tobytes() function to save the data line by line in a raw file.

Since the length of each line is known, its offset in the file can be computed and accessed via seek() and read(). After that, the bytes are converted back to an array with the frombuffer() function.

The big disclaimer, however, is that the shape of the array is not saved (it could be stored as well, but that adds complexity) and that this method may not be as portable as a pickled array.

As @PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution.
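For reference, a memmap-based sketch of the same idea (shape, dtype, and path are assumptions; the file must already contain the raw bytes, e.g. as written by dumparray below):

import numpy

# Map the raw file without loading it; only the touched rows are paged in.
mm = numpy.memmap('/tmp/array', dtype='float64', mode='r', shape=(20000, 40000))

# Draw a 100-row minibatch without replacement.
idx = numpy.random.choice(20000, size=100, replace=False)
batch = numpy.array(mm[idx])   # copy the selected rows into memory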

Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).

from __future__ import print_function
import numpy
import random

def dumparray(a, path):
    """Write each row of the 2-D array to the file as raw bytes, back to back."""
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i, ...].tobytes())

class RandomLineAccess(object):
    """Random access to the rows of a raw array file with fixed row length."""

    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols * dtype.itemsize

    def read_line(self, line):
        # Each row occupies line_length bytes, so its offset is line * line_length.
        offset = line * self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)

        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()


def main():
    lines = 10
    cols = 10
    path = '/tmp/array'

    a = numpy.zeros((lines, cols))
    dtype = a.dtype

    for i in range(lines):
        # add some data to distinguish lines
        a[i, ...].fill(i)

    dumparray(a, path)
    rla = RandomLineAccess(path, cols, dtype)

    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))

if __name__ == '__main__':
    main()
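To sample minibatches without replacement, as the question asks, random.sample could stand in for random.choice inside main(); a small sketch (the batch size of 5 is arbitrary):

batch_indices = random.sample(range(lines), 5)
batch = numpy.stack([rla.read_line(i) for i in batch_indices])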
answered by code_onkel