 

How to load one line at a time from a pickle file?

I have a large dataset: 20,000 x 40,000 as a numpy array. I have saved it as a pickle file.

Instead of reading this huge dataset into memory, I'd like to only read a few (say 100) rows of it at a time, for use as a minibatch.

How can I read only a few randomly-chosen (without replacement) lines from a pickle file?

asked Jun 21 '16 by StatsSorceress

People also ask

How do you load pickled data?

To retrieve pickled data, the steps are quite simple: use the pickle.load() function. Its primary argument is the file object you get by opening the file in read-binary ('rb') mode.
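For instance, a minimal load might look like this (assuming a file named mydata.pkl already exists):

import pickle

# Open in read-binary mode and hand the file object to pickle.load().
with open('mydata.pkl', 'rb') as f:
    obj = pickle.load(f)
print(obj)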

How do I see the contents of a pickle file in Python?

We can use the pandas library to read a pickle file in Python. The pandas module has a read_pickle() method for this. It accepts a filepath_or_buffer argument: the file path, URL, or buffer from which the pickle file will be loaded.
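For example (assuming mydata.pkl contains a pickled DataFrame):

import pandas as pd

# read_pickle accepts a file path, URL, or buffer.
df = pd.read_pickle('mydata.pkl')
print(df.head())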

Is reading pickle faster than CSV?

Pickle: Pickle is Python's native format for object serialization. Its advantage is that it can serialize almost any Python object, not just tabular data. It is typically much faster to read than a CSV file, and its binary representation often makes the file noticeably smaller than the equivalent CSV.
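This is easy to check informally on your own data; a rough sketch (file paths are placeholders):

import time
import numpy
import pandas

df = pandas.DataFrame(numpy.random.rand(100000, 10))
df.to_csv('/tmp/data.csv', index=False)
df.to_pickle('/tmp/data.pkl')

for name, reader, path in [('csv', pandas.read_csv, '/tmp/data.csv'),
                           ('pickle', pandas.read_pickle, '/tmp/data.pkl')]:
    start = time.time()
    reader(path)
    print(name, time.time() - start)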


2 Answers

You can write pickles incrementally to a file, which allows you to load them incrementally as well.

Take the following example. Here we iterate over the items of a list and pickle each one in turn (this uses Python 2's cPickle; in Python 3, use the pickle module instead).

>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()

Now we can do the same process in reverse and load each object as needed. For the purpose of example, let's say that we just want the first item and don't want to iterate over the entire file.

>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1

At this point, the file stream has only advanced as far as the first object. The remaining objects weren't loaded, which is exactly the behavior you want. As proof, you can read the rest of the file and see that the remaining data is still sitting there.

>>> f.read()
'I2\n.I3\n.'
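In Python 3 the module is just pickle, and the same trick extends to random access if you record the file offset of each dump; a minimal sketch (file name and array size are made up):

import pickle
import numpy

a = numpy.arange(20).reshape(4, 5)

offsets = []
with open('rows.pkl', 'wb') as f:
    for row in a:
        offsets.append(f.tell())   # where this row's pickle starts
        pickle.dump(row, f)

with open('rows.pkl', 'rb') as f:
    f.seek(offsets[2])             # jump straight to row 2
    print(pickle.load(f))          # only this object is deserialized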
answered by Alex Smith

Since pickle's internal format is opaque, you need a different storage method. The script below uses the tobytes() function to save the data line by line in a raw file.

Since the length of each line is known, its offset in the file can be computed and accessed via seek() and read(). After that, the bytes are converted back to an array with the frombuffer() function.

The big disclaimer, however, is that the shape of the array is not saved (it could be stored as well, but that adds complexity) and that this method may not be as portable as a pickled array.

As @PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution.
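For reference, a memmap-based sketch of the same idea (shape, dtype, and path are assumptions; the file must already contain the raw bytes, e.g. as written by dumparray below):

import numpy

# Map the raw file without loading it; only the touched rows are paged in.
mm = numpy.memmap('/tmp/array', dtype='float64', mode='r', shape=(20000, 40000))

# Draw a 100-row minibatch without replacement.
idx = numpy.random.choice(20000, size=100, replace=False)
batch = numpy.array(mm[idx])   # copy the selected rows into memory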

Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).

from __future__ import print_function
import numpy
import random

def dumparray(a, path):
    """Write each row of the 2-D array to the file as raw bytes, back to back."""
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i, ...].tobytes())

class RandomLineAccess(object):
    """Random access to the rows of a raw array file with fixed row length."""

    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols * dtype.itemsize

    def read_line(self, line):
        # Each row occupies line_length bytes, so its offset is line * line_length.
        offset = line * self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)

        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()


def main():
    lines = 10
    cols = 10
    path = '/tmp/array'

    a = numpy.zeros((lines, cols))
    dtype = a.dtype

    for i in range(lines):
        # add some data to distinguish lines
        a[i, ...].fill(i)

    dumparray(a, path)
    rla = RandomLineAccess(path, cols, dtype)

    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))

if __name__ == '__main__':
    main()
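To sample minibatches without replacement, as the question asks, random.sample could stand in for random.choice inside main(); a small sketch (the batch size of 5 is arbitrary):

batch_indices = random.sample(range(lines), 5)
batch = numpy.stack([rla.read_line(i) for i in batch_indices])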
answered by code_onkel