How can I read successive arrays from a binary file using `np.fromfile`?

Tags:

python

numpy

I want to read a binary file in Python, the exact layout of which is stored in the binary file itself.

The file contains a sequence of two-dimensional arrays, with the row and column dimensions of each array stored as a pair of integers preceding its contents. I want to successively read all of the arrays contained within the file.

I know this can be done with f = open("myfile", "rb") and f.read(numberofbytes), but that approach is clumsy because I would then need to convert the raw output into meaningful data structures myself. I would like to use numpy's np.fromfile with a custom dtype, but I have not found a way to read part of the file, leave it open, and then continue reading with a modified dtype.

I know I can use os to f.seek(numberofbytes, os.SEEK_SET) and np.fromfile multiple times, but this would mean a lot of unnecessary jumping around in the file.

In short, I want MATLAB's fread (or at least something like C++ ifstream read).

What is the best way to do this?

asked Jul 03 '15 by jacob

1 Answer

You can pass an open file object to np.fromfile, read the dimensions of the first array, then read the array contents (again using np.fromfile), and repeat the process for additional arrays within the same file.

For example:

import numpy as np
import os

def iter_arrays(fname, array_ndim=2, dim_dtype=np.int64, array_dtype=np.double):
    # note: np.int64 rather than the old np.int alias, which was
    # deprecated in NumPy 1.20 and removed in 1.24

    with open(fname, 'rb') as f:
        fsize = os.fstat(f.fileno()).st_size

        # while we haven't yet reached the end of the file...
        while f.tell() < fsize:

            # read the dimensions for this array
            dims = np.fromfile(f, dim_dtype, array_ndim)

            # read the array contents and reshape them accordingly
            yield np.fromfile(f, array_dtype, np.prod(dims)).reshape(dims)

Example usage:

# write some random arrays to an example binary file
x = np.random.randn(100, 200)
y = np.random.randn(300, 400)

with open('/tmp/testbin', 'wb') as f:
    # write the shapes as int64 explicitly, since the default integer
    # dtype is platform-dependent (32-bit on Windows)
    np.array(x.shape, dtype=np.int64).tofile(f)
    x.tofile(f)
    np.array(y.shape, dtype=np.int64).tofile(f)
    y.tofile(f)

# read the contents back
x1, y1 = iter_arrays('/tmp/testbin')

# check that they match the input arrays
assert np.allclose(x, x1) and np.allclose(y, y1)

If the arrays are large, you might consider using np.memmap with its offset= parameter in place of np.fromfile, so that each array is returned as a memory-map backed by the file rather than being loaded into RAM.
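A minimal sketch of that memmap variant (the function name iter_arrays_mmap and the explicit offset bookkeeping are my own; it assumes the same file layout as above, with int64 shape headers):

```python
import os
import numpy as np

def iter_arrays_mmap(fname, array_ndim=2, dim_dtype=np.int64, array_dtype=np.double):
    """Yield each array as a read-only memory-map instead of loading it.

    Only the small dimension headers are read eagerly; the array
    contents stay on disk until they are actually accessed.
    """
    fsize = os.path.getsize(fname)
    dim_bytes = array_ndim * np.dtype(dim_dtype).itemsize
    offset = 0
    with open(fname, 'rb') as f:
        while offset < fsize:
            # read this array's shape header from the current offset
            f.seek(offset)
            dims = np.fromfile(f, dim_dtype, array_ndim)
            offset += dim_bytes

            # map the array contents starting just past the header
            yield np.memmap(fname, dtype=array_dtype, mode='r',
                            offset=offset, shape=tuple(dims))

            # advance past this array's contents to the next header
            offset += int(np.prod(dims)) * np.dtype(array_dtype).itemsize
```

Each yielded object behaves like a regular ndarray for slicing and arithmetic, but only the pages you touch are paged into memory.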

answered Sep 24 '22 by ali_m