I have a binary file that contains a dense n*m
matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy
array?
The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m
bytes in length) and must be easy to produce from non-Python code.
edit: It is imperative that the method produce a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy).
Sometimes we need to deal with NumPy arrays that are too big to fit in system memory. A common solution is to use memory mapping and implement out-of-core computations: the array is stored in a file on disk, and we create a memory-mapped object to this file that can be used like a regular NumPy array.
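numpy.memmap can also produce the Fortran-ordered view directly via its order argument. A minimal sketch, assuming the file contains exactly 4*n*m bytes of raw float32 data (the filename "matrix.bin" and the dimensions are placeholders):
import numpy as np

n, m = 1000, 2000   # placeholder dimensions, assumed known from the producer
a = np.memmap("matrix.bin", dtype=np.float32, mode="r", shape=(n, m), order="F")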
NumPy provides fromfile()
to read binary data.
a = numpy.fromfile("filename", dtype=numpy.float32)
will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m
matrix, you can reshape it:
a = a.reshape((n, m), order="F")
[EDIT: The reshape()
actually copies the data in this case (see the comments). To do it without copying, use
a = a.reshape((m, n)).T
Thanks to Joe Kington for pointing this out.]
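Putting the pieces together, a minimal end-to-end sketch (the filename and dimensions are placeholders, assumed known from the file's producer):
import numpy as np

n, m = 1000, 2000                  # placeholder dimensions
a = np.fromfile("filename", dtype=np.float32)
assert a.size == n * m             # sanity check: the file holds exactly n*m floats
a = a.reshape((m, n)).T            # reinterpret as m x n in C order, then transpose: no copy
assert a.flags["F_CONTIGUOUS"]     # result is a Fortran-ordered n x m matrix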
But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both tools have FAQ entries comparing themselves to the other. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of the two projects differ slightly).
HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I actually don't know which language you want to use to write the data...
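To illustrate the h5py side, a minimal sketch (the filename "data.h5" and dataset name "matrix" are placeholders; it assumes the producer wrote the matrix transposed, as an m x n dataset, since HDF5 stores data in C order):
import h5py

with h5py.File("data.h5", "r") as f:
    a = f["matrix"][:].T   # read the m x n dataset, then transpose: an n x m Fortran-ordered array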
Basically, NumPy stores arrays as flat vectors; the multiple dimensions are just an illusion created by views and the strides that the NumPy iterator uses.
For a thorough but easy-to-follow explanation of how NumPy works internally, see the excellent chapter 19 of the book Beautiful Code.
At least NumPy's array()
and reshape()
take an order argument accepting C order ('C'), Fortran order ('F'), or preserved order ('A').
Also see the question How to force numpy array order to fortran style?
>>> import numpy as np
>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])
>>> a.strides
(32, 8)
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])
>>> a.strides
(8, 24)
Also, you can always get the other kind of view using an array's T attribute (the transpose):
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
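Assigning to strides modifies the array in place; the same view can be built non-destructively with NumPy's as_strided helper (a sketch assuming the default 8-byte integer dtype; note that as_strided does no bounds checking, so shape and strides must be chosen carefully):
>>> from numpy.lib.stride_tricks import as_strided
>>> a = np.arange(12)
>>> as_strided(a, shape=(3, 4), strides=(8, 24))
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])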