Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

numpy: efficiently reading a large array

I have a binary file that contains a dense n*m matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy array?

The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m bytes in length) and must be easy to produce from non-Python code.

edit: It is imperative that the method produces a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy.)

like image 574
NPE Avatar asked Dec 06 '10 11:12

NPE


People also ask

How do you deal with a large NumPy array?

Sometimes, we need to deal with NumPy arrays that are too big to fit in the system memory. A common solution is to use memory mapping and implement out-of-core computations. The array is stored in a file on the hard drive, and we create a memory-mapped object to this file that can be used as a regular NumPy array.

Is NumPy array memory efficient?

NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.

How can I speed up my NumPy operation?

By explicitly declaring the "ndarray" data type, your array processing can be 1250x faster. This tutorial will show you how to speed up the processing of NumPy arrays using Cython. By explicitly specifying the data types of variables in Python, Cython can give drastic speed increases at runtime.

What is the maximum size of NumPy array?

There is no general maximum array size in numpy. Of course there is, it is the size of np. intp datatype. Which for 32bit versions may only be 32bits...


2 Answers

NumPy provides fromfile() to read binary data.

a = numpy.fromfile("filename", dtype=numpy.float32)

will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m matrix, you can reshape it:

a = a.reshape((n, m), order="FORTRAN")

[EDIT: The reshape() actually copies the data in this case (see the comments). To do it without cpoying, use

a = a.reshape((m, n)).T

Thanks to Joe Kingtion for pointing this out.]

But to be honest, if your matrix has several gigabytes, I would go for a HDF5 tool like h5py or PyTables. Both of the tools have FAQ entries comparing the tool to the other one. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of both projects are slightly different).

HDF5 files can be written from most programming language used in data analysis. The list of interfaces in the linked Wikipedia article is not complete, for example there is also an R interface. But I actually don't know which language you want to use to write the data...

like image 102
Sven Marnach Avatar answered Oct 05 '22 22:10

Sven Marnach


Basically Numpy stores the arrays as flat vectors. The multiple dimensions are just an illusion created by different views and strides that the Numpy iterator uses.

For a thorough but easy to follow explanation how Numpy internally works, see the excellent chapter 19 on The Beatiful Code book.

At least Numpy array() and reshape() have an argument for C ('C'), Fortran ('F') or preserved order ('A'). Also see the question How to force numpy array order to fortran style?

An example with the default C indexing (row-major order):

>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])

>>> a.strides
(32, 8)

Indexing using Fortran order (column-major order):

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])

>>> a.strides
(8, 24)

The other view

Also, you can always get the other kind of view using the parameter T of an array:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

You can also manually set the strides:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
like image 31
peterhil Avatar answered Oct 05 '22 21:10

peterhil