I have a binary file that contains a dense n*m
matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy
array?
The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m
bytes in length) and must be easy to produce from non-Python code.
edit: It is imperative that the method produce a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy).
Sometimes we need to deal with NumPy arrays that are too big to fit in system memory. A common solution is to use memory mapping and implement out-of-core computations: the array is stored in a file on disk, and we create a memory-mapped object to this file that can be used like a regular NumPy array.
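numpy.memmap can also produce the Fortran-ordered view directly via its order argument. A minimal sketch, assuming the file contains exactly 4*n*m bytes of raw float32 data (the filename "matrix.bin" and the dimensions are placeholders):
import numpy as np

n, m = 1000, 2000   # placeholder dimensions, assumed known from the producer
a = np.memmap("matrix.bin", dtype=np.float32, mode="r", shape=(n, m), order="F")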
NumPy provides fromfile()
to read binary data.
a = numpy.fromfile("filename", dtype=numpy.float32)
will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m
matrix, you can reshape it:
a = a.reshape((n, m), order="F")
[EDIT: The reshape()
actually copies the data in this case (see the comments). To do it without copying, use
a = a.reshape((m, n)).T
Thanks to Joe Kington for pointing this out.]
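Putting the pieces together, a minimal end-to-end sketch (the filename and dimensions are placeholders, assumed known from the file's producer):
import numpy as np

n, m = 1000, 2000                  # placeholder dimensions
a = np.fromfile("filename", dtype=np.float32)
assert a.size == n * m             # sanity check: the file holds exactly n*m floats
a = a.reshape((m, n)).T            # reinterpret as m x n in C order, then transpose: no copy
assert a.flags["F_CONTIGUOUS"]     # result is a Fortran-ordered n x m matrix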
But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both tools have FAQ entries comparing themselves to the other. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of the two projects differ slightly).
HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I actually don't know which language you want to use to write the data...
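To illustrate the h5py side, a minimal sketch (the filename "data.h5" and dataset name "matrix" are placeholders; it assumes the producer wrote the matrix transposed, as an m x n dataset, since HDF5 stores data in C order):
import h5py

with h5py.File("data.h5", "r") as f:
    a = f["matrix"][:].T   # read the m x n dataset, then transpose: an n x m Fortran-ordered array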
Basically, NumPy stores arrays as flat vectors; the multiple dimensions are just an illusion created by views and the strides that the NumPy iterator uses.
For a thorough but easy-to-follow explanation of how NumPy works internally, see the excellent chapter 19 of the book Beautiful Code.
At least NumPy's array()
and reshape()
take an order argument accepting C order ('C'), Fortran order ('F'), or preserved order ('A').
Also see the question How to force numpy array order to fortran style?
>>> import numpy as np
>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])
>>> a.strides
(32, 8)
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])
>>> a.strides
(8, 24)
Also, you can always get the other kind of view using an array's T attribute (the transpose):
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
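Assigning to strides modifies the array in place; the same view can be built non-destructively with NumPy's as_strided helper (a sketch assuming the default 8-byte integer dtype; note that as_strided does no bounds checking, so shape and strides must be chosen carefully):
>>> from numpy.lib.stride_tricks import as_strided
>>> a = np.arange(12)
>>> as_strided(a, shape=(3, 4), strides=(8, 24))
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])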