Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best practices with reading and operating on fortran ordered arrays with numpy

Tags:

python

numpy

I'm reading ascii and binary files that all specify 3 dimensional arrays in fortran order. I want to perform some arbitrary manipulations on these arrays, then export them to the same ascii or binary format.

I'm confused on the best ways to deal with these arrays in my library. My current design seems prone to error because I have to keep reshaping things from the default C order if any new array is created.

Current design:

I have a few functions that read these files and return numpy arrays. The read functions all behave in a similar way and essentially read in the data and return something like:

return array.reshape((i, j, k), order='F')

The way I understand it, I'm returning a view for fortran order onto the original array.

My code assumes all the arrays are in fortran order. This means any new operations that might create a new array I make sure to use reshape to convert it back to fortran order.

This seems very error-prone because I have to pay close attention to any operation that creates a new array and make sure to reshape it into fortran order since the default is usually C order.

I later might have to export these arrays to binary or ascii again and need to maintain the fortran ordering. So, I use numpy.nditer to write each element out in the fortran order.

Concerns:

  • The current approach seems very error-prone since I typically think in C order. I'm afraid that I'll always be getting bitten by missing calls to reshape that forces things in C order.

    • I'd like to not have to worry about the ordering of the array elements except when reading the input files or writing the data to the output files.
  • The current approach seems messy because the indexes can be interpreted different ways and things can get confusing.

    • When dealing with fortran arrays the tuple ordering for indexes is backwards, right?
    • So, x[(1, 2, 3)] for a fortran array means k = 1, j = 2, and i = 3 whereas x[(1, 2, 3)] for a C order array means k = 3, j = 2, i = 1 correct?
    • This means that me and users of my library must always think of indexes in (k, j, i) order, not what we are C/Python programmers typically think in, (i, j, k).

Question:

Is there a best practice for doing this type of thing? In an ideal world I'd like to read in the fortran ordered arrays, then forget about ordering until I export to a file. However, I'm afraid I'll keep misinterpreting the indexes, etc.

I've read through the only numpy documentation on this that I can find, http://docs.scipy.org/doc/numpy/reference/internals.html#multidimensional-array-indexing-order-issues. However, the concept still seems as clear as mud to me. Maybe I just need a different explanation of the numpy docs, http://docs.scipy.org/doc/numpy/reference/internals.html#multidimensional-array-indexing-order-issues.

like image 409
durden2.0 Avatar asked Mar 13 '14 17:03

durden2.0


People also ask

Is Fortran faster than NumPy?

For the 1,000,000,000 element arrays, the Fortran code (without the O2 flag) was only 3.7% faster than the NumPy code. The parallel Numba code really shines with the 8-cores of the AMD-FX870, which was about 4 times faster than MATLAB, and 3 times faster than Numpy.

Does NumPy have tools for reading or writing array based datasets to disk?

Tools for reading / writing array data to disk and working with memory-mapped files. Linear algebra, random number generation, and Fourier transform capabilities.

What is Fortran order in NumPy?

Fortran order/ array is a special case in which all elements of an array are stored in column-major order. Sometimes we need to display array in fortran order, for this numpy has a function known as numpy. nditer().

What advantage does vectorization with NumPy arrays provide?

The concept of vectorized operations on NumPy allows the use of more optimal and pre-compiled functions and mathematical operations on NumPy array objects and data sequences. The Output and Operations will speed up when compared to simple non-vectorized operations.


1 Answers

Numpy abstracts away the difference between Fortran ordering and C-ordering at the python level. (In fact, you can even have other orderings for >2d arrays with numpy. They're all treated the same at the python level.)

The only time you'll need to worry about C vs F ordering is when you're reading/writing to disk or passing the array to lower-level functions.

A Simple Example

As an example, let's make a simple 3D array in both C order and Fortran order:

In [1]: import numpy as np

In [2]: c_order = np.arange(27).reshape(3,3,3)

In [3]: f_order = c_order.copy(order='F')

In [4]: c_order
Out[4]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

In [5]: f_order
Out[5]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

Notice that they both look identical (they are at the level we're interacting with them). How can you tell that they're in different orderings? First off, let's take a look at the flags (pay attention to C_CONTIGUOUS vs F_CONTIGUOUS):

In [6]: c_order.flags
Out[6]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [7]: f_order.flags
Out[7]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

And if you don't trust the flags, you can effectively view the memory order by looking at arr.ravel(order='K'). The order='K' is important. Otherwise, when you call arr.ravel() the output will be in C-order regardless of the memory layout of the array. order='K' uses the memory layout.

In [8]: c_order.ravel(order='K')
Out[8]: 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26])

In [9]: f_order.ravel(order='K')
Out[9]: 
array([ 0,  9, 18,  3, 12, 21,  6, 15, 24,  1, 10, 19,  4, 13, 22,  7, 16,
       25,  2, 11, 20,  5, 14, 23,  8, 17, 26])

The difference is actually represented (and stored) in the strides of the array. Notice that c_order's strides are (72, 24, 8), while f_order's strides are (8, 24, 72).

Just to prove that the indexing works the same way:

In [10]: c_order[0,1,2]
Out[10]: 5

In [11]: f_order[0,1,2]
Out[11]: 5

Reading and Writing

The main place where you'll run into problems with this is when you're reading from or writing to disk. Many file formats expect a particular ordering. I'm guessing that you're working with seismic data formats, and most of them (e.g. Geoprobe .vol's, and I think Petrel's volume format as well) essentially write a binary header and then a Fortran-ordered 3D array to disk.

With that in mind, I'll use a small seismic cube (snippet of some data from my dissertation) as an example.

Both of these are binary arrays of uint8s with a shape of 50x100x198. One is in C-order, while the other is in Fortran-order. c_order.dat f_order.dat

To read them in:

import numpy as np
shape = (50, 100, 198)

c_order = np.fromfile('c_order.dat', dtype=np.uint8).reshape(shape)
f_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape, order='F')

assert np.all(c_order == f_order)

Notice that the only difference is specifying the memory layout to reshape. The memory layout of the two arrays is still different (reshape doesn't make a copy), but they're treated identically at the python level.

Just to prove that the files really are written in a different order:

In [1]: np.fromfile('c_order.dat', dtype=np.uint8)[:10]
Out[1]: array([132, 142, 107, 204,  37,  37, 217,  37,  82,  60], dtype=uint8)

In [2]: np.fromfile('f_order.dat', dtype=np.uint8)[:10]
Out[2]: array([132, 129, 140, 138, 110,  88, 110, 124, 142, 139], dtype=uint8)

Let's visualize the result:

def plot(data):
    fig, axes = plt.subplots(ncols=3)
    for i, ax in enumerate(axes):
        slices = [slice(None), slice(None), slice(None)]
        slices[i] = data.shape[i] // 2
        ax.imshow(data[tuple(slices)].T, cmap='gray_r')
    return fig

plot(c_order).suptitle('C-ordered array')
plot(f_order).suptitle('F-ordered array')
plt.show()

enter image description hereenter image description here

Notice that we indexed them the same way, and they're displayed identically.

Common Mistakes with IO

First off, let's try reading in the Fortran-ordered file as if it were C-ordered and then take a look at the result (using the plot function above):

wrong_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape(shape)
plot(wrong_order)

enter image description here

Not so good!

You mentioned that you're having to use "reversed" indicies. This is probably because you fixed what happened in the figure above by doing something like (note the reversed shape!):

c_order = np.fromfile('c_order.dat', dtype=np.uint8).reshape([50,100,198])
rev_f_order = np.fromfile('f_order.dat', dtype=np.uint8).reshape([198,100,50])

Let's visualize what happens:

plot(c_order).suptitle('C-ordered array')
plot(rev_f_order).suptitle('Incorrectly read Fortran-ordered array')

enter image description hereenter image description here

Note that the image on the far right (the timeslice) of the first plot matches a transposed version of the image on the far left of the second.

Similarly, print rev_f_order[1,2,3] and print c_order[3,2,1] both yield 140, while indexing them the same way gives a different result.

Basically, this is where your reversed indices come from. Numpy thinks it's a C-ordered array with a different shape. Notice if we look at the flags, they're both C-contiguous in memory:

In [24]: rev_f_order.flags
Out[24]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

In [25]: c_order.flags
Out[25]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

This is because a fortran-ordered array is equivalent to a C-ordered array with the reverse shape.

Writing to Disk in Fortran-Order

There's an additional wrinkle when writing a numpy array to disk in Fortran-order.

Unless you specify otherwise, the array will be written in C-order regardless of its memory-layout! (There's a clear note about this in the documentation for ndarray.tofile, but it's a common gotcha. The opposite behavior would be incorrect, though, i.m.o.)

Therefore, regardless of the memory layout of an array, to write it to disk in Fortran order, you need to do:

arr.ravel(order='F').tofile('output.dat')

If you're writing it as ascii, the same applies. Use ravel(order='F') and then write out the 1-dimensional result.

like image 194
Joe Kington Avatar answered Sep 20 '22 17:09

Joe Kington