Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reading direct access binary file format in Python

Background:

A binary file is read on a Linux machine using the following Fortran code:

        parameter(nx=720, ny=360, nday=365)
c 
        dimension tmax(nx,ny,nday),nmax(nx,ny,nday)
        dimension tmin(nx,ny,nday),nmin(nx,ny,nday)
c 
        open(10,
     &file='FILE',
     &access='direct',recl=nx*ny*4)
c
        do k=1,nday
        read(10,rec=(k-1)*4+1)((tmax(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+2)((nmax(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+3)((tmin(i,j,k),i=1,nx),j=1,ny) 
        read(10,rec=(k-1)*4+4)((nmin(i,j,k),i=1,nx),j=1,ny) 
        end do

File Details:

options  little_endian
title global daily analysis (grid box mean, the grid shown is the center of the grid box)
undef -999.0
xdef 720 linear    0.25 0.50
ydef 360  linear -89.75 0.50
zdef 1 linear 1 1
tdef 365 linear 01jan2015 1dy
vars 4
tmax     1  00 daily maximum temperature (C)
nmax     1  00 number of reports for maximum temperature (C)
tmin     1  00 daily minimum temperature (C)
nmin     1  00 number of reports for minimum temperature (C)
ENDVARS

Attempts at a solution:

I am trying to parse this into an array in python using the following code (purposely leaving out two attributes):

with gzip.open("/FILE.gz", "rb") as infile:
     data = numpy.frombuffer(infile.read(), dtype=numpy.dtype('<f4'), count = -1)

while x <= len(data) / 4:
    tmax.append(data[(x-1)*4])
    tmin.append(data[(x-1)*4 + 2])
    x += 1

data_full = zip(tmax, tmin)

When testing some records, the data does not seem to line up with some sample records from the file when using Fortran. I have also tried dtype=numpy.float32 as well with no success. It seems as though I am reading the file in correctly in terms of number of observations though. I was also using struct before I learned the file was created with Fortran. That was not working

There are similar questions out here, some of which have answers that I have tried adapting with no luck.

UPDATE

I am a little bit closer after trying out this code:

#Define numpy variables and empty arrays
nx = 720 #number of lon
ny = 360 #number of lat
nday = 0 #iterate up to 364 (or 365 for leap year)   
tmax = numpy.empty([0], dtype='<f', order='F')
tmin = numpy.empty([0], dtype='<f', order='F')

#Parse the data into numpy arrays, shifting records as the date increments
while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*nday:(nx*ny)*(nday + 1)].reshape((nx,ny), order='F'))
    tmin = numpy.append(tmin, data[(nx*ny)*(nday + 2):(nx*ny)*(nday + 3)].reshape((nx,ny), order='F'))
    nday += 1  

I get the correct data for the first day, but for the second day I get all zeros, the third day the max is lower than the min, and so on.

like image 898
Mark P. Avatar asked Sep 26 '18 14:09

Mark P.


2 Answers

While the exact format of Fortran binary files is compiler dependent, in all cases I'm aware of direct access files (files opened with access='direct' as in this question) do not have any record markers between records. Each record is of a fixed size, as given by the recl= specifier in the OPEN statement. That is, the record N starts at offset (N - 1) * RECL bytes in the file.

One portability gotcha is that the unit of the recl= is in terms of file storage units. For most compilers, the file storage unit specifies the size in 8-bit octets (as recommended in recent versions of the Fortran standard), but for the Intel Fortran compiler, recl= is in units of 32 bits; there is a commandline option -assume byterecl which can be used to make Intel Fortran match most other compilers.

So in the example given here and assuming a 8-bit file storage unit, your recl would be 1036800 bytes.

Further, looking at the code, it seems to assume the arrays are of a 4-byte type (e.g. integer or single precision real). So if it's single precision real, and the file has been created in little endian, then the numpy dtype <f4 that you have used seems to be the correct choice.

Now, getting back to the Intel Fortran compiler gotcha, if the file has been created by ifort without -assume byterecl then the data you want will be in the first quarter of each record, with the rest being padding (all zeros or maybe even random data?). Then you'll have to do some extra gymnastics to extract the correct data in python and not the padding. It should be easy to check this by checking the size of the file, is it nx * ny * 4 * nday *4 or is it nx * ny * 4 * nday * 4 * 4 bytes?

like image 95
janneb Avatar answered Nov 13 '22 14:11

janneb


After the Update in my question, I realize I had an error with how I was looping. I of course spotted this about 10 minutes after issuing a bounty, aw well.

The error is with using the day to iterate through the records. This will not work as it iterates once per loop, not pushing the records far enough. Hence why some mins were higher than maxes. The new code is:

while nday < 365:
    tmax = numpy.append(tmax, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    tmin = numpy.append(tmin, data[(nx*ny)*rm:(nx*ny)*(rm + 1)].reshape((nx,ny), order='F'))
    rm = rm + 2
    nday += 1 

This used a Record Mover (or rm as I call it) to move the records the appropriate amount. That was all it needed.

like image 44
Mark P. Avatar answered Nov 13 '22 14:11

Mark P.