
Reading binary data in python

Firstly, before this question gets marked as a duplicate: I'm aware others have asked similar questions, but there doesn't seem to be a clear explanation. I'm trying to read a binary file into a 2D array (the format is well documented here: http://nsidc.org/data/docs/daac/nsidc0051_gsfc_seaice.gd.html).

The header is a 300 byte array.

So far, I have:

import struct

with open("nt_197912_n07_v1.1_n.bin",mode='rb') as file:
    filecontent = file.read()

x = struct.unpack("iiii",filecontent[:300])

This throws an error about the string argument length.

J W asked Feb 21 '17 11:02



1 Answer

Reading the Data (Short Answer)

Once you have determined the size of the grid (n_rows x n_cols = 448 x 304) from your header (see below), you can read the data using numpy.frombuffer.

import numpy as np

#...

#Get data from Numpy buffer
dt = np.dtype(('>u1', (n_rows, n_cols)))
x = np.frombuffer(filecontent[300:], dt) #we know the data starts from idx 300 onwards

#Remove unnecessary dimension that numpy gave us
x = x[0,:,:]

The '>u1' specifies the format of the data: unsigned integers, 1 byte each. The > marks them as big-endian, although byte order has no effect on single-byte values.
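As a cross-check on the approach above, here is a minimal Python 3 sketch of the same read using a plain uint8 dtype together with the offset and count arguments of numpy.frombuffer, followed by a reshape. The buffer is synthetic stand-in data (an assumption made so that the example runs without the real file); the 300-byte header and the 448x304 grid size come from the text above, and HEADER_SIZE, payload and grid are illustrative names.

```python
import numpy as np

HEADER_SIZE = 300          # header length in bytes, as stated in the question
n_rows, n_cols = 448, 304  # grid size parsed from the header

# Synthetic stand-in for the file contents (assumption: not real data):
# a zero-filled header followed by one byte per grid cell.
payload = (np.arange(n_rows * n_cols) % 256).astype(np.uint8)
filecontent = bytes(HEADER_SIZE) + payload.tobytes()

# Skip the header with offset=, read one unsigned byte per cell,
# then reshape the flat array into rows x columns.
grid = np.frombuffer(
    filecontent, dtype=np.uint8, offset=HEADER_SIZE, count=n_rows * n_cols
).reshape(n_rows, n_cols)

print(grid.shape, grid.dtype)
```

With the real file, filecontent would be the bytes read in the question, and no extra dimension needs to be removed afterwards.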

Plotting this with matplotlib.pyplot

import matplotlib.pyplot as plt

#...

plt.imshow(x, extent=[0,3,-3,3], aspect="auto")
plt.show()

The extent= option specifies the axis values. You can change these to lat/lon, for example (parsed from your header).

Output (plot image not preserved)

Explanation of Error from .unpack()

From the docs for struct.unpack(fmt, string):

The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt))

You can determine the size specified in the format string (fmt) by looking at the Format Characters section.

Your fmt in struct.unpack("iiii",filecontent[:300]) specifies 4 int types (you can also write 4i instead of iiii for brevity), each of which has size 4, so it requires a string of length 16.

Your string (filecontent[:300]) is of length 300, whilst your fmt is asking for a string of length 16, hence the error.
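The mismatch can be checked directly with struct.calcsize. This is a small Python 3 sketch, where a zero-filled bytes object stands in for filecontent[:300] (an assumption, so that it runs without the data file):

```python
import struct

fmt = "iiii"
header = bytes(300)  # stand-in for filecontent[:300] (assumption: all zeros)

# Number of bytes the format string expects (16 on typical platforms)
needed = struct.calcsize(fmt)
print(needed, len(header))

# Passing 300 bytes to a format that expects `needed` bytes raises struct.error
try:
    struct.unpack(fmt, header)
except struct.error as exc:
    print("unpack failed:", exc)

# Passing exactly calcsize(fmt) bytes works
values = struct.unpack(fmt, header[:needed])
print(values)
```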

Example Usage of .unpack()

As an example, going by the document you supplied, I extracted the first 21*6 bytes, which it describes as:

a 21-element array of 6-byte character strings that contain information such as polar stereographic grid characteristics

With:

x = struct.unpack("6s"*21, filecontent[:126])

This returns a tuple of 21 elements. Note the whitespace padding in some elements to meet the 6-byte requirement.

>>> print x
('00255\x00', '  304\x00', '  448\x00', '1.799\x00', '39.43\x00', '45.00\x00',
 '558.4\x00', '154.0\x00', '234.0\x00', ' SMMR\x00', '07 cn\x00', '  336\x00',
 ' 0000\x00', ' 0034\x00', '  364\x00', ' 0000\x00', ' 0046\x00', ' 1979\x00',
 '  336\x00', '  000\x00', '00250\x00')

Notes:

  • The first argument, fmt, is "6s"*21: the string 6s repeated 21 times. Each 6s format represents one 6-byte string (see below), which matches the format specified in your document.
  • The number 126 in filecontent[:126] is calculated as 6*21 = 126.
  • Note that for the s (string) specifier, the preceding number does not mean to repeat the format character 6 times (as it would normally for other format characters). Instead, it specifies the size of the string. s represents a 1-byte string, whilst 6s represents a 6-byte string.
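The difference between the count in front of s and the count in front of other format characters can be seen in a short Python 3 sketch. The sample bytes are the second field from the output above; note that Python 3 returns bytes objects rather than str, so the padding is stripped and the value decoded before conversion.

```python
import struct

field = b"  304\x00"  # one 6-byte header field, taken from the output above

# "6s" is ONE field holding a 6-byte string ...
one_field = struct.unpack("6s", field)
print(one_field)

# ... whereas "6B" is SIX separate one-byte unsigned integers.
six_fields = struct.unpack("6B", field)
print(len(six_fields))

# Strip the space/NUL padding and decode before converting to int
n_cols = int(one_field[0].strip(b" \x00").decode("ascii"))
print(n_cols)
```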

More Extensive Solution for Header Reading (Long)

Because the layout of the binary header must be specified by hand, doing so in source code can be tedious. Consider describing it in a configuration file (such as an .ini file) instead.

The function below reads the header and stores it in a dictionary, with the structure taken from an .ini file.

# use configparser for Python 3.x
import ConfigParser
import struct

def read_header(data, config_file):
    """
    Read binary data whose structure is specified by an INI file
    """

    with open(config_file) as fd:

        #Init the config class
        conf = ConfigParser.ConfigParser()
        conf.readfp(fd)

        #preallocate dictionary to store data
        header = {}

        #Iterate over the key-value pairs under the
        #'Structure' section
        for key in conf.options('structure'):

            #determine the string properties
            start_idx, end_idx = [int(x) for x in conf.get('structure', key).split(',')]
            start_idx -= 1 #remember python is zero indexed!
            strLength = end_idx - start_idx

            #Get the data
            header[key] = struct.unpack("%is" % strLength, data[start_idx:end_idx])

            #Format the data
            header[key] = [x.strip() for x in header[key]]
            header[key] = [x.replace('\x00', '') for x in header[key]]

        #Unmap from list-type
        #use .items() for Python 3x
        header = {k:v[0] for k, v in header.iteritems()}

    return header

An example .ini file is shown below. Each key is the name to use when storing the data, and each value is a comma-separated pair: the first number is the starting index and the second is the ending index (both 1-based and inclusive). These values were taken from Table 1 in your document.

[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
latitude_enclosed: 25, 30

This function can be used as follows:

header = read_header(filecontent, 'headerStructure.ini')
n_cols = int(header['n_cols'])
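The function above is written for Python 2 (ConfigParser, readfp, iteritems, and str-based strip/replace). A rough Python 3 equivalent might look like the sketch below. It is a variation rather than a drop-in port: the name read_header_py3 is hypothetical, it takes the INI text as a string instead of a file path so the example is self-contained, and it slices the bytes directly instead of calling struct.unpack, which gives the same result for a single "Ns" field. The sample header is built from the first three fields shown earlier, padded with NUL bytes.

```python
import configparser

INI_TEXT = """
[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
"""

def read_header_py3(data, ini_text):
    """Slice fixed-width header fields out of `data` (bytes), using the
    1-based, inclusive (start, end) positions in the [structure] section."""
    conf = configparser.ConfigParser()
    conf.read_string(ini_text)

    header = {}
    for key, value in conf.items("structure"):
        start, end = (int(part) for part in value.split(","))
        raw = data[start - 1:end]  # convert to a zero-based slice
        # Drop the space/NUL padding and turn the bytes into text
        header[key] = raw.strip(b" \x00").decode("ascii")
    return header

# Stand-in header (assumption): three known fields padded to 300 bytes
sample = (b"00255\x00" + b"  304\x00" + b"  448\x00").ljust(300, b"\x00")
header = read_header_py3(sample, INI_TEXT)
print(header)
```

With the real file, the first argument would be filecontent and the INI text could be read from headerStructure.ini.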
Jamie Phan answered Sep 23 '22 02:09