It was recently asked how to do a file slurp in python, and the accepted answer suggested something like:
with open('x.txt') as x: f = x.read()
How would I go about doing this to read the file in and convert the endian representation of the data?
For example, I have a 1GB binary file that's just a bunch of single precision floats packed as a big endian and I want to convert it to little endian and dump into a numpy array. Below is the function I wrote to accomplish this and some real code that calls it. I use struct.unpack
do the endian conversion and tried to speed everything up by using mmap
.
My question then is, am I using the slurp correctly with mmap
and struct.unpack
? Is there a cleaner, faster way to do this? Right now what I have works, but I'd really like to learn how to do this better.
Thanks in advance!
#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np
def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
"""
We need to read in the asf internal file and convert it into a numpy array.
It is stored as a single row, and is binary. Thenumber of lines (rows), samples (columns),
and channels all come from the .meta text file
Also, internal format files are packed big endian, but most systems use little endian, so we need
to make that conversion as well.
Memory mapping seemed to improve the ingestion speed a bit
"""
# memory-map the file, size 0 means whole file
# length = line_count * sample_count * arrayName.itemsize
print "\tMemory Mapping..."
with open(fileName, "rb") as f:
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
map.seek(channelNo*line_count*sample_count*arrayName.itemsize)
for i in xrange(line_count*sample_count):
arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize) )[0]
# Same method as above, just more verbose for the maintenance programmer.
# for i in xrange(line_count*sample_count): #row
# be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
# le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
# arrayName[0, i]= le_float
map.close()
return arrayName
print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')
print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)
print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
Slightly modified @Alex Martelli's answer:
arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of endianess of the machine
with open(fileName, "rb") as f:
arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)
Pretty hard to beat for speed AND conciseness;-). For byteswap see here (the True
argument means, "do it in place"); for fromfile see here.
This works as is on little-endian machines (since the data are big-endian, the byteswap is needed). You can test if that is the case to do the byteswap conditionally, change the last line from an unconditional call to byteswap into, for example:
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
arrayName.byteswap(True)
i.e., a call to byteswap conditional on a test of little-endianness.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With