Improve speed of reading and converting from binary file?

Tags: performance, python, file-io

I know there have been questions before about file reading, binary data handling and integer conversion using struct, so I am here to ask about a piece of code that I think takes too long to run. The file being read is a multichannel data-sample recording (short integers), in which the channels' blocks of data are interleaved interval by interval (hence the nested for statements). The code is as follows:

# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack( "h", f.read(2))[0]
            for iteration in xrange(int(channel_content[channel]['nsamples']))])

With this code, I get 2.2 seconds per megabyte read on a dual-core machine with 2Mb RAM, and my files are typically 20+ MB, which makes for a very annoying delay (especially since another benchmark shareware program I am trying to mirror loads the file WAY faster).

What I would like to know:

Whether there is some violation of "good practice": badly arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
Whether this reading speed is normal, or normal for Python, and whether it should be higher.
Whether creating a compiled C++ extension would be likely to improve performance, and whether that would be a recommended approach.
(of course) Whether anyone can suggest a modification to this code, preferably based on previous experience with similar operations.

Thanks for reading

(I have already posted a few questions about this job of mine. I hope they are all conceptually unrelated, and I also hope I am not being too repetitive.)

Edit: channel_names is a list, so I made the correction suggested by @eumiro (remove typoed brackets)

Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I very gladly thank everyone who kindly answered.

Final Form after going with array.fromfile() once, and then alternately extending one array for each channel via slicing the big array:

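# needs "import os" and "from array import array"; f is already positioned past the header
fullsamples = array('h')
# samples left to read = (file size in bytes - bytes already consumed) / bytes per sample
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) // fullsamples.itemsize)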
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
                                                fullsamples[position:position+samples])
        position += samples

The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
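
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch. It is written for Python 3 (range instead of xrange), which is an assumption on my part since the code above is Python 2. The channel names, the (channel, samples per interval) layout and the helpers write_demo_file and read_channels are all made up for illustration, and the demo file has no header; a real file would need f.seek() past its header first.

```python
import os
import tempfile
from array import array


def write_demo_file(path, layout, n_intervals):
    """Write a synthetic recording: per interval, each channel's block in turn."""
    expected = {name: [] for name, _ in layout}
    value = 0
    with open(path, "wb") as f:
        for _ in range(n_intervals):
            for name, nsamples in layout:
                block = array("h", range(value, value + nsamples))
                block.tofile(f)
                expected[name].extend(block)
                value += nsamples
    return expected


def read_channels(path, layout, n_intervals):
    """Read every sample with one fromfile() call, then slice per channel."""
    samples = array("h")
    with open(path, "rb") as f:
        # Remaining bytes divided by bytes per sample gives the item count.
        remaining_bytes = os.path.getsize(path) - f.tell()
        samples.fromfile(f, remaining_bytes // samples.itemsize)
    channels = {name: array("h") for name, _ in layout}
    position = 0
    for _ in range(n_intervals):
        for name, nsamples in layout:
            channels[name].extend(samples[position:position + nsamples])
            position += nsamples
    return channels


layout = [("ch_a", 3), ("ch_b", 2)]  # hypothetical (channel, samples per interval)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    expected = write_demo_file(demo_path, layout, n_intervals=4)
    channels = read_channels(demo_path, layout, n_intervals=4)

print({name: list(data) for name, data in channels.items()})
```

The only disk access is the single fromfile() call; everything after that is slicing in memory, which is presumably where the speedup comes from.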

asked Apr 27 '11 12:04

heltonbiker

1 Answer

A single array fromfile call is definitely the fastest, but it won't work if the data series is interleaved with other value types.

In such cases, another big speed increase, which can be combined with the previous struct answers, is to precompile a struct.Struct object with the format for each chunk instead of calling the unpack function multiple times. From the docs:

Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.

So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:

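import struct

chunksize = 1000
structobj = struct.Struct("hf" * chunksize)  # format compiled once, reused for every chunk
while True:
    data = fileobj.read(structobj.size)
    if len(data) < structobj.size:
        break  # end of file, or a final partial chunk that this format cannot unpack
    chunkdata = structobj.unpack(data)

(Note that the example is only partial: it stops at the first short read, so a final chunk with fewer than chunksize records still needs to be unpacked with a smaller format.)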
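
A fuller sketch of the same idea, including the shorter last chunk, could look like the following. It assumes Python 3 and a little-endian, unpadded layout ("<hf", 6 bytes per record), which may not match your file; with the native "hf" format the struct module inserts alignment padding, so check which one your data uses. The function name read_records and the in-memory test data are made up for illustration.

```python
import io
import struct


def read_records(fileobj, chunksize=1000):
    """Return all (short, float) records, unpacking up to chunksize records per call."""
    record = struct.Struct("<hf")                  # a single record, no padding
    chunk = struct.Struct("<" + "hf" * chunksize)  # compiled once, reused for full chunks
    records = []
    while True:
        data = fileobj.read(chunk.size)
        if not data:
            break  # nothing left to read
        if len(data) == chunk.size:
            flat = chunk.unpack(data)
        else:
            # Final, shorter chunk: build a one-off format for what is left,
            # ignoring any trailing bytes that do not form a whole record.
            count = len(data) // record.size
            flat = struct.unpack("<" + "hf" * count, data[:count * record.size])
        # flat alternates short, float, short, float, ...; pair them up.
        records.extend(zip(flat[0::2], flat[1::2]))
    return records


# Build a small in-memory "file" of 2500 records so the tail case is exercised.
source = [(i, i * 0.5) for i in range(2500)]
payload = b"".join(struct.pack("<hf", h, x) for h, x in source)
result = read_records(io.BytesIO(payload), chunksize=1000)
print(len(result), result[:2], result[-1])
```

The one-off format for the tail is compiled only once per file, so it costs next to nothing compared with the per-chunk work.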

answered Oct 06 '22 15:10

Karim Bahgat
