 

efficient numpy.fromfile on zipped files?

I have some large files (around 10 GB even gzipped) that contain a fixed-length ASCII header followed by, in principle, numpy recarrays of about 3 MB each, which we call "events". My first approach looked like this:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                 # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
event = np.fromfile(f, dtype=event_dtype, count=1)

However, this is not possible: np.fromfile needs a real file object, because it makes low-level calls on the file descriptor (I found a pretty old ticket about this: https://github.com/numpy/numpy/issues/1103).

So as I understand it, I have to do it like this:

s = f.read(event_dtype.itemsize)
event = np.fromstring(s, dtype=event_dtype, count=1)

And yes, it works! But isn't this awfully inefficient? Isn't the memory for s allocated and garbage-collected for every single event? On my laptop I reach something like 16 events/s, i.e. ~50 MB/s.

I wonder if anybody knows a smart way to allocate the memory once and then let numpy read directly into it.
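What I have in mind is roughly this (an untested sketch; process_event is just a placeholder, and I'm relying on GzipFile supporting readinto since it is an io.BufferedIOBase):

event = np.empty(1, dtype=event_dtype)   # allocate the event buffer once
while f.readinto(event.view(np.uint8)) == event.nbytes:
    process_event(event)                 # placeholder for the real per-event work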

Btw. I'm a physicist, so ... well still a newbie in this business.

Dominik Neise asked Apr 12 '13



1 Answer

@Bakuriu is probably right that this is a micro-optimization. Your bottleneck is almost certainly IO, and after that, decompression. Allocating the memory twice is unlikely to be significant.

However, if you wanted to avoid the extra memory allocation, you could use numpy.frombuffer to view the string as a numpy array.

This avoids duplicating memory (the string and the array share the same memory buffer), but the array will be read-only by default. The view can only be made writable if the underlying buffer is mutable; for an immutable string, make a copy of the array (or read into a mutable buffer such as a bytearray) if you need to write to it.

In your case, it would be as simple as replacing fromstring with frombuffer:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                 # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
s = f.read(event_dtype.itemsize)
event = np.frombuffer(s, dtype=event_dtype, count=1)

Just to prove that memory is not duplicated using this approach:

import numpy as np

# Use a mutable bytearray here; a frombuffer view of an immutable
# bytes/str object would be read-only.
x = bytearray(b"hello")
y = np.frombuffer(x, dtype=np.uint8)

# Prove that we're using the same memory: change y, then print x.
y[0] = 121  # 121 is the ASCII code for "y"
print(x)

This prints bytearray(b'yello') instead of b'hello': the array and the bytearray really do share memory.

Regardless of whether or not it's a significant optimization in this particular case, it's a useful approach to be aware of.
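If the per-call overhead still matters, the same trick extends to reading several events at a time (a sketch that goes beyond the original answer; the batch size of 64 events is arbitrary):

batch = f.read(event_dtype.itemsize * 64)
events = np.frombuffer(batch, dtype=event_dtype,
                       count=len(batch) // event_dtype.itemsize)
for event in events:
    pass  # process each event here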

Joe Kington answered Oct 26 '22