 

efficient numpy.fromfile on zipped files?

I have some large files (around 10 GB even gzipped) that contain a fixed-length ASCII header followed by, in principle, numpy recarrays of about 3 MB each, which we call "events". My first approach looked like this:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                 # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
event = np.fromfile(f, dtype=event_dtype, count=1)

However, this is not possible: np.fromfile needs a real file object, because it makes low-level calls on the file descriptor (I found a pretty old ticket about this: https://github.com/numpy/numpy/issues/1103).

So as I understand it, I have to do it like this:

s = f.read(event_dtype.itemsize)
event = np.fromstring(s, dtype=event_dtype, count=1)

And yes, it works! But isn't this awfully inefficient? Isn't the memory for s allocated and garbage-collected for every single event? On my laptop I reach something like 16 events/s, i.e. ~50 MB/s.

I wonder if anybody knows a smart way to allocate the memory once and then let numpy read directly into it.
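What I have in mind is roughly this (an untested sketch; process_event is just a placeholder, and I'm relying on GzipFile supporting readinto since it is an io.BufferedIOBase):

event = np.empty(1, dtype=event_dtype)   # allocate the event buffer once
while f.readinto(event.view(np.uint8)) == event.nbytes:
    process_event(event)                 # placeholder for the real per-event work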

Btw. I'm a physicist, so ... well still a newbie in this business.

Dominik Neise asked Apr 12 '13



1 Answer

@Bakuriu is probably right that this is a micro-optimization. Your bottleneck is almost certainly IO, and after that, decompression. Allocating the memory twice is unlikely to be significant.

However, if you wanted to avoid the extra memory allocation, you could use numpy.frombuffer to view the string as a numpy array.

This avoids duplicating memory (the string and the array share the same memory buffer), but the array will be read-only by default. The view can only be made writable if the underlying buffer is mutable; for an immutable string, make a copy of the array (or read into a mutable buffer such as a bytearray) if you need to write to it.

In your case, it would be as simple as replacing fromstring with frombuffer:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                 # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
s = f.read(event_dtype.itemsize)
event = np.frombuffer(s, dtype=event_dtype, count=1)

Just to prove that memory is not duplicated using this approach:

import numpy as np

# Use a mutable bytearray here; a frombuffer view of an immutable
# bytes/str object would be read-only.
x = bytearray(b"hello")
y = np.frombuffer(x, dtype=np.uint8)

# Prove that we're using the same memory: change y, then print x.
y[0] = 121  # 121 is the ASCII code for "y"
print(x)

This prints bytearray(b'yello') instead of b'hello': the array and the bytearray really do share memory.

Regardless of whether or not it's a significant optimization in this particular case, it's a useful approach to be aware of.
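If the per-call overhead still matters, the same trick extends to reading several events at a time (a sketch that goes beyond the original answer; the batch size of 64 events is arbitrary):

batch = f.read(event_dtype.itemsize * 64)
events = np.frombuffer(batch, dtype=event_dtype,
                       count=len(batch) // event_dtype.itemsize)
for event in events:
    pass  # process each event here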

Joe Kington answered Oct 26 '22