
How does numpy handle mmap's over npz files?

Tags: gzip, numpy, mmap

I have a case where I would like to open a compressed numpy file using mmap mode, but can't seem to find any documentation about how it will work under the covers. For example, will it decompress the archive in memory and then mmap it? Will it decompress on the fly?

The documentation is absent for that configuration.

Refefer asked Mar 16 '15 15:03




1 Answer

The short answer, based on reading the code, is that archiving and compression, whether done with np.savez or gzip, are not compatible with loading files in mmap_mode. It's not a question of how it is done, but of whether it can be done at all.

Relevant bits in the np.load function:

elif isinstance(file, gzip.GzipFile):
    fid = seek_gzip_factory(file)
...
    if magic.startswith(_ZIP_PREFIX):
        # zip-file (assume .npz)
        # Transfer file ownership to NpzFile
        tmp = own_fid 
        own_fid = False
        return NpzFile(fid, own_fid=tmp)
...
    if mmap_mode:
        return format.open_memmap(file, mode=mmap_mode)

Look at np.lib.npyio.NpzFile. An npz file is a ZIP archive of .npy files. np.load returns a dictionary-like object and only loads the individual variables (arrays) when you access them (e.g. obj[key]). There's no provision in its code for opening those individual files in mmap_mode.
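A quick way to see this is to save a small archive and load it back with mmap_mode set. The sketch below is only an illustration, not part of the numpy source; the file name demo.npz and the helper name are made up for the example.

```python
import os
import tempfile

import numpy as np


def load_npz_member_with_mmap(directory):
    """Save an .npz, reload it with mmap_mode='r', and report what comes back."""
    path = os.path.join(directory, "demo.npz")  # hypothetical file name
    np.savez(path, a=np.arange(10))

    # mmap_mode is accepted, but ZIP input is handed to NpzFile before
    # the mmap branch of np.load is ever reached.
    with np.load(path, mmap_mode="r") as archive:
        container_name = type(archive).__name__
        member = archive["a"]  # read from the archive when accessed

    return container_name, isinstance(member, np.memmap), member


with tempfile.TemporaryDirectory() as tmp:
    container_name, is_memmap, member = load_npz_member_with_mmap(tmp)

print(container_name, is_memmap)
```

The member comes back as an ordinary in-memory array rather than an np.memmap, which matches the reading of the code above.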

So a file created with np.savez cannot be accessed as a memory map. ZIP archiving and compression are not the same as the gzip handling addressed earlier in np.load.

But what of a single array saved with np.save and then gzipped? Note that format.open_memmap is called with file, not fid (which might be a gzip file).
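For comparison, a gzipped .npy can be read without mmap_mode by handing np.load an open gzip file object. This is a minimal sketch under that assumption; the file name demo.npy.gz and the helper name are hypothetical. The result is a regular array that has been fully decompressed into memory.

```python
import gzip
import os
import tempfile

import numpy as np


def roundtrip_gzipped_npy(directory):
    """Write an array as a gzipped .npy and read it back through gzip."""
    path = os.path.join(directory, "demo.npy.gz")  # hypothetical file name

    # np.save accepts a file-like object, so the .npy bytes are gzipped on write.
    with gzip.open(path, "wb") as f:
        np.save(f, np.arange(10))

    # Reading goes through the gzip stream; no mmap_mode is involved.
    with gzip.open(path, "rb") as f:
        return np.load(f)


with tempfile.TemporaryDirectory() as tmp:
    arr = roundtrip_gzipped_npy(tmp)

gz_is_memmap = isinstance(arr, np.memmap)
print(type(arr).__name__, gz_is_memmap)
```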

For more detail, see open_memmap in np.lib.npyio.format. Its first test is that file must be a filename string, not an already open file object (fid). It ends up delegating the work to np.memmap, and I don't see any provision in that function for gzip.
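If memory mapping is the goal, one possible workaround (my suggestion, not something np.load does for you) is to extract the member to a plain .npy file on disk first and map that. The sketch assumes the archive was written by np.savez or np.savez_compressed, which store each array under the name key + ".npy"; the helper name and demo.npz are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile

import numpy as np


def mmap_member_from_npz(npz_path, key, out_dir):
    """Extract one array from an .npz to a plain .npy file and memory-map it."""
    with zipfile.ZipFile(npz_path) as archive:
        # Decompression happens here, once, into a regular file on disk.
        npy_path = archive.extract(key + ".npy", path=out_dir)
    return np.load(npy_path, mmap_mode="r")


tmp = tempfile.mkdtemp()
npz_path = os.path.join(tmp, "demo.npz")  # hypothetical file name
np.savez_compressed(npz_path, a=np.arange(10))

mapped = mmap_member_from_npz(npz_path, "a", tmp)
mapped_is_memmap = isinstance(mapped, np.memmap)
mapped_values = mapped.tolist()
print(type(mapped).__name__, mapped_values[:3])

# Release the mapping before deleting the temporary files.
del mapped
shutil.rmtree(tmp)
```

The cost is disk space for the uncompressed copy, but after extraction only the parts of the array you touch are read into memory.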

hpaulj answered Oct 24 '22 13:10