
Load .npy file with np.load progress bar

I have a really large .npy file (previously saved with np.save) and I am loading it with:

np.load(open('file.npy', 'rb'))

Is there any way to see the progress of the loading process? I know tqdm and some other libraries for monitoring progress, but I don't know how to use them for this problem.

Thank you!

serchu asked Mar 09 '17 09:03



1 Answer

As far as I am aware, np.load does not provide any callbacks or hooks to monitor progress. However, there is a workaround that may help: np.load can open the file as a memory-mapped file (mmap_mode='r'), which means the data stays on disk and is read into memory only on demand. We can abuse this machinery to copy the data manually from the memory-mapped file into actual memory, using a loop whose progress can be monitored.

Here is an example with a crude progress monitor:

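import numpy as np

# Create a test file (8096 x 4096 float64 values, about 265 MB).
x = np.random.randn(8096, 4096)
np.save('file.npy', x)

blocksize = 1024  # rows copied per step; tune this for performance/granularity

# Memory-map the file: nothing is read until the data is accessed.
# Opened outside the try block so a failed load does not reach `del mmap`.
mmap = np.load('file.npy', mmap_mode='r')
try:
    y = np.empty(mmap.shape, dtype=mmap.dtype)  # plain in-memory destination
    n_blocks = int(np.ceil(mmap.shape[0] / blocksize))
    for b in range(n_blocks):
        # Slicing the memory map reads this block from disk into y.
        y[b*blocksize : (b+1) * blocksize] = mmap[b*blocksize : (b+1) * blocksize]
        print('progress: {}/{}'.format(b + 1, n_blocks))  # use any progress indicator
finally:
    del mmap  # drop the reference so the mapped file can be closed

assert np.all(y == x)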

Plugging any progress-bar library into the loop should be straightforward.
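As a rough sketch of what that could look like, the loop can be wrapped in a function that reports progress through a callback. The function name load_with_progress and its callback parameter are made up for this illustration and are not part of NumPy. The sketch assumes the array has at least one dimension and is not an object array (object arrays cannot be memory-mapped).

```python
import os
import tempfile

import numpy as np


def load_with_progress(path, blocksize=1024, callback=None):
    """Copy a .npy file into memory block by block.

    `callback` is a hypothetical hook for this sketch: it is called after
    each block with the number of rows that were just copied.
    """
    source = np.load(path, mmap_mode='r')  # data stays on disk for now
    try:
        result = np.empty(source.shape, dtype=source.dtype)
        n_rows = source.shape[0]
        for start in range(0, n_rows, blocksize):
            stop = min(start + blocksize, n_rows)
            result[start:stop] = source[start:stop]  # reads this block from disk
            if callback is not None:
                callback(stop - start)
    finally:
        del source  # drop the reference to the memory map
    return result


# Small demo with a temporary file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.npy')
demo = np.arange(10000, dtype=np.float64).reshape(2500, 4)
np.save(demo_path, demo)

rows_seen = []
loaded = load_with_progress(demo_path, blocksize=1000, callback=rows_seen.append)
print(rows_seen)  # rows copied per block
```

With tqdm, the callback could be the bar's update method, for example by creating the bar with tqdm(total=number_of_rows) and passing callback=bar.update. The row count would have to be known beforehand, for instance from the shape of the memory-mapped array.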

I was unable to test this with exceptionally large arrays due to memory constraints, so I can't really tell if this approach has any performance issues.
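A rough way to check the overhead on your own machine is to time a plain np.load against the block-wise copy on the same file. This is only a sketch: the file name, array size and block size below are arbitrary choices, and because the file has just been written it is probably still in the operating system's cache, so the numbers say little about reading a huge file from a cold disk.

```python
import os
import tempfile
import time

import numpy as np

# Arbitrary test data for this sketch (4096 x 512 float64, 16 MiB).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'timing.npy')
data = np.random.randn(4096, 512)
np.save(path, data)

blocksize = 256  # rows per copy step

# Plain load: the whole file is read in one call.
t0 = time.perf_counter()
direct = np.load(path)
t_direct = time.perf_counter() - t0

# Block-wise copy from a memory map, as in the answer above.
t0 = time.perf_counter()
source = np.load(path, mmap_mode='r')
blockwise = np.empty(source.shape, dtype=source.dtype)
for start in range(0, source.shape[0], blocksize):
    blockwise[start:start + blocksize] = source[start:start + blocksize]
del source
t_blockwise = time.perf_counter() - t0

print('np.load: {:.4f} s, block-wise: {:.4f} s'.format(t_direct, t_blockwise))
```

Repeating the measurement with different blocksize values should show how much the granularity of the progress updates costs.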

MB-F answered Sep 21 '22 10:09