I have a really large .npy file (previously saved with np.save) and I am loading it with:
np.load(open('file.npy'))
Is there any way to see the progress of the loading process? I know tqdm and some other libraries for monitoring progress, but I don't know how to use them for this problem.
Thank you!
numpy.load() loads arrays or pickled objects from .npy and .npz files into memory. (Pickling is the process by which Python objects are converted into streams of bytes so they can be stored in a file.) The .npy format is a binary format, so reading it is much faster than parsing plain text or CSV files.
np.load also accepts an mmap_mode argument ({None, 'r+', 'r', 'w+', 'c'}, optional) that controls how the file is memory-mapped: 'r' opens the existing file for reading only, 'r+' opens it for reading and writing, 'w+' creates or overwrites the file for reading and writing, and 'c' is copy-on-write (assignments affect data in memory, but changes are not saved to disk).
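For reference, here is a minimal sketch of opening an existing .npy file as a memory map (the filename is just a placeholder):
import numpy as np

# 'r' opens the existing .npy file read-only as a memory map;
# data is paged in from disk only when slices are actually accessed
arr = np.load('file.npy', mmap_mode='r')
print(arr.shape, arr.dtype)  # the header is read immediately, the data is not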
As far as I am aware, np.load does not provide any callbacks or hooks to monitor progress. However, there is a workaround which may work: np.load can open the file as a memory-mapped array, which means the data stays on disk and is read into memory only on demand. We can abuse this machinery to manually copy the data from the memory-mapped array into actual memory in a loop whose progress can be monitored.
Here is an example with a crude progress monitor:
import numpy as np
x = np.random.randn(8096, 4096)
np.save('file.npy', x)
blocksize = 1024 # tune this for performance/granularity
mmap = np.load('file.npy', mmap_mode='r')  # data stays on disk until accessed
try:
    y = np.empty_like(mmap)  # allocate the in-memory destination array
    n_blocks = int(np.ceil(mmap.shape[0] / blocksize))
    for b in range(n_blocks):
        print('progress: {}/{}'.format(b, n_blocks))  # use any progress indicator
        y[b*blocksize : (b+1)*blocksize] = mmap[b*blocksize : (b+1)*blocksize]
finally:
    del mmap  # make sure the file is closed again
assert np.all(y == x)
Plugging any progress-bar library into the loop should be straightforward.
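For example, here is a minimal sketch using tqdm, which the question mentions (assuming the same 'file.npy' and blocksize as above; I have not benchmarked this on very large files):
import numpy as np
from tqdm import tqdm  # pip install tqdm

blocksize = 1024  # rows copied per iteration

mmap = np.load('file.npy', mmap_mode='r')
try:
    y = np.empty_like(mmap)
    n_blocks = int(np.ceil(mmap.shape[0] / blocksize))
    # tqdm wraps the block index range and renders a progress bar
    for b in tqdm(range(n_blocks), desc='loading file.npy'):
        y[b*blocksize : (b+1)*blocksize] = mmap[b*blocksize : (b+1)*blocksize]
finally:
    del mmap  # release the memory map / close the file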
I was unable to test this with exceptionally large arrays due to memory constraints, so I can't really tell if this approach has any performance issues.