
How can I efficiently read and write files that are too large to fit in memory?

I am trying to calculate the cosine similarity of 100,000 vectors, and each of these vectors has 200,000 dimensions.

From reading other questions I know that memmap, PyTables and h5py are my best bets for handling this kind of data, and I am currently working with two memmaps: one for reading the vectors, the other for storing the matrix of cosine similarities.

Here is my code:

import numpy as np
import scipy.spatial.distance as dist

xdim = 200000
ydim = 100000

wmat = np.memmap('inputfile', dtype = 'd', mode = 'r', shape = (xdim,ydim))
dmat = np.memmap('outputfile', dtype = 'd', mode = 'readwrite', shape = (ydim,ydim))

for i in np.arange(ydim):
    for j in np.arange(i+1,ydim):
        dmat[i,j] = dist.cosine(wmat[:,i],wmat[:,j])
        dmat.flush()

Currently, htop reports that I am using 224G of VIRT memory and 91.2G of RES memory, which is climbing steadily. It seems to me as if, by the end of the process, the entire output matrix will be stored in memory, which is something I'm trying to avoid.

QUESTION: Is this a correct usage of memmaps? Am I writing to the output file in a memory-efficient manner, by which I mean that only the necessary parts of the input and output files (i.e. dmat[i,j] and wmat[:,i] / wmat[:,j]) are held in memory?

If not, what did I do wrong, and how can I fix this?

Thanks for any advice you may have!

EDIT: I just realized that htop is reporting total system memory usage at 12G, so it seems it is working after all... is there anyone out there who can enlighten me? RES is now at 111G...

EDIT2: The memmap is created from a 1D array consisting of lots and lots of small decimal values close to 0, which is then reshaped to the desired dimensions. The memmap looks like this:

memmap([[  9.83721223e-03,   4.42584107e-02,   9.85033578e-03, ...,
     -2.30691545e-07,  -1.65070799e-07,   5.99395837e-08],
   [  2.96711345e-04,  -3.84307391e-04,   4.92968462e-07, ...,
     -3.41317722e-08,   1.27959347e-09,   4.46846438e-08],
   [  1.64766260e-03,  -1.47337747e-05,   7.43660202e-07, ...,
      7.50395136e-08,  -2.51943163e-09,   1.25393555e-07],
   ..., 
   [ -1.88709000e-04,  -4.29454722e-06,   2.39720287e-08, ...,
     -1.53058717e-08,   4.48678211e-03,   2.48127260e-07],
   [ -3.34207882e-04,  -4.60275148e-05,   3.36992876e-07, ...,
     -2.30274532e-07,   2.51437794e-09,   1.25837564e-01],
   [  9.24923862e-04,  -1.59552854e-03,   2.68354822e-07, ...,
     -1.08862665e-05,   1.71283316e-07,   5.66851420e-01]])
asked Aug 23 '15 by Jojanzing

2 Answers

In terms of memory usage, there's nothing particularly wrong with what you're doing at the moment. Memory-mapped arrays are handled at the level of the OS: data to be written is usually held in a temporary buffer and only committed to disk when the OS deems it necessary. Your OS should never allow you to run out of physical memory before flushing the write buffer.

I'd advise against calling flush on every iteration, since this defeats the purpose of letting your OS decide when to write to disk in order to maximise efficiency. At the moment you're writing only a single float value at a time.
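As a rough sketch of what that might look like (assuming wmat, dmat and dist are set up exactly as in your code), you could flush once per row, or only once at the very end:

for i in np.arange(ydim):
    for j in np.arange(i + 1, ydim):
        dmat[i, j] = dist.cosine(wmat[:, i], wmat[:, j])
    # optionally flush once per row instead of once per element
    # dmat.flush()

dmat.flush()   # one final flush after the loop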


In terms of IO and CPU efficiency, operating on a single line at a time is almost certainly suboptimal. Reads and writes are generally quicker for large, contiguous blocks of data, and likewise your calculation will probably be much faster if you can process many lines at once using vectorization. The general rule of thumb is to process as big a chunk of your array as will fit in memory (including any intermediate arrays that are created during your computation).

Processing memmapped arrays in appropriately-sized chunks can speed up operations enormously.
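For illustration only, here is a rough sketch of what chunked, vectorized processing might look like for your particular problem. The chunk size is a placeholder to be tuned to your RAM, and note that the diagonal blocks also fill the diagonal and a few entries below it, unlike your strict upper-triangle loop:

import numpy as np

xdim, ydim = 200000, 100000
chunk = 500   # placeholder; each block costs roughly xdim * chunk * 8 bytes of RAM

wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
dmat = np.memmap('outputfile', dtype='d', mode='readwrite', shape=(ydim, ydim))

for i0 in range(0, ydim, chunk):
    i1 = min(i0 + chunk, ydim)
    a = np.array(wmat[:, i0:i1])          # copy one block of columns into RAM
    a /= np.linalg.norm(a, axis=0)        # normalise each column once
    for j0 in range(i0, ydim, chunk):
        j1 = min(j0 + chunk, ydim)
        b = np.array(wmat[:, j0:j1])
        b /= np.linalg.norm(b, axis=0)
        # cosine distance = 1 - cosine similarity, computed for the whole
        # block pair with a single matrix product
        dmat[i0:i1, j0:j1] = 1.0 - a.T.dot(b)
    dmat.flush()                          # flush once per block row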

Another thing that can make a huge difference is the memory layout of your input and output arrays. By default, np.memmap gives you a C-contiguous (row-major) array. Accessing wmat by column will therefore be very inefficient, since you're addressing non-adjacent locations on disk. You would be much better off if wmat were F-contiguous (column-major) on disk, or if you were accessing it by row.
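As an illustration (not something your current files support directly), one way to get contiguous reads is to write a transposed copy of the input once and then work row-wise. 'inputfile_T' is a hypothetical file name, and the copy needs as much disk space again as the original:

import numpy as np

xdim, ydim = 200000, 100000
chunk = 1000   # placeholder block size for the one-off copy

# One-off cost: write a transposed copy so each vector occupies one
# contiguous row on disk.
wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(xdim, ydim))
wmat_T = np.memmap('inputfile_T', dtype='d', mode='w+', shape=(ydim, xdim))
for j0 in range(0, ydim, chunk):
    j1 = min(j0 + chunk, ydim)
    wmat_T[j0:j1, :] = wmat[:, j0:j1].T
wmat_T.flush()

# Now wmat_T[i, :] is a single contiguous read of xdim * 8 bytes, whereas
# wmat[:, i] on the original layout gathers one value from each of the
# xdim rows, i.e. xdim widely separated locations on disk.
vec_i = wmat_T[0, :]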

The same general advice applies to using HDF5 instead of memmaps, although bear in mind that with HDF5 you will have to handle all the memory management yourself.
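If you do switch to HDF5, a minimal h5py sketch of a chunked output dataset might look like this (the file name, dataset name and chunk shape are illustrative only):

import h5py

ydim = 100000

# Chunked HDF5 dataset for the output; a chunk shape of (1, ydim) matches a
# row-at-a-time (or row-block) write pattern.
with h5py.File('distances.h5', 'w') as f:
    dmat = f.create_dataset('dmat', shape=(ydim, ydim), dtype='f8',
                            chunks=(1, ydim))
    # ... fill dmat block by block, exactly as in the memmap sketch above ...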

answered by ali_m


Memory maps are exactly what the name says: mappings of (virtual) disk sectors into memory pages. The memory is managed by the operating system on demand. If there is enough memory, the system keeps parts of the files in memory, perhaps filling up the whole memory; if there is not enough left, the system may discard pages read from the file or swap them out to swap space. Normally you can rely on the OS being as efficient as possible.
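As a rough way to see this in action (psutil is an extra dependency, and 'inputfile' is assumed to be the asker's existing file), resident memory only grows with the pages you actually touch:

import numpy as np
import psutil   # used here only to read the resident set size

proc = psutil.Process()
print(proc.memory_info().rss)            # baseline RSS

wmat = np.memmap('inputfile', dtype='d', mode='r', shape=(200000, 100000))
print(proc.memory_info().rss)            # opening the map costs almost nothing

row_sum = np.array(wmat[0, :]).sum()     # touch one ~800 KB row
print(proc.memory_info().rss)            # grows by roughly that much,
                                         # not by the ~160 GB file size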

answered by Daniel