Using memmap files for batch processing

Question

I have a huge dataset on which I wish to PCA. I am limited by RAM and computational efficency of PCA. Therefore, I shifted to using Iterative PCA.

Dataset Size-(140000,3504)

The documentation states that This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory.

This is really good, but unsure on how take advantage of this.

I tried load one memmap hoping it would access it in chunks but my RAM blew. My code below ends up using a lot of RAM:

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

When I say "my RAM blew", the Traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transfo
rm
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py",
 line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in
 check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

How can I improve this without comprising on accuracy by reducing the batch-size?

My ideas to diagnose:

I looked at the sklearn source code and in the fit() function Source Code I can see the following. This makes sense to me, but I am still unsure about what is wrong in my case.

for batch in gen_batches(n_samples, self.batch_size_):
        self.partial_fit(X[batch])
return self

Edit: Worst case scenario I will have to write my own code for iterativePCA which batch processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of already present hack.

Edit2: If somehow I could delete a batch of processed memmap file. It would make much sense.

Edit3: Ideally if IncrementalPCA.fit() is just using batches it should not crash my RAM. Posting the whole code, just to make sure I am not making a mistake in flushing the memmap completely to disk before.

temp_train_data=X_train[1000:]
temp_labels=y[1000:] 
out = np.empty((200001, 3504), np.int64)
for index,row in enumerate(temp_train_data):
    actual_index=index+1000
    data=X_train[actual_index-1000:actual_index+1].ravel()
    __,cd_i=pywt.dwt(data,'haar')
    out[index] = cd_i
out.flush()
pca_obj=IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I note out.flush doesn't free my memory. Is there a way to using del out to free my memory completely and then someone pass a pointer of the file to IncrementalPCA.fit().

J Richard Snape · Accepted Answer

You have hit a problem with sklearn in a 32 bit environment. I presume you are using np.float16 because you're in a 32 bit environment and you need that to allow you to create the memmap object without numpy thowing errors.

In a 64 bit environment (tested with Python3.3 64 bit on Windows), your code just works out of the box. So, if you have a 64 bit computer available - install python 64-bit and numpy, scipy, scikit-learn 64 bit and you are good to go.

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that within the library, if your type is float16, a copy of the array to memory is triggered. The detail of this is below.

So, I hope you have access to a 64 bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch process it, a rather larger task...

N.B It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches code that you found is never reached.

Detailed diagnosis:

The actual error generated by OP code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transfo
rm
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py",
 line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in
 check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array(code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this to np.array(), this is ignored (as per the docs), to satisfy the requirement that the dtype is different; therefore a copy is made by np.array.

This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.

The same copying problem occurs when the code calls linalg.svd() from the main scipy code and this time the error occurs during a call to gesdd(), a wrapped native function from lapack. Thus, I do not think there is a way to patch this (at least not an easy way - it is at minimum alteration of code in core scipy).

ogrisel · Answer

Does the following alone trigger the crash?

X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
                         mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)

If not then you can use that model to transform (project your data iteratively) to a smaller data using batches:

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, self.batch_size_):
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected

I have not tested that code but I hope that you get the idea.

Using memmap files for batch processing

Tags:

python

numpy

scikit-learn

pca

Abhishek Bhatia

2 Answers

Detailed diagnosis:

J Richard Snape

ogrisel

Recent Activity

Donate For Us

Using memmap files for batch processing

Tags:

python

numpy

scikit-learn

pca

Abhishek Bhatia

2 Answers

Detailed diagnosis:

J Richard Snape

ogrisel

Related questions

Recent Activity

Donate For Us