Incremental PCA on big data

I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an HDF5 file as a dataset of shape ~(1000000, 1000), so I have about one billion float32 values. I thought IncrementalPCA would load the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the HDF5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for help

KrawallKurt asked Jul 15 '15


1 Answer

Your program is probably failing while trying to load the entire dataset into RAM. 32 bits (4 bytes) per float32 × 1,000,000 × 1,000 is about 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try allocating an array of that size alone:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.

With h5py datasets, we should avoid passing the entire dataset to our methods, and instead pass slices of the dataset, one at a time.
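To see why slicing works, note that any statistic that can be updated incrementally never needs the full array in memory at once. Here is a minimal NumPy sketch of the idea, computing a column mean chunk by chunk (the array here is a small illustrative stand-in; with an h5py dataset, each slice would read only that chunk from disk):

```python
import numpy as np

X = np.arange(20, dtype=np.float32).reshape(10, 2)  # stand-in for a huge dataset
chunk_size = 3

# accumulate a running sum and row count, so the full array is never needed
total = np.zeros(X.shape[1], dtype=np.float64)
count = 0
for i in range(0, X.shape[0], chunk_size):
    chunk = X[i:i + chunk_size]  # with h5py, this slice is read from disk
    total += chunk.sum(axis=0)
    count += chunk.shape[0]

mean = total / count  # matches X.mean(axis=0)
print(mean)
```

IncrementalPCA's partial_fit works on the same principle: each call updates the running estimate of the components from one chunk.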

As I don't have your data, let me start from creating a random dataset of the same size:

import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

This creates a ~3.7 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program (note that recent kernels ignore `ulimit -m`; `ulimit -v`, which caps virtual memory, is a more reliable alternative):

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we'll get a MemoryError. (Press Ctrl-D to quit the new bash session and lift the limit afterwards.)

Let's try to solve the problem. We'll create an IncrementalPCA object, and will call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # rows fed to IPCA at a time; should divide n, or the loop below skips the remainder
ipca = IncrementalPCA(n_components=10, batch_size=16)  # batch_size is only used by .fit(); .partial_fit() uses our chunks

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.

sastanin answered Oct 05 '22