Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python PCA on Matrix too large to fit into memory

I have a csv that is 100,000 rows x 27,000 columns that I am trying to do PCA on to produce a 100,000 rows X 300 columns matrix. The csv is 9GB large. Here is currently what I'm doing:

from sklearn.decomposition import PCA as RandomizedPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
X = pd.DataFrame.from_csv(dataset)
Y = X.pop("Y_Level")
X = (X - X.mean()) / (X.max() - X.min())
Y = list(Y)
dimensions = 300
sklearn_pca = RandomizedPCA(n_components=dimensions)
X_final = sklearn_pca.fit_transform(X)

When I run the above code, my program is killed while doing the .from_csv in step. I've been able to get around that by spliting the csv into sets of 10,000; reading them in 1 by 1, and then calling pd.concat. This allows me to get to the normalization step (X - X.mean()).... before getting killed. Is my data just too big for my macbook air? Or is there a better way to do this. I would really love to use all the data I have for my machine learning application.


If i wanted to use incremental PCA as suggested by the answer below, is this how I would do it?:

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
chunksize_ = 10000
#total_size is 100000
dimensions = 300

reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
Y = []
for chunk in reader:
    y = chunk.pop("virginica")
    Y = Y + list(y)
    sklearn_pca.partial_fit(chunk)
X = ???
#This is were i'm stuck, how do i take my final pca and output it to X,
#the normal transform method takes in an X, which I don't have because I
#couldn't fit it into memory.

I can't find any good examples online.

like image 811
mt88 Avatar asked Aug 24 '15 20:08

mt88


1 Answers

Try to divide your data or load it by batches into script, and fit your PCA with Incremetal PCA with it's partial_fit method on every batch.

from sklearn.decomposition import IncrementalPCA
import csv
import sys
import numpy as np
import pandas as pd

dataset = sys.argv[1]
chunksize_ = 5 * 25000
dimensions = 300

reader = pd.read_csv(dataset, sep = ',', chunksize = chunksize_)
sklearn_pca = IncrementalPCA(n_components=dimensions)
for chunk in reader:
    y = chunk.pop("Y")
    sklearn_pca.partial_fit(chunk)

# Computed mean per feature
mean = sklearn_pca.mean_
# and stddev
stddev = np.sqrt(sklearn_pca.var_)

Xtransformed = None
for chunk in pd.read_csv(dataset, sep = ',', chunksize = chunksize_):
    y = chunk.pop("Y")
    Xchunk = sklearn_pca.transform(chunk)
    if Xtransformed == None:
        Xtransformed = Xchunk
    else:
        Xtransformed = np.vstack((Xtransformed, Xchunk))

Useful link

like image 150
Ibraim Ganiev Avatar answered Sep 27 '22 19:09

Ibraim Ganiev