I'm trying to do principal component analysis on datasets containing images, but whenever I want to apply pca.transform from the sklearn.decomposition module I keep getting this error: *AttributeError: 'PCA' object has no attribute 'mean_'*. I know what this error means, but I have no clue how to fix it. I reckon some of you guys know how to fix this.
Thank you for your help
My code:
from sklearn import svm
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA

# raw strings, so that e.g. "\t" in "\train" is not read as a tab character
image_dir1 = r"C:\Users\private\Desktop\K FOLDER\private\train"
image_dir2 = r"C:\Users\private\Desktop\K FOLDER\private\test1"
Standard_size = (300, 200)
pca = PCA(n_components=10)
file_open = lambda x, y: glob.glob(os.path.join(x, y))
def matrix_image(image_path):
    "opens image and converts it to an m*n matrix"
    image = Image.open(image_path)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())   # list of per-pixel tuples
    image = map(list, image)        # tuples -> lists (Python 2 map returns a list)
    image = np.array(image)
    return image
def flatten_image(image):
    """
    takes in an n*m numpy array and flattens it to
    an array of size (1, m*n)
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1, s)
    return image_wide[0]
if __name__ == "__main__":
    train_images = file_open(image_dir1, "*.jpg")
    test_images = file_open(image_dir2, "*.jpg")
    train_set = []
    test_set = []

    # Loop over all images in the folders and modify them
    train_set = [flatten_image(matrix_image(image)) for image in train_images]
    test_set = [flatten_image(matrix_image(image)) for image in test_images]
    train_set = np.array(train_set)
    test_set = np.array(test_set)
    train_set = pca.transform(train_set)    # line where the error occurs
    test_set = pca.transform(test_set)
Full traceback:
Traceback (most recent call last):
  File "C:\Users\Private\workspace\final_submission\src\d.py", line 54, in <module>
    train_set = pca.transform(train_set)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\pca.py", line 298, in transform
    if self.mean_ is not None:
AttributeError: 'PCA' object has no attribute 'mean_'
Edit1: So I tried to fit the model before transforming, and now I'm getting an even weirder error. I looked it up, and it involves f2py, the part of the NumPy library that ports Fortran code to Python.
File "C:\Users\Private\workspace\final_submission\src\d.py", line 54, in <module>
pca.fit(train_set)
File "C:\Python27\lib\site-packages\sklearn\decomposition\pca.py", line 200, in fit
self._fit(X)
File "C:\Python27\lib\site-packages\sklearn\decomposition\pca.py", line 249, in _fit
U, S, V = linalg.svd(X, full_matrices=False)
File "C:\Python27\lib\site-packages\scipy\linalg\decomp_svd.py", line 100, in svd
full_matrices=full_matrices, overwrite_a = overwrite_a)
ValueError: failed to create intent(cache|hide)|optional array-- must have defined dimensions but got (0,)
Edit2:
So I checked whether my train_set and test_set contain any data, and they don't. I've also checked my image_dirs, and they contain the right locations (for clarity: I got them by going to the actual folders, looking at the properties of one of the images, and copying the location). The fault should lie somewhere else.
PCA should be used mainly for variables that are strongly correlated. If the relationships between variables are weak, PCA does not reduce the data well; refer to the correlation matrix to decide. In general, if most of the correlation coefficients are smaller than 0.3, PCA will not help. A quick check is sketched below.
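As a rough illustration (a minimal sketch, not from the original post; the 0.3 threshold is just the rule of thumb above):

import numpy as np

def mostly_uncorrelated(X, threshold=0.3):
    """Return True if most off-diagonal entries of the feature
    correlation matrix are below the threshold."""
    corr = np.corrcoef(X, rowvar=False)   # features as columns
    off_diag = np.abs(corr[~np.eye(corr.shape[0], dtype=bool)])
    return np.mean(off_diag < threshold) > 0.5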
PCA is popular because it can effectively find an optimal representation of a data set with fewer dimensions. It is effective at filtering noise and decreasing redundancy.
When a data set is not linearly distributed, e.g. arranged along non-orthogonal axes or better described by a geometric parameter, PCA can fail to represent the data well and to recover the original data from the projected variables.
Principal components also have low interpretability: they are linear combinations of the features from the original data, but they are not as easy to interpret. For example, after computing the principal components it is difficult to tell which features in the dataset are the most important. You can at least inspect the loadings, as sketched below.
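A minimal sketch of that inspection, assuming a scikit-learn PCA object named pca that has already been fitted:

import numpy as np

# assumes `pca` has been fitted on some data
print(pca.explained_variance_ratio_)      # variance captured by each component
first_component = pca.components_[0]      # loadings of the first component
print(np.argsort(np.abs(first_component))[::-1][:10])   # indices of the 10 largest loadings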
You should fit the model before you transform, and fit it on the training data only (a second fit would overwrite the first):

train_set = np.array(train_set)
test_set = np.array(test_set)
pca.fit(train_set)                      # learns the components and mean_ from the training data
train_set = pca.transform(train_set)    # the line where the error occurred
test_set = pca.transform(test_set)      # projected onto the components learned from train_set
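A self-contained sketch of the same pattern on random data (array names and sizes are illustrative only):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
train = rng.rand(20, 50)                # 20 samples, 50 features
test = rng.rand(5, 50)

pca = PCA(n_components=10)
pca.fit(train)                          # after this call, pca.mean_ exists
print(pca.transform(train).shape)       # (20, 10)
print(pca.transform(test).shape)        # (5, 10)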
Edit
The second error indicates that your train_set is empty. It can easily be reproduced using this code:
train_set = np.array([[]])
pca.fit(train_set)
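A simple guard (my addition, not in the original code) catches this before it reaches the SVD:

train_set = np.array(train_set)
if train_set.size == 0:
    raise ValueError("train_set is empty -- check image_dir1 and the glob pattern")
pca.fit(train_set)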
I think one problem is in your flatten_image function. I may be wrong, but a line like this will raise an AttributeError, because it tries to set an attribute on a numpy array:

image.wide = image.reshape(1,s)
It can be replaced with:
image_wide = image.reshape(1,s)
return image_wide[0]
This line is problematic too:
print("changing size from %s to %s" % str(image.size), str(Standard_size))
Read http://docs.python.org/2/library/stdtypes.html#string-formatting-operations for more details; when there is more than one value, the right-hand side of % must be a tuple. So you want this instead:
print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
Another edit
Finally, replace the loops after "Loop over all images in files and modify them" with:
train_set = [flatten_image(matrix_image(image)) for image in train_images]
test_set = [flatten_image(matrix_image(image)) for image in test_images]
Right now you call file_open in a way that makes it look for files in a path like this: "C:\Users\private\Desktop\K FOLDER\private\train\C:\Users\private\Desktop\K FOLDER\private\train\foo.jpg", so you get an empty list instead of file names.
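A quick way to check is to print the pattern and the number of matches (a debugging sketch, reusing the path from the question):

import glob, os

pattern = os.path.join(r"C:\Users\private\Desktop\K FOLDER\private\train", "*.jpg")
print(pattern)
print(len(glob.glob(pattern)))    # 0 means the directory or pattern is wrong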