 

Selection of features using PCA

I am doing unsupervised classification. For this I have 8 features per image (variance of green, std. dev. of green, mean of red, variance of red, std. dev. of red, mean of hue, variance of hue, std. dev. of hue) and I want to select the 3 most significant features using PCA. I have written the following code for feature selection (the feature matrix has dimensions 179x8):

mu = mean(feature);                     % column means, computed once
for c = 1:size(feature,1)
    feature(c,:) = feature(c,:) - mu;   % center each sample
end

DataCov=cov(feature); % covariance matrix
[PC,variance,explained] = pcacov(DataCov)

This gives me :

PC =

0.0038   -0.0114    0.0517    0.0593    0.0039    0.3998    0.9085   -0.0922
0.0755   -0.1275    0.6339    0.6824   -0.3241   -0.0377   -0.0641    0.0052
0.7008    0.7113   -0.0040    0.0496   -0.0207    0.0042    0.0012    0.0002
0.0007   -0.0012    0.0051    0.0101    0.0272    0.0288    0.0873    0.9953
0.0320   -0.0236    0.1521    0.2947    0.9416   -0.0142   -0.0289   -0.0266
0.7065   -0.6907   -0.1282   -0.0851    0.0060    0.0003    0.0010   -0.0001
0.0026   -0.0037    0.0632   -0.0446    0.0053    0.9125   -0.4015    0.0088
0.0543   -0.0006    0.7429   -0.6574    0.0838   -0.0705    0.0311   -0.0001

variance =

0.0179
0.0008
0.0001
0.0000
0.0000
0.0000
0.0000
0.0000

explained =

94.9471
4.1346
0.6616
0.2358
0.0204
0.0003
0.0002
0.0000

This means the first principal component explains 94.9% of the variance, and so on, but these are ordered from most to least significant. How can I know which features (from 1 to 8) should be selected based on the above information?

asked Nov 09 '12 by Dev



2 Answers

Your problem is the same as the COLUMNSELECT problem discussed by Mahoney and Drineas in "CUR matrix decompositions for improved data analysis".

They first compute a leverage score for each dimension and then select 3 of them at random, using the leverage scores as weights. Alternatively, you can simply take the dimensions with the largest scores. Here is the script for your problem:

I first got a real nature image from the web and resized it to the dimensions you asked about. The image is as follows:

(image: a nature photograph resized to 179x8)

%# Example data from real image of size 179x8
%# You can skip it for your own data
features = im2double(rgb2gray(imread('img.png')));

%# m samples, n dimensions
[m,n] = size(features);

Then, compute the centered data:

%# Remove the mean
features = features - repmat(mean(features,2), 1, size(features,2));

I use SVD to compute PCA since it gives you both the principal components and the coefficients. If the samples are in columns, then U holds the principal components. Check the second page of this paper for the relationship.

%# Compute the SVD
[U,S,V] = svd(features);

The key idea here is that we want to keep the dimensions carrying most of the variation, under the assumption that there is some noise in the data. We select only the dominant eigenvectors, e.g. those representing 95% of the data.

%# Compute the number of eigenvectors representing
%#  the 95% of the variation
coverage = cumsum(diag(S));
coverage = coverage ./ max(coverage);
[~, nEig] = max(coverage > 0.95);

Then the leverage scores are computed using the top nEig principal components. That is, for each dimension we take the squared norm of its first nEig coefficients.

%# Compute the norms of each vector in the new space
norms = zeros(n,1);
for i = 1:n
    norms(i) = norm(V(i,1:nEig))^2;
end

Then, we can sort the leverage scores:

%# Get the largest 3
[~, idx] = sort(norms, 'descend');
idx(1:3)'

and get the indices of the vectors with the largest leverage scores:

ans =
   6     8     5
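For completeness, here is a minimal sketch (my addition, not part of the original script) of the randomized COLUMNSELECT-style variant mentioned above: 3 dimensions are drawn at random, without replacement, with probabilities proportional to the leverage scores in norms.

%# Sketch (assumption): weighted random selection of 3 dimensions,
%# with probabilities proportional to the leverage scores in norms
k = 3;
p = norms / sum(norms);            % sampling probabilities
picked = zeros(1, k);
for j = 1:k
    r = rand * sum(p);             % draw within the remaining probability mass
    c = cumsum(p);
    picked(j) = find(c >= r, 1, 'first');
    p(picked(j)) = 0;              % remove it: sample without replacement
end
picked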

You can check the paper for more details.

But keep in mind that PCA-based techniques are good when you have many dimensions. In your case, the search space is very small. My advice is to search the space exhaustively and take the best selection, as @amit recommends.

answered by petrichor


PCA actually generates a set of new features, each of which is a linear combination of the original features.

Thus, the vector you get cannot be translated directly into a subset of the original features to choose in order to retain this variance; it just creates new features based on the originals.
In your case, you get:

New_Feature = 0.0038*F1 + 0.0755*F2 + 0.7008*F3 + ... + 0.0543*F8

This New_Feature alone captures 94.9471% of the variance, despite the dimensionality reduction.
(And if you do the same for the next principal components and use them as well, you obviously retain even more of the information.)
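
As a minimal sketch (an illustration, assuming feature is the centered 179x8 matrix and PC is the coefficient matrix returned by pcacov above), these new features are obtained by projecting the data onto the leading principal components:

%# Project the centered data onto the first 3 principal components.
%# Column 1 of newFeatures is the New_Feature written out above.
newFeatures = feature * PC(:, 1:3);   % 179x3 matrix of PCA scores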

If you need a subset of the original features, rather than newly created ones, I would use other methods instead of PCA.

Genetic algorithms are usually pretty good for subset selection, and since your set includes only 8 features, you could also consider a brute-force search: there are only 2^8 = 256 possible subsets, so in many cases it is feasible to try them all and see which gives the best performance.
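
As an illustration only (evaluateSubset is a hypothetical scoring function you would have to supply, e.g. some clustering-quality criterion for your unsupervised task; feature is the 179x8 matrix from the question), a brute-force search could look like:

%# Enumerate all 255 non-empty subsets of the 8 features and keep the
%# best one. evaluateSubset is a hypothetical placeholder for whatever
%# criterion fits your unsupervised classification task.
bestScore  = -Inf;
bestSubset = [];
for mask = 1:255
    subset = find(bitget(mask, 1:8));            % feature indices in this subset
    score  = evaluateSubset(feature(:, subset)); % hypothetical scoring function
    if score > bestScore
        bestScore  = score;
        bestSubset = subset;
    end
end
bestSubset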

answered by amit