
Genetic algorithms: fitness function for feature selection algorithm

I have a data set of size n x m: there are n observations, and each observation consists of m values for m attributes. Each observation also has an observed result assigned to it. m is big, too big for my task. I am trying to find the smallest subset of the m attributes that still represents the whole data set reasonably well, so that I can use only these attributes for training a neural network.

I want to use a genetic algorithm for this. The problem is the fitness function: it should tell how well a generated model (a subset of attributes) still reflects the original data, and I don't know how to evaluate a given subset of attributes against the whole set. Of course I could use the neural network (which will later use this selected data anyway) to check how good the subset is - the smaller the error, the better the subset. But this takes a lot of time in my case, and I do not want to use this solution. I am looking for some other way that would preferably operate only on the data set.

What I thought about was this: given a subset S (found by the genetic algorithm), trim the data set so that it contains values only for the attributes in S, and count how many observations in this trimmed data set are no longer distinguishable (i.e., have the same values for the same attributes) while having different result values. The bigger that number, the worse the subset. But this seems a bit too computationally expensive to me.
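For concreteness, here is a minimal sketch of that check (assuming the data is a NumPy array X with one row per observation and the results in a vector y; the names are only illustrative):

import numpy as np

def conflict_count(X, y, subset):
    # Count observations that become indistinguishable on the attributes
    # in `subset` while still having different result values.
    seen = {}
    for row, label in zip(X[:, subset], y):
        seen.setdefault(tuple(row), set()).add(label)
    # an observation is in conflict if its trimmed row maps to >1 result
    return sum(len(seen[tuple(row)]) > 1 for row in X[:, subset])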

Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?

asked Nov 03 '11 by agnieszka


People also ask

How is a genetic algorithm used for feature selection?

Genetic algorithms determine an optimal set using an approach modeled on evolution. For feature selection, the first step is to generate a population of subsets of the possible features. From this population, the subsets are evaluated using a predictive model for the target task.
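For illustration, a minimal sketch of that first step, encoding each subset as a boolean mask over the features (the sizes here are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(42)
m, pop_size = 30, 20    # assumed number of features and population size

# one individual per row: True means the feature is included in the subset
population = rng.random((pop_size, m)) < 0.5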

What is the role of the fitness function in a genetic algorithm?

A fitness function is a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims. Fitness functions are used in genetic programming and genetic algorithms to guide simulations towards optimal design solutions.
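As a toy illustration (the names and the penalty weight are assumptions, not a standard), such a single figure of merit for feature selection might reward model quality while penalizing subset size:

def fitness(accuracy, n_selected, m_total, penalty=0.1):
    # single figure of merit: reward accuracy, penalize large subsets
    return accuracy - penalty * (n_selected / m_total)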

What is the best fitness in a genetic algorithm?

The best fitness curve will always be "above" the average fitness curve. If the fitness vs no. of generations curve decreases, then this generally means that the Genetic Algorithm is probably not exploring the solution space adequately and this is typically due to inappropriate mutation and cross-over operations.

What is fitness scaling in a genetic algorithm?

Fitness scaling converts the raw fitness scores that are returned by the fitness function to values in a range that is suitable for the selection function. The selection function uses the scaled fitness values to select the parents of the next generation.
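As one illustration, rank-based scaling is a common variant; a minimal sketch for a NumPy array of raw scores (the function name is mine):

import numpy as np

def rank_scale(raw):
    # rank-based scaling: the scaled value depends only on each score's
    # rank, not on the raw spread; results fall in (0, 1]
    ranks = np.argsort(np.argsort(raw))   # 0 = worst raw score
    return (ranks + 1) / raw.size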


1 Answer

This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.

The higher that sum, the greater the share of the data set's variability that is captured by just those features. If I understand the OP correctly, this cost function is a faithful translation of "represents the whole data set quite well" from the question.

Reducing to code is straightforward:

  1. Calculate the correlation matrix of your data set (first remove the column that holds the response variable, i.e., probably the last one). If your data set is m x n (columns x rows), this matrix will be m x m, with 1s down the main diagonal.

  2. Next, perform an eigenvalue decomposition of this matrix; each eigenvalue's share of the eigenvalue total gives the proportion of the data set's total variability contributed by it (in this scheme, each sorted eigenvalue is paired with one feature, or column). [Note: singular value decomposition (SVD) is often used for this step, but it's unnecessary; an eigenvalue decomposition is simpler and always does the job as long as your matrix is square, which correlation matrices always are.]

  3. Your genetic algorithm will, at each iteration, return a set of candidate solutions (feature subsets, in your case). The next task in a GA, or in any combinatorial optimization, is to rank those candidate solutions by their cost-function score. In your case, the cost function is a simple summation of the eigenvalue proportions for the features in that subset, as sketched after the sample calculation below. (You would probably want to scale/normalize that calculation so that the higher numbers are the least fit, though.)

A sample calculation (using Python + NumPy):

>>> # there are many ways to do an eigenvalue decomp, this is just one way
>>> import numpy as NP
>>> import numpy.linalg as LA

>>> # calculate the correlation matrix of the data set (leaving out the
>>> # response-variable column); `d3` is the data array, one row per observation
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
     (4, 4)
>>> C
     array([[ 1.  , -0.11,  0.87,  0.82],
            [-0.11,  1.  , -0.42, -0.36],
            [ 0.87, -0.42,  1.  ,  0.96],
            [ 0.82, -0.36,  0.96,  1.  ]])

>>> # now calculate eigenvalues & eigenvectors of the correlation matrix:
>>> eva, evc = LA.eig(C)
>>> # now just get the value proportions of each eigenvalue:
>>> # first, sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # cumulative proportion of total variability, eigenvalue by eigenvalue:
>>> eva2 = NP.cumsum(eva1/NP.sum(eva1))   # "cumsum" is just cumulative sum
>>> title1 = "ev  value  proportion"
>>> print(title1)
ev  value  proportion
>>> print("-" * len(title1))
---------------------
>>> for i, (val, prop) in enumerate(zip(eva1, eva2), start=1):
...     print("{0:2d}  {1:5.2f}  {2:10.3f}".format(i, val, prop))

 1   2.91       0.727
 2   0.92       0.953
 3   0.14       0.995
 4   0.02       1.000

So it's the third column of values just above (one per feature) that is summed, selectively, depending on which features are present in a given subset you are evaluating with the cost function.
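Putting that together, a sketch of the cost function itself (continuing the session above; note it uses the per-eigenvalue proportions, eva1/sum(eva1), rather than the cumulative column printed in the table, and it follows this answer's convention of pairing the i-th sorted eigenvalue with the i-th feature):

>>> def subset_score(eva_sorted, subset):
...     # sum the variance proportions of the eigenvalues paired with
...     # the features in `subset` (an index list or boolean mask)
...     proportions = eva_sorted / NP.sum(eva_sorted)
...     return NP.sum(proportions[subset])

>>> subset_score(eva1, [0, 2])   # features 1 and 3: about 0.76 with the values above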

answered Nov 05 '22 by doug