Finding optimal features using Lasso regression in binary classification

I am working with a large dataset and want to find its important features. I am a biologist by training, so please forgive my lack of machine-learning knowledge.

My dataset has about 5000 attributes and 500 samples, with binary class labels 0 and 1. The dataset is also imbalanced: about 400 of the samples are 0s and 100 are 1s. I want to find the features that have the most influence in determining the class.

      A1   A2   A3  ... Gn  Class
S1   1.0  0.8 -0.1 ... 1.0  0
S2   0.8  0.4  0.9 ... 1.0  0
S3  -1.0 -0.5 -0.8 ... 1.0  1
...

Following advice from a previous question, I am trying to use Lasso regression with an L1 penalty and treat the attributes with high coefficients as the important features, because the L1 penalty shrinks the coefficients of unimportant features to 0.

I am doing this work with the scikit-learn library.
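
To be concrete, this is roughly what I have in mind (a minimal sketch; X is my 500 x 5000 feature matrix, y the class labels, and the alpha value is just a placeholder since I do not know how to choose it):

import numpy as np
from sklearn.linear_model import Lasso

# X: 500 samples x 5000 attributes, y: binary class labels (0/1)
lasso = Lasso(alpha=0.01)  # placeholder alpha
lasso.fit(X, y)

# the L1 penalty drives coefficients of unimportant attributes to exactly 0,
# so the nonzero coefficients point at candidate important features
important = np.nonzero(lasso.coef_)[0]
print(len(important), "attributes with nonzero coefficients")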

So, my questions are as follows.

  1. Can I use Lasso regression for an imbalanced binary classification problem? If not, is logistic regression a good alternative, even though it does not use an L1 penalty?

  2. How can I find the optimal value of alpha using LassoCV? The documentation says that LassoCV supports this, but I cannot find the function.

  3. Is there another good approach for this kind of classification?

Thank you very much.

asked by z991

1 Answer

You should use a classifier instead of a regressor, so either an SVM or logistic regression would do the job. For both you can use SGDClassifier, setting the loss parameter to 'log' for logistic regression (spelled 'log_loss' in recent scikit-learn versions) or 'hinge' for a linear SVM. In SGDClassifier you can set the penalty to 'l1', 'l2', or 'elasticnet', which is a combination of both.
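
For example, a minimal sketch of an L1-penalised logistic regression (assuming X_train and Y_train hold your training data, and using a scikit-learn version >= 1.1 for the 'log_loss' spelling):

from sklearn.linear_model import SGDClassifier

# L1-penalised logistic regression; the sparse coefficients double as
# feature selection ('log_loss' in scikit-learn >= 1.1, 'log' in older versions)
clf = SGDClassifier(loss='log_loss', penalty='l1', alpha=1e-4,
                    class_weight='balanced')
clf.fit(X_train, Y_train)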

You can find an optimum value of alpha either by looping over different values and evaluating the performance on a validation set, or by using GridSearchCV:

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# search alpha over 10^-6 ... 10^-3
tuned_parameters = {'alpha': [10 ** a for a in range(-6, -2)]}
clf = GridSearchCV(
    SGDClassifier(loss='hinge', penalty='elasticnet', l1_ratio=0.15,
                  max_iter=1000, shuffle=True, class_weight='balanced'),
    tuned_parameters, cv=10, scoring='f1_macro', n_jobs=10)

# now clf is the best classifier found given the search space
clf.fit(X_train, Y_train)
# you can find the best alpha here
print(clf.best_params_)

This searches over the range of alpha values you provided in tuned_parameters and finds the best one. You can change the performance criterion from 'f1_macro' to 'f1_weighted' or another metric.

To address the label imbalance in your dataset, set the class_weight parameter of SGDClassifier to 'balanced'.

To find the top 10 features contributing to the class labels, you can find their indices as follows:

import numpy as np

# coef_ has one row per class (a single row for binary classification);
# argsort gives the indices of the 10 largest coefficients in each row
for i in range(clf.best_estimator_.coef_.shape[0]):
    top10 = np.argsort(clf.best_estimator_.coef_[i])[-10:]
    print(top10)

Note 1: It is always good to keep part of your dataset aside as a validation/test set and, after finding your optimum model, to evaluate it on that held-out data.
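
For instance, a minimal sketch using train_test_split (assuming X and Y hold the full dataset; stratify keeps the 400/100 class ratio in both parts):

from sklearn.model_selection import train_test_split

# hold out 20% of the samples, preserving the class ratio
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)

# fit the grid search on (X_train, Y_train) as above, then evaluate once:
print(clf.score(X_test, Y_test))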

Note 2: It is usually worth experimenting with different types of feature normalisation and sample normalisation, dividing each row or column by its 'l1' or 'l2' norm, and checking the effect on performance; scikit-learn's Normalizer does this.
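
A minimal sketch (Normalizer scales each sample, i.e. each row, to unit norm):

from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')                # or norm='l1'
X_train_norm = normalizer.fit_transform(X_train)  # each row scaled to unit norm
X_test_norm = normalizer.transform(X_test)

# for per-feature (column) scaling, sklearn.preprocessing.normalize(X, axis=0)
# normalises columns instead of rows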

Note 3: For elasticnet regularisation, experiment with the l1_ratio parameter.
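
For example, l1_ratio can simply be added to the grid searched above (the candidate values here are illustrative; l1_ratio=1.0 is pure L1, l1_ratio=0.0 pure L2):

tuned_parameters = {
    'alpha': [10 ** a for a in range(-6, -2)],
    'l1_ratio': [0.05, 0.15, 0.5, 0.9, 1.0],
}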

answered by Ash