 

Fine-tuning parameters in Logistic Regression


I am running a logistic regression with TF-IDF applied to a text column. This is the only column I use in my logistic regression. How can I ensure its parameters are tuned as well as possible?

I would like to be able to run through a set of steps which would ultimately allow me to say that my logistic regression classifier is running as well as it possibly can.

from sklearn import metrics, preprocessing, cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import numpy as np
import pandas as p

print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]

# word-level tf-idf over unigrams and bigrams
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=1, smooth_idf=1,
                      sublinear_tf=1)

# L2-regularized logistic regression, C left at its default of 1
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None)

# fit the vectorizer on train + test text, then split back apart
X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)

X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
asked Feb 16 '14 by Simon Kiely

People also ask

What is penalty parameter in logistic regression?

Penalized logistic regression imposes a penalty to the logistic model for having too many variables. This results in shrinking the coefficients of the less contributive variables toward zero. This is also known as regularization.
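You can observe this shrinkage directly. A minimal sketch (synthetic data for illustration; in scikit-learn the regularization strength is controlled by C, the inverse of the penalty weight):

# Sketch: stronger regularization (smaller C) shrinks the coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for C in [0.01, 1, 100]:
    model = LogisticRegression(penalty='l2', C=C).fit(X, y)
    # mean |coefficient| grows as C increases (i.e. as the penalty weakens)
    print(C, np.abs(model.coef_).mean())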

What are the parameters of logistic regression?

In this logistic regression equation, logit(pi) is the dependent or response variable and x is the independent variable. The beta parameter, or coefficient, in this model is commonly estimated via maximum likelihood estimation (MLE).
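Written out, that equation (the standard statement of the model, not specific to the question's code) is:

logit(p) = ln(p / (1 - p)) = b0 + b1*x

which inverts to the familiar sigmoid form:

p = 1 / (1 + exp(-(b0 + b1*x)))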

Can we tune logistic regression?

Logistic regression does not really have any critical hyperparameters to tune. Sometimes, you can see useful differences in performance or convergence with different solvers (solver). Regularization (penalty) can sometimes be helpful. Note: not all solvers support all regularization terms.
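A minimal sketch of comparing solvers by cross-validated AUC (solver names are from scikit-learn's LogisticRegression; availability and defaults vary by version, and the data here is synthetic):

# Sketch: compare LogisticRegression solvers via cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
for solver in ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']:
    clf = LogisticRegression(solver=solver, max_iter=1000)
    score = cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()
    print(solver, round(score, 4))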

What hyperparameters are important when tuning logistic regression models?

The main hyperparameters we may tune in logistic regression are: solver, penalty, and regularization strength (sklearn documentation).
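A sketch of searching over all three at once (the grid values are illustrative; since not every solver supports every penalty, the grid is split into compatible groups):

# Sketch: tune solver, penalty, and C together.
# liblinear supports l1 and l2; lbfgs supports l2 only.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.01, 0.1, 1, 10]},
]
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='roc_auc', cv=5)
# search.fit(X, y); print(search.best_params_)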


2 Answers

You can use grid search to find the best C value. Basically, a smaller C specifies stronger regularization.

>>> from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions
>>> from sklearn.linear_model import LogisticRegression
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
>>> clf.fit(X, y)
GridSearchCV(estimator=LogisticRegression(C=1.0, intercept_scaling=1,
                 dual=False, fit_intercept=True, penalty='l2', tol=0.0001),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]})

See the GridSearchCV documentation for more details on applying it to your problem.
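Applied to the question's setup, a sketch (assuming a scikit-learn version with sklearn.model_selection and Pipeline; the grid values are illustrative) that tunes the TfidfVectorizer and the classifier together, so the vectorizer is refit inside each CV fold rather than on train + test combined as in the question's code:

# Sketch: jointly tune tf-idf settings and C in one pipeline.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('tfidf', TfidfVectorizer(sublinear_tf=True)),
                 ('clf', LogisticRegression())])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 3, 5],
    'clf__C': [0.01, 0.1, 1, 10, 100],
}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
search.fit(traindata, y)  # raw text in, best params out
print(search.best_params_, search.best_score_)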

answered Sep 19 '22 by lennon310


Grid search is a brute-force way of finding the optimal parameters because it trains and tests every possible combination. A better approach is Bayesian optimization, which learns from past evaluation scores and takes less computation time.
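For example, with the scikit-optimize package (an assumption here: it must be installed separately, e.g. pip install scikit-optimize), BayesSearchCV is a near drop-in replacement for GridSearchCV:

# Sketch: Bayesian optimization of C with scikit-optimize's BayesSearchCV.
# Assumes scikit-optimize (skopt) is installed; the API mirrors GridSearchCV.
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.linear_model import LogisticRegression

opt = BayesSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': Real(1e-3, 1e3, prior='log-uniform')},  # search C on a log scale
    n_iter=30, scoring='roc_auc', cv=5)
# opt.fit(X, y); print(opt.best_params_)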

answered Sep 17 '22 by viplov