I am running a logistic regression with tf-idf applied to a text column. This is the only column I use in my logistic regression. How can I ensure its parameters are tuned as well as possible?
I would like to be able to run through a set of steps that would ultimately allow me to say that my logistic regression classifier is running as well as it possibly can.
from sklearn import metrics, preprocessing, model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p
import numpy as np
loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')
print("loading data..")
# the text column is the third column of each file; the label is the last column
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1].astype(int)  # cast labels to integers for the scorer
# word- and bigram-level tf-idf features
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=True, smooth_idf=True,
                      sublinear_tf=True)
# liblinear is the solver that supports dual=True with an L2 penalty
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None,
                           solver='liblinear')
X_all = traindata + testdata
lentrain = len(traindata)
print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print("20 Fold CV Score: ", np.mean(model_selection.cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))
print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")
Penalized logistic regression imposes a penalty on the logistic model for having too many variables, which shrinks the coefficients of the less contributive variables toward zero. This is also known as regularization.
The logistic regression model can be written as logit(pi) = ln(pi / (1 - pi)) = b0 + b1*x, where logit(pi) is the dependent or response variable and x is the independent variable. The beta parameters, or coefficients, in this model are commonly estimated via maximum likelihood estimation (MLE).
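To see the shrinkage effect concretely, here is a minimal sketch on synthetic data (the dataset, values, and variable names are purely illustrative assumptions, not part of the benchmark above): smaller C means a stronger L2 penalty and smaller coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(penalty='l2', C=C, solver='liblinear').fit(X_demo, y_demo)
    # the mean absolute coefficient shrinks as C decreases (i.e. as the penalty strengthens)
    print(C, np.abs(clf.coef_).mean())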
Logistic regression does not really have any critical hyperparameters to tune. Sometimes, you can see useful differences in performance or convergence with different solvers (solver). Regularization (penalty) can sometimes be helpful. Note: not all solvers support all regularization terms.
The main hyperparameters to tune for logistic regression are the solver (solver), the penalty (penalty), and the regularization strength (C); see the scikit-learn documentation for details.
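Because the question also involves a TfidfVectorizer, one option is to chain it with the classifier in a Pipeline and search over both sets of parameters at once, so the vectorizer is refit inside each CV fold. A minimal sketch, assuming traindata and y from the code above; the parameter grid is an illustrative choice, not a recommended set of values:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('tfidf', TfidfVectorizer(strip_accents='unicode',
                                           token_pattern=r'\w{1,}',
                                           sublinear_tf=True)),
                 ('lr', LogisticRegression(solver='liblinear'))])
# illustrative grid; the step__param names address each stage of the pipeline
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__min_df': [1, 3, 5],
              'lr__penalty': ['l1', 'l2'],
              'lr__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
search.fit(traindata, y)   # raw text goes in; tf-idf is refit inside each fold
print(search.best_params_, search.best_score_)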
You can use grid search to find the best C value for you. Basically, a smaller C specifies stronger regularization.
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
>>> clf
GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]})
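After calling clf.fit on the training data (here X and y would be the tf-idf matrix and labels from the question's code), the tuned settings are exposed as attributes:
>>> clf.fit(X, y)
>>> clf.best_params_      # e.g. {'C': 1}; the winning value depends on your data
>>> clf.best_score_       # mean cross-validated score for that value
>>> clf.best_estimator_   # a LogisticRegression already refit on all of X with the best C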
See the GridSearchCV documentation for more details on your application.
Grid search is a brute-force way of finding the optimal parameters because it trains and evaluates every possible combination. An alternative is Bayesian optimization, which learns from past evaluation scores and typically needs fewer evaluations.
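If you want to try that, the scikit-optimize package provides a drop-in BayesSearchCV. This is only a sketch, assuming skopt is installed; the search space below is an illustrative choice:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical
from sklearn.linear_model import LogisticRegression
search_spaces = {'C': Real(1e-3, 1e3, prior='log-uniform'),
                 'penalty': Categorical(['l1', 'l2'])}
opt = BayesSearchCV(LogisticRegression(solver='liblinear'),  # liblinear handles both penalties
                    search_spaces, n_iter=30, cv=5, scoring='roc_auc', random_state=0)
opt.fit(X, y)   # X, y as in the question's code (or raw text if wrapped in a Pipeline)
print(opt.best_params_, opt.best_score_)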