I am trying the code from this page. I ran it up to the LR (tf-idf) part and got similar results. After that I decided to try GridSearchCV. My questions are below:
1)
#let's try GridSearchCV
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 = lasso, l2 = ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("best score:", logreg_cv.best_score_)
#tuned hyperparameters (best parameters): {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823
Then I calculated the F1 score manually. Why does it not match?
#https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score

#threshold the predicted probabilities at 0.5 to get class labels
final_prediction = np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:, 1] >= 0.5, 1, 0)

#calculate the F1 score on the training data
f1_score(y_train, final_prediction)
0.9839388145315489
Why does scoring='precision' give the error below? I am not clear on this, mainly because I have a relatively balanced dataset (55-45%) and F1, which requires precision, is calculated without any problems.

#let's try GridSearchCV
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
from sklearn.model_selection import GridSearchCV
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 = lasso, l2 = ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("best score:", logreg_cv.best_score_)
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
(the same warning is emitted repeatedly, once per cross-validation fit)
tuned hyperparameters (best parameters): {'C': 0.1, 'penalty': 'l2'}
best score: 0.9474200393672962
How can I get the predictions on the train data back from the logreg_cv object? I used the method below to get the predictions back. Is there a better way to do the same?

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
############################
############ Update 1
The best score in GridSearchCV is calculated by taking the average score from cross-validation for the best estimator. That is, it is calculated on data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and computing an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger than the results from the grid search.
Is that the reason I get the results below from the grid search,

#tuned hyperparameters (best parameters): {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823

but when I calculate it manually I get the following?

f1_score(y_train, final_prediction)
0.9839388145315489
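To see the gap concretely, the held-out F1 of the tuned model can be computed directly with cross_val_score (a minimal sketch, assuming the same X_train_vectors_tfidf / y_train objects used above); it should land near the ~0.74 reported by best_score_, while scoring on the data the model was fit on gives the inflated ~0.98:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

# same hyperparameters that GridSearchCV selected
best_lr = LogisticRegression(solver='liblinear', C=10.0, penalty='l2')

# F1 measured on held-out folds (this is what best_score_ averages)
cv_f1 = cross_val_score(best_lr, X_train_vectors_tfidf, y_train, cv=3, scoring='f1')
print("mean CV F1 :", cv_f1.mean())

# F1 measured on the same data the model was fit on (optimistically biased)
best_lr.fit(X_train_vectors_tfidf, y_train)
print("train F1   :", f1_score(y_train, best_lr.predict(X_train_vectors_tfidf)))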
2)
I tried to tune using f1_micro as suggested in the answer below. There is no error message, but I am still not clear why f1_micro does not fail when precision fails.
from sklearn.model_selection import GridSearchCV
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"], "solver": ['liblinear', 'newton-cg'],
        'class_weight': [{0: 0.95, 1: 0.05}, {0: 0.55, 1: 0.45}, {0: 0.45, 1: 0.55}, {0: 0.05, 1: 0.95}]}  # l1 = lasso, l2 = ridge
#logreg=LogisticRegression(solver = 'liblinear')
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("best score:", logreg_cv.best_score_)
tuned hyperparameters (best parameters): {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score: 0.7894909688013136
There are two different approaches you can take: use GridSearchCV to perform hyperparameter tuning on one model, or on multiple models.

What is GridSearchCV? GridSearchCV is part of sklearn's model_selection package. It loops through the predefined hyperparameters and fits your estimator (model) on your training set, so in the end you can select the best parameters from the listed hyperparameters.

For LogisticRegression, grid search is applied here to select the most appropriate value of the inverse regularization strength, C. For this case you could just as well have used validation_curve (sklearn.model_selection) to select the most appropriate value of C, as sketched below.
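A minimal sketch of that validation_curve alternative (assuming the same X_train_vectors_tfidf / y_train objects as in the question):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

C_range = np.logspace(-3, 3, 7)

# cross-validated F1 for each value of C, without a full grid search
train_scores, test_scores = validation_curve(
    LogisticRegression(solver='liblinear', penalty='l2'),
    X_train_vectors_tfidf, y_train,
    param_name='C', param_range=C_range,
    cv=3, scoring='f1',
)

# pick the C with the highest mean held-out score
best_C = C_range[test_scores.mean(axis=1).argmax()]
print("best C by mean CV F1:", best_C)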
You end up with the error with precision because some of your penalization is too strong for this model. If you check the results, you get an F1 score of 0 when C = 0.001 and C = 0.01:
import pandas as pd

res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:, res.columns.str.contains("split[0-9]_test_score|params", regex=True)]
params split0_test_score split1_test_score split2_test_score
0 {'C': 0.001, 'penalty': 'l2'} 0.000000 0.000000 0.000000
1 {'C': 0.01, 'penalty': 'l2'} 0.000000 0.000000 0.000000
2 {'C': 0.1, 'penalty': 'l2'} 0.973568 0.952607 0.952174
3 {'C': 1.0, 'penalty': 'l2'} 0.863934 0.851064 0.836449
4 {'C': 10.0, 'penalty': 'l2'} 0.811634 0.769547 0.787838
5 {'C': 100.0, 'penalty': 'l2'} 0.789826 0.762162 0.773438
6 {'C': 1000.0, 'penalty': 'l2'} 0.781003 0.750000 0.763871
You can check this:
lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf,y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])
And that the predicted probabilities drift towards the intercept:
# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])
lr.predict_proba(X_train_vectors_tfidf)
array([[0.58732636, 0.41267364],
[0.57074279, 0.42925721],
[0.57219143, 0.42780857],
...,
[0.57215605, 0.42784395],
[0.56988186, 0.43011814],
[0.58966184, 0.41033816]])
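If you still want to grid-search on precision despite this, the UndefinedMetricWarning can be silenced as the warning message suggests, by passing a scorer with an explicit zero_division. A minimal sketch, assuming the same training objects as above:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV

# score ill-defined precision (no positive predictions) as 0 instead of warning
precision_scorer = make_scorer(precision_score, zero_division=0)

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
logreg_cv = GridSearchCV(LogisticRegression(solver='liblinear'), grid,
                         cv=3, scoring=precision_scorer)
logreg_cv.fit(X_train_vectors_tfidf, y_train)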
For the question on getting predictions on the train data back, I think that's the only way. The model is refitted on the whole training set using the best parameters, but the predictions or predicted probabilities are not stored. If you are looking for the values obtained during the train/test splits, you can check cross_val_predict.
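A minimal sketch of that alternative (again assuming the same objects as in the question); it returns out-of-fold predicted probabilities, i.e. each row is predicted by a model that did not see it during fitting:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# out-of-fold probabilities for the positive class
oof_proba = cross_val_predict(
    LogisticRegression(solver='liblinear', C=10.0, penalty='l2'),
    X_train_vectors_tfidf, y_train,
    cv=3, method='predict_proba',
)[:, 1]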