I'm misunderstanding the idea behind the minima in the derivation of the logistic regression formula.
The idea is to make the hypothesis as confident as possible for correct predictions (i.e. push the predicted probability of the correct class as close to 1 as possible), which in turn requires minimising the cost function $J(\theta)$ as much as possible.
Now I've been told that for this all to work, the cost function must be convex. My understanding is that a convex function has no local maxima, and therefore only one minimum, the global minimum. Is this really the case? If not, please explain why. Also, if it's not the case, that implies the possibility of multiple minima in the cost function, and therefore multiple sets of parameters yielding higher and higher probabilities. Is this possible? Or can I be certain that the returned parameters correspond to the global minimum, and hence the highest probability/best prediction?
Cost function in Logistic Regression
In a non-convex function there are local minima in addition to the global minimum, and finding the global minimum is a difficult task. With a convex cost function, gradient descent can reach the global minimum.
To find the extrema of a function, set its first derivative to zero and solve for $x$; the solutions are the candidate points. To determine whether each candidate is a minimum or a maximum, take the second derivative there: a positive value indicates a local minimum, a negative value a local maximum. Comparing the function values at the candidates then identifies the global minimum.
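As a small illustration (my own sketch, not part of the answer above; the function $x^3 - 3x$ is an arbitrary example), this procedure can be automated with sympy:

```python
# Find critical points of f(x) = x**3 - 3*x and classify them
# with the second-derivative test.
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x

critical_points = sp.solve(sp.diff(f, x), x)  # solve f'(x) = 0
for p in critical_points:
    second = sp.diff(f, x, 2).subs(x, p)      # evaluate f''(p)
    if second > 0:
        print(p, 'local minimum')
    elif second < 0:
        print(p, 'local maximum')
    else:
        print(p, 'test inconclusive')
```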
The cost function used in Logistic Regression is Log Loss.
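For reference, with $m$ training examples, labels $y^{(i)} \in \{0, 1\}$, and hypothesis $h_\theta(x) = 1/(1 + e^{-\theta^T x})$, log loss is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right],$$

which is convex in $\theta$.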
A local minimum of a function is a point where the function value is smaller than at nearby points, but possibly greater than at a distant point. A global minimum is a point where the function value is smaller than at all other feasible points.
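To make the distinction concrete (again my own sketch, not from the answer), gradient descent on the non-convex quartic $f(x) = x^4 - 3x^2 + x$ ends up in different minima depending on where it starts:

```python
# Gradient descent on a non-convex function: f(x) = x**4 - 3*x**2 + x.
# It has two minima, and the starting point decides which one we find.
def grad(x):
    return 4 * x**3 - 6 * x + 1  # f'(x)

def descend(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(-2.0))  # ~ -1.30: the global minimum, f ≈ -3.51
print(descend(+2.0))  # ~ +1.13: only a local minimum, f ≈ -1.07
```

This is exactly the failure mode a convex cost function rules out.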
The fact that we use a convex cost function does not guarantee a convex problem.
There is a distinction between a convex cost function and a convex method.
The typical cost functions you encounter (cross entropy, absolute loss, least squares) are designed to be convex.
However, the convexity of the problem depends also on the type of ML algorithm you use.
Linear models (linear regression, logistic regression, etc.) give you convex optimization problems, so gradient descent converges to the global minimum. With neural networks that have hidden layers, however, the objective is no longer convex in the parameters, and you are not guaranteed to find the global minimum.
Thus, convexity describes your method (the model combined with the loss), not just your cost function!
LR is a linear classification method, so you get a convex optimization problem each time you use it! However, convexity only guarantees that you find the best parameters for a linear decision boundary; if the data is not linearly separable, even that best boundary may classify poorly.
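Putting the pieces together, here is a minimal NumPy sketch (my own toy example, with synthetic data and slightly noisy labels so the problem is not perfectly separable) showing that on the convex log loss the starting point does not matter:

```python
# Gradient descent for logistic regression on toy data.
# Log loss is convex in theta for a linear model, so runs started
# from very different points converge to (nearly) the same theta.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                          # toy features
noise = rng.normal(scale=0.5, size=100)                # keeps data non-separable
y = (X[:, 0] + X[:, 1] + noise > 0).astype(float)      # toy labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(theta, lr=0.1, steps=5000):
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # gradient of log loss
        theta = theta - lr * grad
    return theta

print(fit(np.zeros(2)))            # same minimum...
print(fit(np.array([5.0, -5.0])))  # ...from a very different start
```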