Short format:
How to implement multi-class logistic regression classification algorithms via gradient descent in R? Can optim() be used when there are more than two labels?
The MATLAB code is:
function [J, grad] = cost(theta, X, y, lambda)
    % Regularized logistic-regression cost and gradient for one classifier.
    m = length(y);
    grad = zeros(size(theta));
    h_theta = sigmoid(X * theta);
    J = (-1/m) * sum(y .* log(h_theta) + (1 - y) .* log(1 - h_theta)) + ...
        (lambda/(2*m)) * sum(theta(2:length(theta)).^2);
    trans = X';
    grad(1) = (1/m) * (trans(1, :)) * (h_theta - y);
    grad(2:size(theta, 1)) = (1/m) * (trans(2:size(trans, 1), :) * (h_theta - y) + ...
        lambda * theta(2:size(theta, 1), :));
    grad = grad(:);
end
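For reference, a hedged R translation of this cost function (the names cost_R and grad_R are mine, and a sigmoid() like the one defined in the long format below is assumed). optim() takes the objective and its gradient as two separate arguments, fn and gr, so the single MATLAB function is split in two:

# Regularized cost; theta[1] is the intercept and is not penalized,
# matching the MATLAB code above.
cost_R <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  (-1/m) * sum(y * log(h) + (1 - y) * log(1 - h)) +
    (lambda/(2 * m)) * sum(theta[-1]^2)
}

# Matching gradient, suitable for optim()'s gr argument.
grad_R <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  g <- (1/m) * (t(X) %*% (h - y))
  g[-1] <- g[-1] + (lambda / m) * theta[-1]
  as.vector(g)
}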
and...
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
    m = size(X, 1);
    n = size(X, 2);
    all_theta = zeros(num_labels, n + 1);
    initial_theta = zeros(n + 1, 1);
    X = [ones(m, 1) X];
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    for c = 1:num_labels
        % One binary classifier per label: digit c versus all the rest.
        [theta] = fmincg(@(t)(cost(t, X, (y == c), lambda)), ...
            initial_theta, options);
        all_theta(c, :) = theta';
    end
end
Long format:
Although probably not needed to follow the question, the dataset can be downloaded here; once placed in the R working directory, it loads as:
library(R.matlab)
data <- readMat('data.mat')
str(data)
List of 2
$ X: num [1:5000, 1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ y: num [1:5000, 1] 10 10 10 10 10 10 10 10 10 10 ...
So X is a matrix with 5,000 examples, each containing 400 features, which happen to be the 400 pixels of a 20 x 20 image of a handwritten digit (labels run from 1 to 10, with 10 standing for the digit 0), as for example this 9:
[image: a 20 x 20 grayscale rendering of a handwritten 9]
Applying a logistic regression algorithm to predict the handwritten number based on "computer vision" of the values in these 400 pixels presents the additional challenge of not being a binary decision. Optimizing the coefficients is unlikely to be efficient with an ad hoc gradient descent loop as in this R-bloggers example.
There is also a nicely worked-out example in R-bloggers based on two explanatory variables (features) and a dichotomous outcome. The example uses the optim() R function, which seems to be the way to go.
Even though I have read the documentation, I am having problems setting up this more complex example, where we have to decide among 10 possible outcomes:
library(R.matlab)
data <- readMat('data.mat')

X = data$X # These are the values for the pixels in all 5,000 examples.
y = data$y # These are the actual correct labels for each example.
y = replace(y, y == 10, 0) # Replacing 10 with 0 for simplicity.

# Defining the sigmoid function for logistic regression.
sigmoid = function(z){
  1 / (1 + exp(-z))
}

X = cbind(rep(1, nrow(X)), X) # Adding an intercept or bias term (column of 1's).

# Defining the regularized cost function parametrized by the coefficients.
cost = function(theta){
  hypothesis = sigmoid(X %*% theta)
  # In "J" below we will need to have 10 columns of y:
  y = as.matrix(model.matrix(lm(y ~ as.factor(y))))
  m = nrow(y)
  lambda = 0.1
  # The regularized cost function is:
  J = (1/m) * sum(-y * log(hypothesis) - (1 - y) * log(1 - hypothesis)) +
    (lambda/(2 * m)) * sum(theta[2:nrow(theta), 1]^2)
  J
}

no.pixels_plus1 = ncol(X)     # The 400 pixel columns plus the intercept (401).
no.digits = length(unique(y)) # The number of labels (10).

# Coefficient matrix: rows = pixels plus intercept; cols = no. of labels:
theta_matrix = t(matrix(rep(0, no.digits * no.pixels_plus1), nrow = no.digits))

cost(theta_matrix) # The initial cost:
# [1] 0.6931472

theta_optim = optim(par = theta_matrix, fn = cost) # This is the PROBLEM step!
Evidently this is incomplete, and it gives me the error message:
Error in X %*% theta : non-conformable arguments
Notice that X %*% theta_matrix is carried out without any issues. So the problem has to lie in the fact that I have 10 classifiers (0 to 9), and that I'm forced to create a matrix with 10 y column vectors to make the operations feasible within the function cost. It is possible that the solution goes through dummy-coding the y vector with some line like y = as.matrix(model.matrix(lm(y ~ as.factor(y)))), as in my non-working code above, but then again I don't know that this encapsulates the "one-versus-all" idea - OK, probably not, and probably this is the issue.
Otherwise, essentially identical code works in the R-bloggers post with a binary classifier.
So what is the right syntax for this problem?
Note that I have tried working it out one digit against all the others, but I don't think that makes sense in terms of complexity.
Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.
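Concretely, a minimal sketch of that one-vs-rest transformation with optim(), assuming X already carries the intercept column, y holds the labels 0-9, and cost_R()/grad_R() are defined as in the translation sketched after the MATLAB code (one_vs_all_R is a made-up name):

one_vs_all_R <- function(X, y, num_labels, lambda) {
  n <- ncol(X)  # X already includes the column of 1's
  all_theta <- matrix(0, nrow = num_labels, ncol = n)
  for (c in 0:(num_labels - 1)) {
    # Each pass solves one binary problem: digit c versus everything else.
    fit <- optim(par = rep(0, n), fn = cost_R, gr = grad_R,
                 X = X, y = as.numeric(y == c), lambda = lambda,
                 method = "BFGS", control = list(maxit = 50))
    all_theta[c + 1, ] <- fit$par
  }
  all_theta
}

all_theta <- one_vs_all_R(X, y, num_labels = 10, lambda = 0.1)
pred <- max.col(sigmoid(X %*% t(all_theta))) - 1  # most confident digit, 0-9

Passing gr = grad_R plays the role of optimset('GradObj', 'on'), and method = "BFGS" with maxit = 50 is one reasonable stand-in for the course's fmincg.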
Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
The multinom function, from the nnet package, accepts a model formula where the outcome is either a factor vector representing the response categories or a matrix with the counts in the various categories, which is the case for us. This is a direct generalization of the way logit models work in R.
Also, it gives good insight into what multinomial logistic regression is: a set of J - 1 independent logistic regressions for the probability of Y = j versus the probability of the reference category Y = J:

p_j(x) = e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{pj} X_p} \, p_J(x)
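As a hedged sketch of that route on this dataset (assuming y is the label vector with 10 already recoded to 0, and X is the raw 5,000 x 400 pixel matrix without the manually added intercept column, since the formula interface adds its own):

library(nnet)
df <- data.frame(y = factor(y), X)
# (400 pixels + 1 intercept) x (10 - 1) logits = 3,609 weights, above
# nnet's default cap of 1,000, so MaxNWts must be raised.
fit <- multinom(y ~ ., data = df, MaxNWts = 5000, maxit = 100)
pred <- predict(fit, df)  # predicted digit labels, as a factor
mean(pred == df$y)        # training accuracy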
The theta you feed to optim must be a vector. You can turn it into a matrix within the cost function.
See previous question here: How to get optim working with matrix multiplication inside the function to be maximized in R
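Concretely, a sketch of that vector-in, matrix-inside pattern, reusing the variables from the question; the one-hot construction with diag() is my stand-in for the model.matrix(lm(...)) line:

cost_vec <- function(theta_vec) {
  # optim() passes a plain vector, so rebuild the 401 x 10 matrix first.
  theta <- matrix(theta_vec, nrow = no.pixels_plus1, ncol = no.digits)
  hypothesis <- sigmoid(X %*% theta)         # 5000 x 10, now conformable
  Y <- diag(no.digits)[as.vector(y) + 1, ]   # one-hot encoding of digits 0-9
  m <- nrow(X)
  lambda <- 0.1
  (1/m) * sum(-Y * log(hypothesis) - (1 - Y) * log(1 - hypothesis)) +
    (lambda/(2 * m)) * sum(theta[-1, ]^2)
}

theta_optim <- optim(par = as.vector(theta_matrix), fn = cost_vec,
                     method = "BFGS", control = list(maxit = 50))

Because this cost is a sum of 10 independent sigmoid costs (one per column of Y), minimizing it jointly is equivalent to fitting the 10 one-vs-all classifiers separately.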