
Using R for multi-class logistic regression

Short format:

How can multi-class logistic regression classification be implemented via gradient descent in R? Can optim() be used when there are more than two labels?

The MATLAB code is:

function [J, grad] = cost(theta, X, y, lambda)
    % Regularized logistic regression cost and gradient for one binary problem.
    m = length(y);                      % number of training examples
    grad = zeros(size(theta));
    h_theta = sigmoid(X * theta);       % predicted probabilities
    % Cross-entropy cost plus L2 penalty; the intercept theta(1) is not penalized:
    J = (-1/m)*sum(y.*log(h_theta) + (1-y).*log(1-h_theta)) + ...
        (lambda/(2*m))*sum(theta(2:end).^2);
    trans = X';
    grad(1) = (1/m)*(trans(1,:))*(h_theta - y);
    grad(2:end) = (1/m)*(trans(2:end,:)*(h_theta - y) + ...
        lambda*theta(2:end));
    grad = grad(:);
end

and...

function [all_theta] = oneVsAll(X, y, num_labels, lambda)
    % Train num_labels regularized logistic regression classifiers,
    % one per class ("one-vs-all"); row c of all_theta holds classifier c.
    m = size(X, 1);
    n = size(X, 2);
    all_theta = zeros(num_labels, n + 1);
    initial_theta = zeros(n + 1, 1);
    X = [ones(m, 1) X];                 % prepend the intercept column
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    for c = 1:num_labels
        % Binary relabeling: (y == c) is 1 for class c, 0 for all others.
        [theta] = fmincg(@(t)(cost(t, X, (y == c), lambda)), ...
                         initial_theta, options);
        all_theta(c,:) = theta';
    end
end

Long format:

Although probably not needed to follow the question, the dataset can be downloaded here; once placed in the R working directory, it can be loaded as:

library(R.matlab)
data <- readMat('data.mat')
str(data)
List of 2
 $ X: num [1:5000, 1:400] 0 0 0 0 0 0 0 0 0 0 ...
 $ y: num [1:5000, 1] 10 10 10 10 10 10 10 10 10 10 ...

So X is a matrix with 5,000 examples, each containing 400 features, which happen to be the 400 pixels of a 20 x 20 image of a handwritten digit (labeled 1 to 10 in the data, with 10 standing for the digit 0), as for example this 9:

[Image: a 20 x 20 greyscale rendering of a handwritten 9]
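
For reference, any row of X can be rendered as such an image along these lines (a minimal sketch; the row index 4000 is an arbitrary pick, and all names come from the loading code above):

    pixels <- matrix(data$X[4000, ], nrow = 20)  # column-major reshape, as in MATLAB
    # image() draws matrices rotated relative to how we read them, so transpose
    # and flip; adjust the flip if the digit appears mirrored.
    image(t(pixels)[, 20:1], col = grey.colors(256), axes = FALSE)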

Applying a logistic regression algorithm to predict the handwritten number based on "computer vision" of the values in these 400 pixels presents the additional challenge of not being a binary decision. Optimizing the coefficients is unlikely to be efficient with an ad hoc gradient descent loop as in this R-bloggers example.

There is a nicely worked out example, also on R-bloggers, based on two explanatory variables (features) and a dichotomous outcome. The example uses the optim() R function, which seems to be the way to go.
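
In that binary setting the pattern is roughly the following (a minimal sketch on made-up data; plogis() is base R's logistic function, and all names are illustrative):

    # Negative log-likelihood of logistic regression; theta is a plain
    # numeric vector, which is exactly what optim() expects.
    binary_cost <- function(theta, X, y) {
        h <- plogis(X %*% theta)                     # sigmoid
        -mean(y * log(h) + (1 - y) * log(1 - h))     # cross-entropy
    }
    set.seed(1)
    Xb <- cbind(1, matrix(rnorm(200), ncol = 2))     # intercept + 2 features
    yb <- rbinom(100, 1, plogis(Xb %*% c(-1, 2, 3))) # dichotomous outcome
    fit <- optim(par = rep(0, 3), fn = binary_cost, X = Xb, y = yb)
    fit$par                                          # estimated coefficients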

Even though I have read the documentation, I am having problems setting up this more complex example, where we have to decide among 10 possible outcomes:

    library(R.matlab)
    data <- readMat('data.mat')

    X = data$X                 # The values of the pixels in all 5000 examples.
    y = data$y                 # The actual correct labels for each example.
    y = replace(y, y == 10, 0) # Replacing 10 with 0 for simplicity.

    # Defining the sigmoid function for logistic regression.
    sigmoid = function(z){
        1 / (1 + exp(-z))
    }

    X = cbind(rep(1, nrow(X)), X) # Adding an intercept or bias term (column of 1's).

    # Defining the regularized cost function parametrized by the coefficients.
    cost = function(theta){
        hypothesis = sigmoid(X %*% theta)
        # In "J" below we will need to have 10 columns of y:
        y = as.matrix(model.matrix(lm(y ~ as.factor(y))))
        m = nrow(y)
        lambda = 0.1
        # The regularized cost function is:
        J = (1/m) * sum(-y * log(hypothesis) - (1 - y) * log(1 - hypothesis)) +
            (lambda/(2 * m)) * sum(theta[2:nrow(theta), 1]^2)
        J
    }

    no.pixels_plus1 = ncol(X)     # The number of columns of X: pixels plus intercept.
    no.digits = length(unique(y)) # The number of labels (10).
    # Coefficient matrix after the transpose: rows = pixels plus intercept (401);
    # cols = labels (10):
    theta_matrix = t(matrix(rep(0, no.digits*no.pixels_plus1), nrow = no.digits))
    cost(theta_matrix) # The initial cost:
    # [1] 0.6931472
    theta_optim = optim(par = theta_matrix, fn = cost) # This is the PROBLEM step!

Evidently this is incomplete, and it gives me the error message:

 Error in X %*% theta : non-conformable arguments 

Notice that X %*% theta_matrix is carried out without any issues. So the problem has to lie in the fact that I have 10 classifiers (0 to 9), and that I'm forced to create a matrix with 10 y column vectors to make the operations inside cost feasible. It is possible that the solution goes through dummy-coding the y vector with a line like y = as.matrix(model.matrix(lm(y ~ as.factor(y)))), as in my non-working code above, but yet again I don't know that this captures the "one-versus-all" idea - probably it doesn't, and probably this is the issue.
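
Incidentally, a cleaner way to build those 10 indicator columns than routing y through lm() would be something like this (a sketch; it yields a 5000 x 10 matrix of 0/1 dummies, whereas model.matrix on an lm fit keeps an all-ones intercept as its first column):

    # One 0/1 indicator column per digit; entry [i, d + 1] is 1 exactly
    # when example i is the digit d.
    Y <- sapply(0:9, function(d) as.numeric(y == d))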

Otherwise, nearly identical code works in the R-bloggers post with a binary classifier.

So what is the right syntax for this problem?

Note that I have tried working it out one digit against all the others, but I don't think that makes sense in terms of complexity.

asked May 31 '16 by Antoni Parellada


People also ask

Can logistic regression be used for multiple classes?

Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.

What is multinomial logistic regression in R?

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.

How does Multinom work in R?

The multinom function accepts a model formula where the outcome is either a factor giving the response category of each observation, or a matrix of counts in the various categories. This is a direct generalization of the way logit models work in R.
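
For this dataset, that ready-made route might look roughly like the sketch below (assuming the matrices as loaded from data.mat, before any intercept column is added; multinom() lives in the nnet package):

    library(nnet)                       # provides multinom()
    # MaxNWts is raised because the fit has about (400 + 1) * 10 = 4010
    # weights, which exceeds nnet's default limit of 1000.
    fit <- multinom(factor(data$y) ~ data$X, MaxNWts = 5000)
    head(predict(fit))                  # fitted class labels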

How do you write a multinomial logistic regression equation?

Also, it gives a good insight into what multinomial logistic regression is: a set of J - 1 independent logistic regressions for the probability of Y = j versus the probability of the reference Y = J:

    p_j(x) = exp(beta_0j + beta_1j*x_1 + ... + beta_pj*x_p) * p_J(x)




1 Answer

The theta you feed to optim must be a vector. You can turn it into a matrix within the cost function.

See previous question here: How to get optim working with matrix multiplication inside the function to be maximized in R
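
Following that advice, a minimal sketch of the one-vs-all fix could look like this (it reuses sigmoid, the intercept-augmented X, and the 0-9 relabeled y from the question; names like all_theta are just illustrative):

    # Cost and gradient for ONE binary "digit c vs. rest" problem.
    # theta is a plain numeric vector, as optim() requires.
    cost <- function(theta, X, y, lambda) {
        m <- nrow(X)
        h <- sigmoid(X %*% theta)
        (1/m) * sum(-y * log(h) - (1 - y) * log(1 - h)) +
            (lambda/(2*m)) * sum(theta[-1]^2)       # intercept not penalized
    }
    grad <- function(theta, X, y, lambda) {
        m <- nrow(X)
        h <- sigmoid(X %*% theta)
        g <- (1/m) * (t(X) %*% (h - y))
        g[-1] <- g[-1] + (lambda/m) * theta[-1]     # regularize all but intercept
        as.vector(g)
    }

    lambda <- 0.1
    # One optim() call per digit, mirroring the MATLAB oneVsAll loop;
    # BFGS uses the analytic gradient, much as fmincg does.
    all_theta <- sapply(0:9, function(c) {
        optim(par = rep(0, ncol(X)), fn = cost, gr = grad,
              X = X, y = as.numeric(y == c), lambda = lambda,
              method = "BFGS", control = list(maxit = 50))$par
    })
    # all_theta is 401 x 10: one column of coefficients per digit.
    pred <- max.col(X %*% all_theta) - 1   # highest score wins; col j is digit j - 1
    mean(pred == y)                        # training accuracy

Alternatively, all 10 x 401 coefficients can be optimized in a single optim() call by passing them as one long vector and reshaping with matrix(theta, ncol = 10) at the top of the cost function, which is the approach taken in the linked question.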

answered Oct 11 '22 by Philippe Marchand