Short format:
How to implement multi-class logistic regression classification algorithms via gradient descent in R? Can optim() be used when there are more than two labels?
The MATLAB code is:
function [J, grad] = cost(theta, X, y, lambda)
    % Regularized logistic-regression cost and gradient for one classifier.
    m = length(y);
    grad = zeros(size(theta));
    h_theta = sigmoid(X * theta);
    J = (-1/m) * sum(y .* log(h_theta) + (1 - y) .* log(1 - h_theta)) + ...
        (lambda/(2*m)) * sum(theta(2:length(theta)).^2);
    trans = X';
    grad(1) = (1/m) * (trans(1, :)) * (h_theta - y);
    grad(2:size(theta, 1)) = (1/m) * (trans(2:size(trans, 1), :) * (h_theta - y) + ...
        lambda * theta(2:size(theta, 1), :));
    grad = grad(:);
end
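For reference, a hedged R translation of this cost function (the names cost_R and grad_R are mine, and a sigmoid() like the one defined in the long format below is assumed). optim() takes the objective and its gradient as two separate arguments, fn and gr, so the single MATLAB function is split in two:

# Regularized cost; theta[1] is the intercept and is not penalized,
# matching the MATLAB code above.
cost_R <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  (-1/m) * sum(y * log(h) + (1 - y) * log(1 - h)) +
    (lambda/(2 * m)) * sum(theta[-1]^2)
}

# Matching gradient, suitable for optim()'s gr argument.
grad_R <- function(theta, X, y, lambda) {
  m <- length(y)
  h <- sigmoid(X %*% theta)
  g <- (1/m) * (t(X) %*% (h - y))
  g[-1] <- g[-1] + (lambda / m) * theta[-1]
  as.vector(g)
}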
and...
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
    m = size(X, 1);
    n = size(X, 2);
    all_theta = zeros(num_labels, n + 1);
    initial_theta = zeros(n + 1, 1);
    X = [ones(m, 1) X];
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    for c = 1:num_labels
        % One binary classifier per label: digit c versus all the rest.
        [theta] = fmincg(@(t)(cost(t, X, (y == c), lambda)), ...
            initial_theta, options);
        all_theta(c, :) = theta';
    end
end
Long format:
Although probably not needed to follow the question, the dataset can be downloaded here; once placed in the R working directory, it loads as:
library(R.matlab)
data <- readMat('data.mat')
str(data)
List of 2
$ X: num [1:5000, 1:400] 0 0 0 0 0 0 0 0 0 0 ...
$ y: num [1:5000, 1] 10 10 10 10 10 10 10 10 10 10 ...
So X is a matrix with 5,000 examples, each containing 400 features, which happen to be the 400 pixels of a 20 x 20 image of a handwritten digit (labels run from 1 to 10, with 10 standing for the digit 0), as for example this 9:
[image: a 20 x 20 grayscale rendering of a handwritten 9]
Applying a logistic regression algorithm to predict the handwritten number based on "computer vision" of the values in these 400 pixels presents the additional challenge of not being a binary decision. Optimizing the coefficients is unlikely to be efficient with an ad hoc gradient descent loop as in this R-bloggers example.
There is also a nicely worked-out example in R-bloggers based on two explanatory variables (features) and a dichotomous outcome. The example uses the optim() R function, which seems to be the way to go.
Even though I have read the documentation, I am having problems setting up this more complex example, where we have to decide among 10 possible outcomes:
library(R.matlab)
data <- readMat('data.mat')

X = data$X # These are the values for the pixels in all 5,000 examples.
y = data$y # These are the actual correct labels for each example.
y = replace(y, y == 10, 0) # Replacing 10 with 0 for simplicity.

# Defining the sigmoid function for logistic regression.
sigmoid = function(z){
  1 / (1 + exp(-z))
}

X = cbind(rep(1, nrow(X)), X) # Adding an intercept or bias term (column of 1's).

# Defining the regularized cost function parametrized by the coefficients.
cost = function(theta){
  hypothesis = sigmoid(X %*% theta)
  # In "J" below we will need to have 10 columns of y:
  y = as.matrix(model.matrix(lm(y ~ as.factor(y))))
  m = nrow(y)
  lambda = 0.1
  # The regularized cost function is:
  J = (1/m) * sum(-y * log(hypothesis) - (1 - y) * log(1 - hypothesis)) +
    (lambda/(2 * m)) * sum(theta[2:nrow(theta), 1]^2)
  J
}

no.pixels_plus1 = ncol(X)     # The 400 pixel columns plus the intercept (401).
no.digits = length(unique(y)) # The number of labels (10).

# Coefficient matrix: rows = pixels plus intercept; cols = no. of labels:
theta_matrix = t(matrix(rep(0, no.digits * no.pixels_plus1), nrow = no.digits))

cost(theta_matrix) # The initial cost:
# [1] 0.6931472

theta_optim = optim(par = theta_matrix, fn = cost) # This is the PROBLEM step!
Evidently this is incomplete, and it gives me the error message:
Error in X %*% theta : non-conformable arguments
Notice that X %*% theta_matrix is carried out without any issues. So the problem has to lie in the fact that I have 10 classifiers (0 to 9), and that I'm forced to create a matrix with 10 y column vectors to make the operations feasible within the function cost. It is possible that the solution goes through dummy-coding the y vector with some line like y = as.matrix(model.matrix(lm(y ~ as.factor(y)))), as in my non-working code above, but then again I don't know that this encapsulates the "one-versus-all" idea - OK, probably not, and probably this is the issue.
Otherwise, essentially identical code works in the R-bloggers post with a binary classifier.
So what is the right syntax for this problem?
Note that I have tried working it out one digit against all the others, but I don't think that makes sense in terms of complexity.
Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.
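Concretely, a minimal sketch of that one-vs-rest transformation with optim(), assuming X already carries the intercept column, y holds the labels 0-9, and cost_R()/grad_R() are defined as in the translation sketched after the MATLAB code (one_vs_all_R is a made-up name):

one_vs_all_R <- function(X, y, num_labels, lambda) {
  n <- ncol(X)  # X already includes the column of 1's
  all_theta <- matrix(0, nrow = num_labels, ncol = n)
  for (c in 0:(num_labels - 1)) {
    # Each pass solves one binary problem: digit c versus everything else.
    fit <- optim(par = rep(0, n), fn = cost_R, gr = grad_R,
                 X = X, y = as.numeric(y == c), lambda = lambda,
                 method = "BFGS", control = list(maxit = 50))
    all_theta[c + 1, ] <- fit$par
  }
  all_theta
}

all_theta <- one_vs_all_R(X, y, num_labels = 10, lambda = 0.1)
pred <- max.col(sigmoid(X %*% t(all_theta))) - 1  # most confident digit, 0-9

Passing gr = grad_R plays the role of optimset('GradObj', 'on'), and method = "BFGS" with maxit = 50 is one reasonable stand-in for the course's fmincg.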
Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
The multinom function, from the nnet package, accepts a model formula where the outcome is either a factor vector representing the response categories or a matrix with the counts in the various categories, which is the case for us. This is a direct generalization of the way logit models work in R.
Also, it gives good insight into what multinomial logistic regression is: a set of J - 1 independent logistic regressions for the probability of Y = j versus the probability of the reference category Y = J:

p_j(x) = e^{\beta_{0j} + \beta_{1j} X_1 + \cdots + \beta_{pj} X_p} \, p_J(x)
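As a hedged sketch of that route on this dataset (assuming y is the label vector with 10 already recoded to 0, and X is the raw 5,000 x 400 pixel matrix without the manually added intercept column, since the formula interface adds its own):

library(nnet)
df <- data.frame(y = factor(y), X)
# (400 pixels + 1 intercept) x (10 - 1) logits = 3,609 weights, above
# nnet's default cap of 1,000, so MaxNWts must be raised.
fit <- multinom(y ~ ., data = df, MaxNWts = 5000, maxit = 100)
pred <- predict(fit, df)  # predicted digit labels, as a factor
mean(pred == df$y)        # training accuracy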
The theta you feed to optim must be a vector. You can turn it into a matrix within the cost function.
See previous question here: How to get optim working with matrix multiplication inside the function to be maximized in R
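Concretely, a sketch of that vector-in, matrix-inside pattern, reusing the variables from the question; the one-hot construction with diag() is my stand-in for the model.matrix(lm(...)) line:

cost_vec <- function(theta_vec) {
  # optim() passes a plain vector, so rebuild the 401 x 10 matrix first.
  theta <- matrix(theta_vec, nrow = no.pixels_plus1, ncol = no.digits)
  hypothesis <- sigmoid(X %*% theta)         # 5000 x 10, now conformable
  Y <- diag(no.digits)[as.vector(y) + 1, ]   # one-hot encoding of digits 0-9
  m <- nrow(X)
  lambda <- 0.1
  (1/m) * sum(-Y * log(hypothesis) - (1 - Y) * log(1 - hypothesis)) +
    (lambda/(2 * m)) * sum(theta[-1, ]^2)
}

theta_optim <- optim(par = as.vector(theta_matrix), fn = cost_vec,
                     method = "BFGS", control = list(maxit = 50))

Because this cost is a sum of 10 independent sigmoid costs (one per column of Y), minimizing it jointly is equivalent to fitting the 10 one-vs-all classifiers separately.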