Linear regression poor gradient descent performance

I have implemented a simple linear regression (univariate for now) example in C++ to help me get my head around the concepts. I'm pretty sure that the key algorithm is right, but the performance is terrible.

This is the method which actually performs the gradient descent:

void LinearRegression::BatchGradientDescent(std::vector<std::pair<int,int>>& data, float& theta1, float& theta2)
{
    // 1/m, where m is the number of training examples
    float weight = 1.0f / static_cast<float>(data.size());
    float theta1Res = 0.0f;
    float theta2Res = 0.0f;

    // Accumulate the gradient of the cost over all examples
    for(auto p : data)
    {
        float cost = Hypothesis(p.first, theta1, theta2) - p.second;
        theta1Res += cost;            // d/dtheta1: sum of residuals
        theta2Res += cost * p.first;  // d/dtheta2: residuals weighted by x
    }

    // Simultaneous update of both parameters
    theta1 = theta1 - (m_LearningRate * weight * theta1Res);
    theta2 = theta2 - (m_LearningRate * weight * theta2Res);
}

With the other key functions given as:

float LinearRegression::Hypothesis(float x, float theta1, float theta2) const
{
    // h(x) = theta1 + theta2 * x
    return theta1 + x * theta2;
}


float LinearRegression::CostFunction(std::vector<std::pair<int,int>>& data,
                                     float theta1,
                                     float theta2) const
{
    float error = 0.0f;
    for(auto p : data)
    {
        // Residual between prediction and target
        float prediction = Hypothesis(p.first, theta1, theta2) - p.second;
        error += prediction * prediction;
    }

    // Mean squared error, halved by convention: J = (1/2m) * sum(residual^2)
    error *= 1.0f / (data.size() * 2.0f);
    return error;
}

void LinearRegression::Regress(std::vector<std::pair<int,int>> & data)
{
    for(unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr)
    {
       BatchGradientDescent(data,m_Theta1,m_Theta2);
       //Some visualisation code
    }
}

Now the issue is that if the learning rate is greater than around 0.000001, the value of the cost function after gradient descent is higher than it was before. That is to say, the algorithm is working in reverse. The line forms into a straight line through the origin pretty quickly, but it then takes millions of iterations to actually reach a reasonably good fit.

With a learning rate of 0.01, the output after six iterations is (where difference is costAfter - costBefore):

Cost before 102901.945312, cost after 517539430400.000000, difference 517539332096.000000
Cost before 517539430400.000000, cost after 3131945127824588800.000000, difference 3131944578068774912.000000
Cost before 3131945127824588800.000000, cost after 18953312418560698826620928.000000, difference 18953308959796185006080000.000000
Cost before 18953312418560698826620928.000000, cost after 114697949347691988409089177681920.000000, difference 114697930004878874575022382383104.000000
Cost before 114697949347691988409089177681920.000000, cost after inf, difference inf
Cost before inf, cost after inf, difference nan

In this example the thetas are set to zero, the learning rate to 0.000001, and there are 8,000,000 iterations! The visualisation code only updates the graph after every 100,000 iterations.

[Plot of the data points and the fitted line omitted]

Function which creates the data points:

// Requires <cstdlib> for srand()/rand() and <ctime> for time()
static void SetupRegressionData(std::vector<std::pair<int,int>>& data)
{
    srand(time(NULL));

    // x ranges from roughly 50 to 850, y from 400 to 500
    for(int x = 50; x < 750; x += 3)
    {
        data.push_back(std::pair<int,int>(x + (rand() % 100), 400 + (rand() % 100)));
    }
}

In short: if my learning rate is too high, the gradient descent algorithm effectively runs backwards and tends to infinity, and if it is lowered to the point where it actually converges towards a minimum, the number of iterations required to do so is unacceptably high.

Have I missed anything/made a mistake in the core algorithm?

asked May 01 '15 by Davors72


1 Answer

Looks like everything is behaving as expected, but you are having problems selecting a reasonable learning rate. That's not a totally trivial problem, and there are many approaches, ranging from pre-defined schedules that progressively reduce the learning rate (see e.g. this paper) to adaptive methods such as AdaGrad or AdaDelta.
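
For illustration, here is a minimal sketch of the schedule idea, a simple step decay. The function name, the interval, and the halving factor are all invented example values, not part of your code:

    #include <cmath>

    // Step decay: halve the learning rate every `dropEvery` iterations.
    // The initial rate, interval and 0.5f factor are arbitrary example values.
    float ScheduledLearningRate(float initialRate,
                                unsigned int iteration,
                                unsigned int dropEvery)
    {
        unsigned int drops = iteration / dropEvery;
        return initialRate * std::pow(0.5f, static_cast<float>(drops));
    }

Inside Regress you would then derive the rate from the iteration counter, e.g. ScheduledLearningRate(initialRate, itr, 1000), instead of keeping m_LearningRate fixed.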

For your vanilla implementation with a fixed learning rate, you should make your life easier by normalising the data to zero mean and unit standard deviation before feeding it into the gradient descent algorithm. That way you will be able to reason about the learning rate more easily. Then you can just rescale your prediction accordingly.
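
As a sketch of what that could look like (assuming you switch the container to std::pair<float,float>, since standardised values are fractional; NormaliseX is an invented helper, not part of your class):

    #include <cmath>
    #include <utility>
    #include <vector>

    // Standardise the x values to zero mean and unit standard deviation.
    // Returns {mean, stddev} so predictions can be mapped back afterwards.
    static std::pair<float,float> NormaliseX(std::vector<std::pair<float,float>>& data)
    {
        float mean = 0.0f;
        for(const auto& p : data) mean += p.first;
        mean /= data.size();

        float variance = 0.0f;
        for(const auto& p : data) variance += (p.first - mean) * (p.first - mean);
        float stddev = std::sqrt(variance / data.size());

        for(auto& p : data) p.first = (p.first - mean) / stddev;
        return std::make_pair(mean, stddev);
    }

After running gradient descent on the normalised data, the fit theta1 + theta2 * (x - mean) / stddev is equivalent to an intercept of theta1 - theta2 * mean / stddev and a slope of theta2 / stddev on the original scale, so mapping the result back is a one-liner.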

answered Oct 14 '22 by DG2