
How to improve performance without going parallel for my backprop ANN

After profiling my backpropagation algorithm, I have learnt that it is responsible for 60% of my computation time. Before I start looking at parallel alternatives, I would like to see if there is anything further I can do.

The activate(const double input[]) function is profiled to take only ~5% of the time. The gradient(const double input) function is implemented as follows:

inline double gradient(const double input) { return (1 - (input * input)); }

The training function in question:

void train(const vector<double>& data, const vector<double>& desired, const double learn_rate, const double momentum) {
    this->activate(data);
    this->calculate_error(desired);

    // adjust weights for layers
    const auto n_layers = this->config.size();
    const auto adjustment = (1 - momentum) * learn_rate;

    for (size_t i = 1; i < n_layers; ++i) {
        // layer 1 reads from the raw input, later layers from the previous layer's outputs
        const auto& inputs = i - 1 > 0 ? this->outputs[i - 1] : data;
        const auto n_inputs = this->config[i - 1];
        const auto n_neurons = this->config[i];

        for (auto j = 0; j < n_neurons; ++j) {
            const auto adjusted_error = adjustment * this->errors[i][j];

            for (auto k = 0; k < n_inputs; ++k) {
                const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);

                this->deltas[i][j][k] = delta;
                this->weights[i][j][k] += delta;
            }

            // bias weight, stored after the regular inputs
            const auto delta = adjusted_error * this->bias + (momentum * this->deltas[i][j][n_inputs]);

            this->deltas[i][j][n_inputs] = delta;
            this->weights[i][j][n_inputs] += delta;
        }
    }
}

This question may be better suited to https://codereview.stackexchange.com/.

asked Apr 19 '11 by deceleratedcaviar


1 Answer

You can't avoid the O(n^2) work if you want to train/use a NN, but it is perfectly suited to vector arithmetic. For example, with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a multiply-add instead of two separate instructions.

If you use a modern compiler, carefully reformulate the algorithm, and pass the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.
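As a rough sketch of the kind of reformulation the vectorizer likes (the function name and the raw-pointer arguments are mine, not part of your class): keep the innermost loop a straight run over contiguous doubles, with no aliasing and no branches.

#include <cstddef>

// Hypothetical helper: one neuron's weight/delta update as a flat loop.
// With __restrict__ telling gcc the rows don't alias, this loop is a
// textbook candidate for autovectorization.
void update_row(const double* __restrict__ in,
                double* __restrict__ delta_row,
                double* __restrict__ weight_row,
                std::size_t n,
                double adjusted_error, double momentum) {
    for (std::size_t k = 0; k < n; ++k) {
        const double d = adjusted_error * in[k] + momentum * delta_row[k];
        delta_row[k] = d;
        weight_row[k] += d;
    }
}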

For gcc, autovectorization is activated using -O3 or -ftree-vectorize. You need a vector-capable CPU of course, so something like -march=core2 -msse4.1 or similar, depending on the target CPU. If you use -ftree-vectorizer-verbose=2 you get detailed explanations of why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html .
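For example, a compile line along these lines (the file name is just a placeholder) turns the vectorizer on and makes it report its decisions:

g++ -c -O3 -march=core2 -msse4.1 -ftree-vectorizer-verbose=2 network.cpp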

Even better, of course, is using the compiler intrinsics directly.
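Here is a minimal sketch of what the inner weight-update loop could look like with SSE2 intrinsics, processing two doubles per iteration. The function name and the assumption that one neuron's inputs, deltas and weights are contiguous double arrays are mine, not taken from your code:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

void update_row_sse2(const double* inputs, double* deltas, double* weights,
                     std::size_t n_inputs,
                     double adjusted_error, double momentum) {
    const __m128d err = _mm_set1_pd(adjusted_error);
    const __m128d mom = _mm_set1_pd(momentum);

    std::size_t k = 0;
    for (; k + 2 <= n_inputs; k += 2) {
        const __m128d in    = _mm_loadu_pd(inputs + k);
        const __m128d old_d = _mm_loadu_pd(deltas + k);
        const __m128d w     = _mm_loadu_pd(weights + k);

        // delta = adjusted_error * input + momentum * old_delta
        const __m128d delta = _mm_add_pd(_mm_mul_pd(err, in),
                                         _mm_mul_pd(mom, old_d));

        _mm_storeu_pd(deltas + k, delta);
        _mm_storeu_pd(weights + k, _mm_add_pd(w, delta));
    }

    // scalar tail for an odd number of inputs
    for (; k < n_inputs; ++k) {
        const double d = adjusted_error * inputs[k] + momentum * deltas[k];
        deltas[k] = d;
        weights[k] += d;
    }
}

With AVX (__m256d) or single-precision floats you get the 4- or 8-wide chunks mentioned above.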

answered by Gunther Piez