After profiling my backpropagation implementation, I found that it accounts for roughly 60% of my computation time. Before I start looking at parallel alternatives, I would like to see whether there is anything more I can do.
The `activate(const double input[])` function is profiled to take only ~5% of the time.
The `gradient(const double input)` function is implemented as follows:

```cpp
inline double gradient(const double input) { return (1 - (input * input)); }
```
The training function in question:
```cpp
void train(const vector<double>& data, const vector<double>& desired,
           const double learn_rate, const double momentum) {
    this->activate(data);
    this->calculate_error(desired);

    // adjust weights for layers
    const auto n_layers = this->config.size();
    const auto adjustment = (1 - momentum) * learn_rate;
    for (size_t i = 1; i < n_layers; ++i) {
        const auto& inputs = i > 1 ? this->outputs[i - 1] : data;
        const auto n_inputs = this->config[i - 1];
        const auto n_neurons = this->config[i];
        for (size_t j = 0; j < n_neurons; ++j) {
            const auto adjusted_error = adjustment * this->errors[i][j];
            for (size_t k = 0; k < n_inputs; ++k) {
                const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);
                this->deltas[i][j][k] = delta;
                this->weights[i][j][k] += delta;
            }
            // bias weight
            const auto delta = adjusted_error * this->bias + (momentum * this->deltas[i][j][n_inputs]);
            this->deltas[i][j][n_inputs] = delta;
            this->weights[i][j][n_inputs] += delta;
        }
    }
}
```
This question may be better suited to https://codereview.stackexchange.com/.
You can't avoid an O(n^2) algorithm if you want to train/use a NN. But it is perfectly suited to vector arithmetic. For example, with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a fused multiply-add instead of two separate instructions.
If you use a modern compiler, carefully reformulate the algorithm, and pass the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.
For gcc, autovectorization is enabled with `-O3` or `-ftree-vectorize`. You also need a vector-capable CPU target, something like `-march=core2 -msse4.1` or similar, depending on the target CPU. With `-ftree-vectorizer-verbose=2` you get detailed explanations of why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html .
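A sketch of what "reformulating the algorithm" might look like, assuming the per-neuron weights and deltas can be laid out in flat contiguous arrays instead of nested vectors (the `update_neuron` helper and the flat layout are my assumptions, not part of the original code). A plain, branch-free multiply-add loop over contiguous doubles is the shape the vectorizer recognizes:

```cpp
#include <cstddef>

// Hypothetical helper: update one neuron's weights from contiguous arrays.
// The __restrict qualifiers (a common compiler extension) promise the arrays
// don't alias, which helps gcc/clang vectorize the loop under -O3.
void update_neuron(double* __restrict weights, double* __restrict deltas,
                   const double* __restrict inputs, std::size_t n_inputs,
                   double adjusted_error, double momentum) {
    for (std::size_t k = 0; k < n_inputs; ++k) {
        // delta = adjusted_error * input + momentum * previous_delta
        const double d = adjusted_error * inputs[k] + momentum * deltas[k];
        deltas[k] = d;
        weights[k] += d;
    }
}
```

Compiled with `-O3` (or `-O2 -ftree-vectorize`) and a suitable `-march`, a loop of this shape is a good candidate for autovectorization; the nested `vector<vector<vector<double>>>` indirection in the original makes that much harder for the compiler to prove safe.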
Better, of course, is to use the compiler intrinsics directly.
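For illustration, here is a minimal sketch of the inner weight-update loop written with SSE2 intrinsics (baseline on x86-64, so no extra flags needed); the `update_weights_sse2` helper and the flat-array layout are my assumptions. It processes two doubles per iteration and handles the leftover element with a scalar tail:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Hypothetical helper: delta = adjusted_error * input + momentum * old_delta,
// two lanes at a time, then accumulate into the weights.
void update_weights_sse2(double* weights, double* deltas,
                         const double* inputs, std::size_t n_inputs,
                         double adjusted_error, double momentum) {
    const __m128d err = _mm_set1_pd(adjusted_error);
    const __m128d mom = _mm_set1_pd(momentum);
    std::size_t k = 0;
    for (; k + 2 <= n_inputs; k += 2) {
        const __m128d in  = _mm_loadu_pd(inputs + k);
        const __m128d old = _mm_loadu_pd(deltas + k);
        // two multiply-adds per pair of lanes
        const __m128d d = _mm_add_pd(_mm_mul_pd(err, in), _mm_mul_pd(mom, old));
        _mm_storeu_pd(deltas + k, d);
        const __m128d w = _mm_loadu_pd(weights + k);
        _mm_storeu_pd(weights + k, _mm_add_pd(w, d));
    }
    for (; k < n_inputs; ++k) {  // scalar tail for odd n_inputs
        const double d = adjusted_error * inputs[k] + momentum * deltas[k];
        deltas[k] = d;
        weights[k] += d;
    }
}
```

With AVX the same pattern widens to four doubles per iteration, and on FMA-capable CPUs `_mm_add_pd`/`_mm_mul_pd` collapse into a single fused multiply-add.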