 

Merge weights of same model trained on 2 different computers using tensorflow


I was doing some research on training deep neural networks using TensorFlow. I know how to train a model. My problem is that I have to train the same model on two different computers with different datasets and then save the model weights. Later I have to merge the two weight files somehow. I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?

Any help on this problem would be useful.

Thanks in advance.

Abhishek Venkataram asked Jan 20 '18


2 Answers

There is no meaningful way to merge the weights: you cannot average or otherwise combine them, because each training run converges to a different point in weight space, so the result would not mean anything. What you could do instead is combine predictions, but for that the training classes have to be the same.

This is not a programming limitation but a theoretical one.

Dr. Snoopy answered Sep 21 '22


It is better to merge the weight updates (gradients) during training and keep a common set of weights than to try to merge the weights after the individual trainings have completed. The two individually trained networks may each find a different optimum, and e.g. averaging their weights may give a network which performs worse on both datasets.

There are two things you can do:

  1. Look at 'data parallel training': distributing the forward and backward passes of the training process over multiple compute nodes, each of which has a subset of the entire data.

In this case typically:

  • each node propagates a minibatch forward through the network
  • each node propagates the loss gradient backwards through the network
  • a 'master node' collects gradients from minibatches on all nodes and updates the weights correspondingly
  • the master node then distributes the updated weights back to the compute nodes, so that each of them has the same set of weights

(There are variants of the above that avoid compute nodes idling too long while waiting for results from others.) The above assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.

Look at https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes.
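
For illustration, here is a minimal single-process sketch of the gradient-averaging step described above, written against the TF2 Keras API. The two 'nodes' are simulated inside one process, and names like 'node_batches' are made up for the example, not part of any TensorFlow API:

    import tensorflow as tf

    # A small model; in a real deployment each compute node holds a copy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    loss_fn = tf.keras.losses.MeanSquaredError()
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    # One minibatch per simulated node (in reality: different data subsets
    # on different machines). Random data here, purely for illustration.
    node_batches = [
        (tf.random.normal((8, 4)), tf.random.normal((8, 1))),
        (tf.random.normal((8, 4)), tf.random.normal((8, 1))),
    ]

    # Each 'node' computes gradients on its own minibatch ...
    per_node_grads = []
    for x, y in node_batches:
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x))
        per_node_grads.append(tape.gradient(loss, model.trainable_variables))

    # ... and the 'master' averages them per variable and applies a single
    # update, so every node ends up with the same weights after the step.
    avg_grads = [tf.reduce_mean(tf.stack(g), axis=0)
                 for g in zip(*per_node_grads)]
    optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))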


  2. If you really have to train the networks separately, look at ensembling; see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then e.g. use the average or maximum of the outputs of both networks as a combined classifier / predictor; a short sketch follows below.
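
As a sketch of that ensembling idea (the file names here are hypothetical, and both models are assumed to output probabilities over the same set of classes):

    import numpy as np
    import tensorflow as tf

    # Load the two independently trained models (hypothetical file names).
    model_a = tf.keras.models.load_model("model_machine_a.h5")
    model_b = tf.keras.models.load_model("model_machine_b.h5")

    def ensemble_predict(x):
        # Average the per-class probabilities of both models and pick
        # the most likely class for each input.
        probs = (model_a.predict(x) + model_b.predict(x)) / 2.0
        return np.argmax(probs, axis=-1)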
Andre Holzner answered Sep 22 '22