 

Merge weights of same model trained on 2 different computers using tensorflow


I was doing some research on training deep neural networks using TensorFlow. I know how to train a model. My problem is that I have to train the same model on two different computers with different datasets and then save the model weights. Later I have to merge the two weight files somehow. I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?

Any help on this problem would be useful.

Thanks in advance.

Abhishek Venkataram asked Jan 20 '18


2 Answers

There is no meaningful way to merge the weights: you cannot average or otherwise combine them, because each training run converges to a different point in weight space, so the result would not mean anything. What you could do instead is combine predictions, but for that the training classes have to be the same.

This is not a programming limitation but a theoretical one.

Dr. Snoopy answered Sep 21 '22


It is better to merge the weight updates (gradients) during training and keep a common set of weights than to try to merge the weights after the individual trainings have completed. The two individually trained networks may each find a different optimum, and e.g. averaging their weights may give a network which performs worse on both datasets.

There are two things you can do:

  1. Look at 'data parallel training': distributing the forward and backward passes of the training process over multiple compute nodes, each of which has a subset of the entire data.

In this case typically:

  • each node propagates a minibatch forward through the network
  • each node propagates the loss gradient backwards through the network
  • a 'master node' collects gradients from minibatches on all nodes and updates the weights correspondingly
  • the master node then distributes the updated weights back to the compute nodes, so that each of them has the same set of weights

(There are variants of the above that avoid compute nodes idling too long while waiting for results from others.) The above assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.

Look at https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes.
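
For illustration, here is a minimal single-process sketch of the gradient-averaging step described above, written against the TF2 Keras API. The two 'nodes' are simulated inside one process, and names like 'node_batches' are made up for the example, not part of any TensorFlow API:

    import tensorflow as tf

    # A small model; in a real deployment each compute node holds a copy.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    loss_fn = tf.keras.losses.MeanSquaredError()
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    # One minibatch per simulated node (in reality: different data subsets
    # on different machines). Random data here, purely for illustration.
    node_batches = [
        (tf.random.normal((8, 4)), tf.random.normal((8, 1))),
        (tf.random.normal((8, 4)), tf.random.normal((8, 1))),
    ]

    # Each 'node' computes gradients on its own minibatch ...
    per_node_grads = []
    for x, y in node_batches:
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x))
        per_node_grads.append(tape.gradient(loss, model.trainable_variables))

    # ... and the 'master' averages them per variable and applies a single
    # update, so every node ends up with the same weights after the step.
    avg_grads = [tf.reduce_mean(tf.stack(g), axis=0)
                 for g in zip(*per_node_grads)]
    optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))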


  2. If you really have to train the networks separately, look at ensembling; see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then e.g. use the average or maximum of the outputs of both networks as a combined classifier / predictor; a short sketch follows below.
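
As a sketch of that ensembling idea (the file names here are hypothetical, and both models are assumed to output probabilities over the same set of classes):

    import numpy as np
    import tensorflow as tf

    # Load the two independently trained models (hypothetical file names).
    model_a = tf.keras.models.load_model("model_machine_a.h5")
    model_b = tf.keras.models.load_model("model_machine_b.h5")

    def ensemble_predict(x):
        # Average the per-class probabilities of both models and pick
        # the most likely class for each input.
        probs = (model_a.predict(x) + model_b.predict(x)) / 2.0
        return np.argmax(probs, axis=-1)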
Andre Holzner answered Sep 22 '22