I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first-layer weight matrix will be (1,000,000,000, 512). This weight matrix is too large to be stored on one box.
Is there currently any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Thanks Olivier and keveman. Let me add more detail about my problem. The examples are very sparse and all features are binary: 0 or 1. The parameter weights look like tf.Variable(tf.truncated_normal([1000000000, 512], stddev=0.1)).
The solutions keveman gave seem reasonable, and I will update with results after trying them.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each example in the data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
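As a rough illustration, here is a minimal sketch of the sparse-lookup approach, written against the TF 1.x-style API used in the question. The vocabulary size is shrunk so the snippet runs on a single machine, and the SparseTensor contents are made-up example data.

```python
import tensorflow as tf

# Illustrative sizes only: a small vocabulary stands in for the 1-billion
# feature space so the snippet can run locally.
vocab_size = 1000
embedding_dim = 512

# The "first layer weight matrix", treated as an embedding table.
weights = tf.Variable(
    tf.truncated_normal([vocab_size, embedding_dim], stddev=0.1))

# Each example activates only a handful of binary features; a mini-batch of
# 2 examples is represented as a SparseTensor of active feature ids.
feature_ids = tf.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 0]],   # (example, position within example)
    values=[7, 42, 3],                  # ids of the active features
    dense_shape=[2, 2])

# Sum the embeddings of the active features; because the features are binary
# (0/1), this equals the first-layer matmul for each example.
first_layer = tf.nn.embedding_lookup_sparse(
    weights, feature_ids, sp_weights=None, combiner="sum")
```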
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and place the shards on different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
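For the sharded case, a minimal sketch might look like the following (again TF 1.x-style). The "/job:ps/task:i" device strings and the shard count are hypothetical placeholders; in a real deployment they would correspond to parameter-server tasks defined by your tf.train.ClusterSpec.

```python
import tensorflow as tf

vocab_size = 1000        # illustrative; stands in for 1 billion rows
embedding_dim = 512
num_shards = 4           # hypothetical number of parameter servers

# Create one shard of the weight matrix on each (hypothetical) device.
shards = []
for i in range(num_shards):
    with tf.device("/job:ps/task:%d" % i):
        shards.append(
            tf.Variable(
                tf.truncated_normal(
                    [vocab_size // num_shards, embedding_dim], stddev=0.1),
                name="weights_shard_%d" % i))

# Passing the list of shards as `params` makes the lookup route each id to
# the shard that owns it (here using the default "mod" partition strategy).
ids = tf.constant([7, 42, 3])
embedded = tf.nn.embedding_lookup(shards, ids, partition_strategy="mod")
```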
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.
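If you do need the dense product, one hand-rolled approach is to split the input's columns to match the row shards of the weight matrix, multiply each block on the device holding that shard, and sum the partial results. The sketch below uses the same hypothetical device layout as above and is only an outline, not a ready-made recipe.

```python
import tensorflow as tf

vocab_size, embedding_dim, num_shards = 1000, 512, 4   # illustrative sizes
rows_per_shard = vocab_size // num_shards

# Row-sharded weight matrix.
shards = [
    tf.Variable(tf.truncated_normal([rows_per_shard, embedding_dim], stddev=0.1))
    for _ in range(num_shards)
]

# x has shape [batch, vocab_size]; split its columns to match the row shards.
x = tf.placeholder(tf.float32, [None, vocab_size])
x_blocks = tf.split(x, num_shards, axis=1)

# Multiply each block where its shard lives (device strings are hypothetical),
# then sum the partial products to get the full [batch, embedding_dim] result.
partials = []
for i in range(num_shards):
    with tf.device("/job:ps/task:%d" % i):
        partials.append(tf.matmul(x_blocks[i], shards[i]))

y = tf.add_n(partials)
```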