Scalable, Efficient Hierarchical Softmax in Tensorflow?

Tags:

I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to do this to both be scalable to large class counts and efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabs when using a 2-level tree where each node sqrt(N) classes. I'm interested also in a more general version for an arbitrary depth tree with an arbitrary branching factor.

There are a few options that I see here:

1) Run tf.gather for every batch, where we gather the indices and splits. This creates problems with large batch sizes and fat trees where now the coefficients are being duplicated a lot, leading to OOM errors.

2) Similar to #1, we could use tf.embedding_lookup which would keep help with OOM errors but now keeps everything on the CPU and slows things down quite a bit.

3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup due to the serialization.

Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?

358

asked May 23 '17 16:05

Wesley Tansey

1 Answers

You mention that you want GPU-class performance:

but now keeps everything on the CPU and slows things down quite a bit

and wish to use 300-unit hidden size and 10M-word dictionaries.

This means that (assuming float32), you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradient for the output layer.

Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.

Realistically, you'll need a lot more GPU memory, because you'll also need to store:

other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead

Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.

However, you now have another problem:

To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.

What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.

The problem is that even if the target words in a batch belong to different GPUs, you'll still have the same performance as in the worst-case scenario, if you want to do this computation in TensorFlow (This is because TensorFlow is a "specify-and-run" framework -- the computational graph is the same for the best case and the worst case)

What is the best way to do this to both be scalable to large class counts and efficient?

The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.

Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.

If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:

400 == batch size << number of classes == 3K

In this case, you could simply use gather or embedding_lookup (Although the copying is inefficient)
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.

In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.

answered Oct 02 '22 19:10

MWB

Related questions
                            
                                How can I use the Keras OCR example?
                            
                                Full text search(Postgres) Vs Elastic search
                            
                                Is there a way to make Visual Studio Code recognize HTML syntax in PHP files
                            
                                PyInstaller, how to include data files from an external package that was installed by pip?
                            
                                Is this a good way to determine OS Architecture?
                            
                                Building a custom Linux Live CD
                            
                                Hibernate-like framework for C++ [closed]
                            
                                CORBA from .NET - (dis)recommended libraries? [closed]
                            
                                How do I get a book graphic and description from the Amazon Book API?
                            
                                Is there any Ruby or Python interpreter for Lego Mindstorm?
                            
                                Where do I put my C# delegate declaration, in a file of its own?
                            
                                How do I preserve a complex object across Activity restarts?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Scalable, Efficient Hierarchical Softmax in Tensorflow?

Tags:

Wesley Tansey

People also ask

1 Answers

MWB

Recent Activity

Donate For Us