
"Quantize" Tensorflow Graph to float16

Tags:

tensorflow

How do you convert a TensorFlow graph from using float32 to float16? Currently there are graph optimizations for quantization and for conversion to eight-bit ints.

Trying to load float32 weights into a float16 graph fails with:

DataLossError (see above for traceback): Invalid size in bundle entry: key model/conv5_1/biases; stored size 1536; expected size 768
     [[Node: save/RestoreV2_16 = RestoreV2[dtypes=[DT_HALF], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_16/tensor_names, save/RestoreV2_16/shape_and_slices)]]
     [[Node: save/RestoreV2_3/_39 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_107_save/RestoreV2_3", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Asked by Alex Rothberg on Mar 14 '17

People also ask

Are TFLite models quantized?

TFLite models are often quantized, which makes them somewhat less accurate than the original models. Quantization-aware training addresses this: during training the weights are quantized to int8 in the forward pass and dequantized back to 32-bit float, so the quantization error acts like noise that the model learns to compensate for.
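A minimal sketch of quantization-aware training with the TensorFlow Model Optimization Toolkit, assuming you already have a compiled tf.keras model named model plus training data train_images and train_labels (all placeholders, not from this thread):

import tensorflow_model_optimization as tfmot

# Wrap the existing Keras model with fake-quantization nodes that simulate
# int8 rounding in the forward pass while gradients stay in float32.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Recompile and fine-tune as usual; the model learns to tolerate the quantization noise.
q_aware_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
q_aware_model.fit(train_images, train_labels, epochs=1)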

Does quantization reduce model size?

Quantization works by reducing the precision of the numbers used to represent a model's parameters, which by default are 32-bit floating point numbers. This results in a smaller model size and faster computation.
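As a rough back-of-the-envelope example (illustrative numbers, not from this thread):

num_params = 10_000_000              # hypothetical model with 10M parameters
print(num_params * 4 / 1e6, 'MB')    # float32: ~40 MB
print(num_params * 2 / 1e6, 'MB')    # float16: ~20 MB
print(num_params * 1 / 1e6, 'MB')    # int8:    ~10 MB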

What is FP16 quantization?

Post-training float16 quantization reduces TensorFlow Lite model sizes (up to 50%), while sacrificing very little accuracy. It quantizes model constants (like weights and bias values) from full precision floating point (32-bit) to a reduced precision floating point data type (IEEE FP16).
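A minimal sketch of post-training float16 quantization with the TFLite converter, assuming a SavedModel exported to 'saved_model_dir' (placeholder path):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]   # store weights as IEEE FP16
tflite_fp16_model = converter.convert()

with open('model_fp16.tflite', 'wb') as f:
    f.write(tflite_fp16_model)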

What is dynamic range quantization?

To further reduce latency during inference, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference.
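The corresponding sketch for dynamic range quantization, under the same SavedModel assumption; with only the default optimization flag set (no representative dataset and no float16 target), the converter applies dynamic-range quantization:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # 8-bit weights; activations quantized on the fly
tflite_dynamic_model = converter.convert()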


1 Answer

My solution is definitely not the best, nor the most straightforward, but since nobody else has posted anything:

What I did was train the network at full precision and save the weights in a checkpoint. Then I built a copy of the network with all of the desired variables set to a dtype of tf.float16 and with all training nodes removed. Finally, I loaded and cast the variables the following way:

import numpy as np
import tensorflow as tf

# `sess` is an existing tf.Session built for the float16 copy of the graph.
# Names of all variables stored in the full-precision checkpoint.
previous_variables = [
  var_name for var_name, _
  in tf.contrib.framework.list_variables('path-to-checkpoint-file')]

sess.run(tf.global_variables_initializer())

# For every variable in the float16 graph that also exists in the checkpoint,
# load the stored value and cast float32 weights to float16 before assigning.
for variable in tf.global_variables():
    if variable.op.name in previous_variables:
        var = tf.contrib.framework.load_variable(
            'path-to-checkpoint-file', variable.op.name)
        if var.dtype == np.float32:
            tf.add_to_collection('assignOps', variable.assign(
                tf.cast(var, tf.float16)))
        else:
            tf.add_to_collection('assignOps', variable.assign(var))

# Run all assignment ops at once.
sess.run(tf.get_collection('assignOps'))

This obviously has issues if there are float32 tensors that you don't want to convert; luckily I don't have any, since I want to convert all my nodes to float16 precision. If you do have such tensors, you could filter further with additional if statements, as in the sketch after this paragraph. I hope this answers your question.
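For example, a sketch of such a filter as a variant of the loop above: keep a hypothetical list of variable-name substrings that should stay float32 (keep_float32 is a placeholder; adjust it to your graph) and only cast the rest.

keep_float32 = ['global_step']   # hypothetical substrings of names to leave untouched
for variable in tf.global_variables():
    if variable.op.name in previous_variables:
        var = tf.contrib.framework.load_variable(
            'path-to-checkpoint-file', variable.op.name)
        if (var.dtype == np.float32
                and not any(s in variable.op.name for s in keep_float32)):
            tf.add_to_collection('assignOps', variable.assign(
                tf.cast(var, tf.float16)))
        else:
            tf.add_to_collection('assignOps', variable.assign(var))
sess.run(tf.get_collection('assignOps'))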

Answered by Jendrik on Oct 07 '22