I just tried using a TPU in Google Colab, and I wanted to see how much faster the TPU is than a GPU. Surprisingly, I got the opposite result.
The network is as follows:
  random_image = tf.random_normal((100, 100, 100, 3))
  result = tf.layers.conv2d(random_image, 32, 7)
  result = tf.reduce_sum(result)
Performance results:
CPU: 8s
GPU: 0.18s
TPU: 0.50s
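I wonder why. The complete code for the TPU is as follows:
import os
import time
import tensorflow as tf
# Assumption: Colab exposes the TPU address in COLAB_TPU_ADDR.
tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
def calc():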
  random_image = tf.random_normal((100, 100, 100, 3))
  result = tf.layers.conv2d(random_image, 32, 7)
  result = tf.reduce_sum(result)
  return result
tpu_ops = tf.contrib.tpu.batch_parallel(calc, [], num_shards=8)
session = tf.Session(tpu_address)
try:
  print('Initializing global variables...')
  session.run(tf.global_variables_initializer())
  print('Warming up...')
  session.run(tf.contrib.tpu.initialize_system())
  print('Profiling')
  start = time.time()
  session.run(tpu_ops)
  end = time.time()
  elapsed = end - start
  print(elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()
Takeaways: The training times show that the TPU takes considerably longer to train than the GPU when the batch size is small. As the batch size increases, TPU performance becomes comparable to that of the GPU.
According to Google, the TPU is 15x to 30x faster than contemporary GPUs and CPUs on production AI applications that use neural network inference.
An individual Edge TPU can perform 4 trillion operations per second (4 TOPS) using only 2 watts of power; in other words, you get 2 TOPS per watt. For example, the Edge TPU can execute state-of-the-art mobile vision models such as MobileNet V2 at almost 400 frames per second, in a power-efficient manner.
Benchmarking devices properly is hard, so please take everything you learn from these examples with a grain of salt. It's better in general to compare specific models you are interested in (e.g. running an ImageNet network) to understand performance differences. That said, I understand it's fun to do this, so...
Larger models will illustrate the TPU and GPU performance better. Your example also includes the compilation time in the cost of the TPU call: every call after the first for a given program and shape will be cached, so you will want to run tpu_ops once before starting the timer unless you want to capture the compilation time.
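The warm-up advice above can be sketched as a small timing helper. This is a minimal sketch, not part of the original answer: `time_after_warmup` and `make_fake_op` are hypothetical names, and the fake op only simulates a slow first call so the snippet runs without a TPU.

```python
import time


def time_after_warmup(fn, warmup=1, repeats=5):
    """Run fn untimed `warmup` times, then return the fastest of `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def make_fake_op(first_call_delay=0.05):
    """Hypothetical stand-in for session.run(tpu_ops): only the first call is slow."""
    state = {"calls": 0}

    def op():
        state["calls"] += 1
        if state["calls"] == 1:
            time.sleep(first_call_delay)  # simulated compilation

    return op


# The slow first call is spent in the warm-up, so it does not affect the result.
print(time_after_warmup(make_fake_op()))
```

With the code from the question, the equivalent call would be `time_after_warmup(lambda: session.run(tpu_ops))`.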
Currently, each call to a TPU function copies the weights to the TPU before it can start running, which affects small operations more significantly. Here's an example that runs a loop on the TPU before returning to the CPU. With it, you can actually run 100 iterations of this function in 0.55s.
import os
import time
import tensorflow as tf
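def calc(n):
  # n is this shard's slice of the iteration counts, so n[0] is the loop count.
  img = tf.random_normal((128, 100, 100, 3))
  def body(_):
    # One iteration: convolve the image and reduce to a scalar.
    result = tf.layers.conv2d(img, 32, 7)
    result = tf.reduce_sum(result)
    return result
  # Run body n[0] times on the TPU within a single call from the CPU.
  return tf.contrib.tpu.repeat(n[0], body, [0.0])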
session = tf.Session('grpc://' + os.environ['COLAB_TPU_ADDR'])
try:
  print('Initializing TPU...')
  session.run(tf.contrib.tpu.initialize_system())
  for i in [1, 10, 100, 500]:
    tpu_ops = tf.contrib.tpu.batch_parallel(calc, [[i] * 8], num_shards=8)
    print('Warming up...')
    session.run(tf.global_variables_initializer())
    session.run(tpu_ops)
    print('Profiling')
    start = time.time()
    session.run(tpu_ops)
    end = time.time()
    elapsed = end - start
    print(i, elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()
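To put the 0.55s figure in perspective, the amortized cost per iteration can be worked out directly. This is a rough sketch using only the two timings quoted in this thread. The comparison with the 0.50s single call from the question is not like-for-like, since that run used a batch of 100 rather than 128 and included compilation time. The function name is hypothetical.

```python
def per_iteration_seconds(total_seconds, iterations):
    """Average wall-clock time per loop iteration."""
    return total_seconds / iterations


looped = per_iteration_seconds(0.55, 100)  # 100 iterations inside one TPU call
single = 0.50                              # single call timed in the question

print(f"{looped * 1000:.1f} ms per iteration when looping on the TPU")
print(f"roughly {single / looped:.0f}x less than the single-call timing")
```

Most of the time in the single-call measurement is therefore fixed per-call cost rather than the convolution itself.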