 

Tensorflow Serving Performance Very Slow vs Direct Inference

I am running in the following scenario:

  • Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
  • 1 Tensorflow Serving Pod
  • 4 Inference Client Pods

The inference clients each grab frames from one of 4 separate cameras (1 camera per client) and pass them to TF-Serving for inference, in order to understand what is seen on the video feeds.

I have previously been doing inference inside each Inference Client Pod individually by calling TensorFlow directly, but that hasn't been good for the graphics card's RAM. TensorFlow Serving was introduced to the mix quite recently in order to save GPU RAM, since we no longer load duplicate copies of the model onto the graphics card.

And the performance is not looking good. For 1080p images it looks like this:

  • Direct TF: 20 ms for input tensor creation, 70 ms for inference.
  • TF-Serving: 80 ms for gRPC serialization, 700-800 ms for inference.

The TF-Serving pod is the only one that has access to the GPU and it is bound exclusively. Everything else operates on CPU.

Are there any performance tweaks I could do?

The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.

Many thanks in advance!

asked Apr 02 '20 by Wojtek Turowicz


1 Answer

This is from the TF Serving documentation:

Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.

From my own experience, I've found TF Serving useful for providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving, which come out of the box, save you a lot of time and are great additions.
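For instance, multiple models and pinned versions can be declared in a model config file passed to the server with --model_config_file. A minimal sketch (the model name and base path below are placeholders for your own setup) could look like this:

    model_config_list {
      config {
        name: "faster_rcnn"
        base_path: "/models/faster_rcnn"
        model_platform: "tensorflow"
        model_version_policy {
          specific { versions: 1 }
        }
      }
    }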

Additionally, I would recommend batching your requests if you haven't already. I would also suggest experimenting with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM and OMP_NUM_THREADS settings for TF Serving: inter-op parallelism controls how many ops run concurrently, intra-op parallelism controls the thread pool used inside individual ops, and OMP_NUM_THREADS controls the OpenMP threads used by the MKL builds. A client-side batching sketch is shown below.
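Here is a minimal sketch of batching on the client side over gRPC, assuming the model was exported from the TF Object Detection API (input key "inputs", output key "detection_boxes", signature "serving_default") and is served under the name "faster_rcnn" at tf-serving:8500 -- adjust these assumptions to your deployment:

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    channel = grpc.insecure_channel("tf-serving:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Collect one frame per camera into a single [batch, H, W, 3] uint8 tensor
    # (dummy frames here; in practice these come from the four camera feeds).
    frames = np.random.randint(0, 255, size=(4, 1080, 1920, 3), dtype=np.uint8)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "faster_rcnn"
    request.model_spec.signature_name = "serving_default"
    request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(frames))

    # One round trip for all four cameras instead of four separate requests.
    result = stub.Predict(request, 10.0)  # 10 second deadline
    boxes = tf.make_ndarray(result.outputs["detection_boxes"])

TF Serving can also batch on the server side: starting the model server with --enable_batching (optionally tuned via --batching_parameters_file) lets it merge concurrent requests from your four clients into a single GPU pass, which tends to help throughput more than per-request latency.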

answered Sep 28 '22 by Amir Mousavi