 

Fastest way to run recurrent neural network (inference) on mobile device

What I have: A trained recurrent neural network in Tensorflow.

What I want: A mobile application that can run this network as fast as possible (inference mode only, no training).

I believe there are multiple ways to accomplish my goal, but I would like your feedback/corrections and additions, because I have never done this before.

  1. Tensorflow Lite. Pro: Straightforward, available on Android and iOS. Contra: Probably not the fastest method, right?
  2. TensorRT. Pro: Very fast + I can write custom C code to make it faster. Contra: Built for Nvidia devices, so there is no easy way to run it on Android and iOS, right?
  3. Custom Code + Libraries like OpenBLAS. Pro: Probably very fast, and it should be possible to link against it on Android and iOS (if I am not mistaken). Contra: Is it of much use for recurrent neural networks? Does it really work well on Android + iOS?
  4. Re-implement Everything. I could also rewrite the whole computation in C/C++, which shouldn't be too hard for recurrent neural networks. Pro: Probably the fastest method, because I can optimize everything. Contra: Will take a long time, and if the network changes I have to update my code as well (although I am willing to do it this way if it really is the fastest). Also, how fast can I make calls to (C/C++) libraries on Android? Am I limited by the Java interfaces?

Some details about the mobile application: it will take a sound recording of the user, do some processing (like speech-to-text), and output the text. I do not want a solution that is merely "fast enough", but the fastest option, because this will run over very large sound files. So almost every speed improvement counts. Do you have any advice on how I should approach this problem?

Last question: if I try to hire somebody to help me out, should I look for an Android/iOS, embedded, or Tensorflow type of person?

asked Mar 09 '18 by user667804




1 Answer

1. Tensorflow Lite

Pro: it uses GPU optimizations on Android; it is fairly easy to incorporate into a Swift/Objective-C app, and very easy into Java/Android (just add one line to build.gradle); you can also convert the TF model to CoreML.

Cons: if you use the C++ library, you will have some issues adding TFLite to your Android/Java project (there is no native way to use such a library without JNI); no GPU support on iOS (the community is working on MPS integration, though).

Also, here is a reference to a TFLite speech-to-text demo app; it could be useful.
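For reference, feeding a converted model through the TFLite C++ API looks roughly like this. A minimal sketch: the model filename and the assumption of a single float input/output tensor are placeholders, not taken from the demo app.

    #include <cstdio>
    #include <memory>

    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    int main() {
      // Load the converted model (placeholder path).
      auto model = tflite::FlatBufferModel::BuildFromFile("speech2text.tflite");
      if (!model) return 1;

      // Wire up an interpreter with the built-in kernels.
      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);
      interpreter->AllocateTensors();

      // Copy your audio feature frames into the first input tensor here.
      float* input = interpreter->typed_input_tensor<float>(0);
      (void)input;  // ... fill with feature frames ...

      // Run inference and read back the first output tensor.
      interpreter->Invoke();
      float* logits = interpreter->typed_output_tensor<float>(0);
      std::printf("first logit: %f\n", logits[0]);
      return 0;
    }

On the Java side the equivalent is org.tensorflow.lite.Interpreter#run, pulled in via the tensorflow-lite Gradle dependency.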

2. TensorRT

TensorRT uses cuDNN, which in turn uses the CUDA library. There is CUDA for Android, but I am not sure whether it supports the full functionality.

3. Custom code + Libraries

I would recommend using the Android NNAPI (Neural Networks API) and CoreML; if you need to go deeper, you can use the Eigen library for linear algebra. However, writing your own custom code is not beneficial in the long term: you would need to support/test/improve it, which is a bigger deal than performance.

4. Re-implement Everything

This option is very similar to the previous one: implementing your own RNN (LSTM) should be fine as long as you know what you are doing; just use one of the linear algebra libraries (e.g. Eigen).
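To make that concrete, here is a minimal sketch of a hand-rolled LSTM step on top of Eigen. All names are illustrative; a real speech network would add stacked layers, projections, and probably quantization.

    #include <Eigen/Dense>

    using Eigen::MatrixXf;
    using Eigen::VectorXf;

    // Element-wise logistic sigmoid.
    static VectorXf sigmoid(const VectorXf& v) {
      return (1.0f / (1.0f + (-v.array()).exp())).matrix();
    }

    // One step of a standard LSTM cell: four gates, each with an
    // input weight (W), a recurrent weight (U) and a bias (b).
    struct LstmCell {
      MatrixXf Wi, Wf, Wo, Wg;  // input-to-hidden weights
      MatrixXf Ui, Uf, Uo, Ug;  // hidden-to-hidden weights
      VectorXf bi, bf, bo, bg;  // biases

      // Updates hidden state h and cell state c in place for one frame x.
      void step(const VectorXf& x, VectorXf& h, VectorXf& c) const {
        VectorXf i = sigmoid(Wi * x + Ui * h + bi);  // input gate
        VectorXf f = sigmoid(Wf * x + Uf * h + bf);  // forget gate
        VectorXf o = sigmoid(Wo * x + Uo * h + bo);  // output gate
        VectorXf g = (Wg * x + Ug * h + bg).array().tanh().matrix();  // candidate
        c = (f.array() * c.array() + i.array() * g.array()).matrix();
        h = (o.array() * c.array().tanh()).matrix();
      }
    };

Eigen vectorizes the matrix-vector products for you; whether this actually beats TFLite's fused LSTM kernels is something you would have to measure.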

The overall recommendation would be to:

  • try to do it server-side: use some lossy compression and server-side speech-to-text;
  • try using Tensorflow Lite: measure performance, find bottlenecks, try to optimize;
  • if some parts of TFLite are too slow, reimplement them as custom operations (and make a PR to Tensorflow); see the sketch after this list;
  • if the bottlenecks are at the hardware level, go back to the first suggestion.
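To illustrate the custom-operation route: a TFLite custom kernel is a pair of prepare/invoke callbacks registered under a name. A rough sketch follows; the "FastTanh" op and its body are made up for illustration, and exact header paths vary across TF versions.

    #include <cmath>

    #include "tensorflow/lite/c/common.h"  // TfLiteContext, TfLiteNode, ...

    // Shape inference: the output has the same shape as the input.
    TfLiteStatus FastTanhPrepare(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      return context->ResizeTensor(context, output,
                                   TfLiteIntArrayCopy(input->dims));
    }

    // The kernel itself, applied element-wise over the float input.
    TfLiteStatus FastTanhInvoke(TfLiteContext* context, TfLiteNode* node) {
      const TfLiteTensor* input = &context->tensors[node->inputs->data[0]];
      TfLiteTensor* output = &context->tensors[node->outputs->data[0]];
      int64_t n = 1;
      for (int d = 0; d < input->dims->size; ++d) n *= input->dims->data[d];
      for (int64_t i = 0; i < n; ++i)
        output->data.f[i] = std::tanh(input->data.f[i]);
      return kTfLiteOk;
    }

    TfLiteRegistration* Register_FAST_TANH() {
      // {init, free, prepare, invoke}; init/free are unused here.
      static TfLiteRegistration r = {nullptr, nullptr, FastTanhPrepare,
                                     FastTanhInvoke};
      return &r;
    }

    // At interpreter-build time, before InterpreterBuilder runs:
    //   resolver.AddCustom("FastTanh", Register_FAST_TANH());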
answered Oct 01 '22 by Stanislav Levental