How much faster is tensorflow-gpu with AVX and AVX2 compared to without AVX and AVX2?
I tried to find an answer using Google, but had no success. It's hard to recompile tensorflow-gpu
for Windows, so I want to know whether it's worth it.
For floating-point code, the main practical difference between AVX and AVX2 builds is the availability of FMA (fused multiply-add) instructions, which ship alongside AVX2 on most CPUs – both AVX and AVX2 use 256-bit FP registers. The main advantage of the AVX2 ISA itself is for integer code/data types, where you can expect up to a 2x speedup; for FP code, roughly 8% is a good speedup of AVX2 over AVX.
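If you're unsure whether your CPU even supports these extensions, you can check from Python. Below is a minimal sketch using the third-party py-cpuinfo package (`pip install py-cpuinfo`, not part of TensorFlow); the extension names are the standard CPUID flag spellings. The stock wheel also logs a message at import time along the lines of "Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA" when the binary was built without them.

```python
# Minimal sketch: check CPU support for AVX / AVX2 / FMA.
# Assumes the third-party py-cpuinfo package is installed (pip install py-cpuinfo).
import cpuinfo

flags = set(cpuinfo.get_cpu_info()["flags"])
for ext in ("avx", "avx2", "fma"):
    print(ext, "supported" if ext in flags else "not supported")
```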
Generally speaking, GPUs are roughly 3x faster than CPUs for deep learning models, so if most of your model runs on the GPU, CPU instruction-set optimizations like AVX2 matter less.
If your computation is one giant matmul on CPU, you will get about a 3x speed-up on a Xeon v3 (see the benchmark linked here). But it's also possible to see no speed-up at all, presumably because not enough time is spent in high-arithmetic-intensity ops executed on the CPU.
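If you want to measure the effect on your own machine rather than rely on the table below, a single large matmul is the easiest micro-benchmark. Here is a minimal sketch using the TF 1.x session API (the matrix size 8192 is an arbitrary illustrative choice); run it once with the stock wheel and once with an AVX2/FMA build and compare the timings:

```python
# Minimal sketch (TF 1.x API): time one large matmul on the CPU.
import time
import tensorflow as tf

n = 8192
with tf.device("/cpu:0"):
    a = tf.random_normal([n, n])
    b = tf.random_normal([n, n])
    c = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(c)                       # warm-up run
    start = time.time()
    sess.run(c)                       # timed run
    print("matmul time: %.3f s" % (time.time() - start))
```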
Here's a table from the "High Performance Models" guide for training ResNet-50 on CPU with different optimizations. It looks like you can get a ~2.5x speed-up with the best settings:
| Optimization | Data Format | Images/Sec (step time) | Intra threads | Inter threads |
| ------------ | ----------- | ---------------------- | ------------- | ------------- |
| AVX2 | NHWC | 6.8 (147 ms) | 4 | 0 |
| MKL | NCHW | 6.6 (151 ms) | 4 | 1 |
| MKL | NHWC | 5.95 (168 ms) | 4 | 1 |
| AVX | NHWC | 4.7 (211 ms) | 4 | 0 |
| SSE3 | NHWC | 2.7 (370 ms) | 4 | 0 |
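For completeness, the intra/inter thread counts shown in the table are set through the session config. Below is a minimal sketch of the TF 1.x API; the values 4 and 1 simply mirror the MKL rows above, and the right numbers depend on your core count:

```python
# Minimal sketch (TF 1.x API): set the thread pools referenced in the table above.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=4,  # threads available inside a single op (e.g. one conv/matmul)
    inter_op_parallelism_threads=1,  # threads for running independent ops concurrently
)
sess = tf.Session(config=config)
```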
If you are able to compile an optimized version for Windows, it would help to mention it in this issue: https://github.com/yaroslavvb/tensorflow-community-wheels/issues/13 -- it seems there's some demand for such a build.