
What optimization hints can I give to the compiler/JIT?

I've already profiled, and am now looking to squeeze every possible bit of performance possible out of my hot-spot.

I know about [MethodImplOptions.AggressiveInlining] and the ProfileOptimization class. Are there any others?


[Edit] I just discovered [TargetedPatchingOptOut] as well. Nevermind, apparently that one is not needed.
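For context, a minimal sketch of how the two hints mentioned above are applied; the profile root path and profile file name are placeholder values:

```csharp
using System;
using System.Runtime;
using System.Runtime.CompilerServices;

static class JitHints
{
    static void Main()
    {
        // Opt in to multi-core background JIT / startup profiling.
        // The directory and profile name here are placeholders.
        ProfileOptimization.SetProfileRoot(@"C:\MyApp\JitProfiles");
        ProfileOptimization.StartProfile("Startup.profile");

        Console.WriteLine(Square(21));  // 441
    }

    // A hint, not a guarantee: the JIT may still decline to inline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Square(int x) => x * x;
}
```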

Asked Apr 30 '13 by BlueRaja - Danny Pflughoeft



2 Answers

Yes, there are more tricks :-)

I've actually done quite a bit of research on optimizing C# code. So far, these are the most significant findings:

  1. Funcs and Actions that are passed directly are often inlined by the JIT'ter. Note that you shouldn't store them in a variable first, because then they are invoked as delegates. See also this post for more details.
  2. Be careful with overloads. Calling Equals without using IEquatable<T> is usually a bad plan - so if you use e.g. a hash table, be sure to implement the right overloads and interfaces, because it'll save you a ton of performance.
  3. Generics called from other classes are never inlined. The reason for this is the "magic" outlined here.
  4. If you use a data structure, make sure to try using an array instead :-) Really, these things are fast as hell compared to... well, just about anything, I suppose. I've optimized quite a few things by using my own hash tables and by using arrays instead of lists.
  5. In a lot of cases, table lookups are faster than computing things or using constructions like vtable lookups, switches, multiple if statements and even calculations. This is also a good trick if you have branches; failed branch prediction can often become a big pain. See also this post - this is a trick I use quite a lot in C# and it works great in many cases. Oh, and lookup tables are arrays, of course.
  6. Experiment with making (small) classes structs. Because of the nature of value types, some optimizations are different for structs than for classes. For example, method calls are simpler, because the compiler knows exactly which method is going to be called. Also, arrays of structs are usually faster than arrays of classes, because they require one less memory operation per array access.
  7. Don't use multi-dimensional arrays. While I prefer Foo[], even Foo[][] is normally faster than Foo[,].
  8. If you're copying data, prefer Buffer.BlockCopy over Array.Copy any day of the week. Also be cautious around strings: string operations can be a performance drain.
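As an illustration of point 5, here's a hypothetical bit-counting example (the names are made up for this sketch) where a 256-entry lookup table replaces a branchy loop with a single array load:

```csharp
using System;

static class LookupDemo
{
    // Branchy version: on random input the branch predictor fails constantly.
    static int BitCountBranchy(byte b)
    {
        int n = 0;
        for (int i = 0; i < 8; ++i)
            if ((b & (1 << i)) != 0) n++;
        return n;
    }

    // Table version: one array load, no data-dependent branches.
    static readonly byte[] Table = BuildTable();

    static byte[] BuildTable()
    {
        var t = new byte[256];
        for (int i = 0; i < 256; ++i)
            t[i] = (byte)BitCountBranchy((byte)i);
        return t;
    }

    static int BitCountTable(byte b) => Table[b];

    static void Main()
    {
        Console.WriteLine(BitCountTable(0xF0));  // 4
    }
}
```

The table is paid for once up front and then amortized over every lookup; whether it wins in practice depends on how hot the code is and whether the table stays in cache.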

There also used to be a guide called "Optimization for the Intel Pentium Processor" with a large number of tricks (like shifting or multiplying instead of dividing). While the compiler does a fine job nowadays, this sometimes still helps a bit.
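For illustration, the shift-instead-of-divide trick from that guide looks like this in C# (note the JIT already performs this for constant divisors, so it mostly matters when the compiler can't prove it safe):

```csharp
using System;

class ShiftDemo
{
    static void Main()
    {
        int x = 1234;
        Console.WriteLine(x / 8);   // general integer division
        Console.WriteLine(x >> 3);  // shift by 3 = divide by 8, same result here
        // Caution: the two only agree for non-negative x. For negative x,
        // '/' rounds toward zero while '>>' rounds toward negative infinity.
    }
}
```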

Of course these are just micro-optimizations; the biggest performance gains usually come from changing the algorithm and/or data structure. Be sure to check which options are available to you, and don't restrict yourself too much to the .NET framework. I also have a natural tendency to distrust the .NET implementation until I've checked the decompiled code myself; there's a ton of stuff that could have been implemented much faster (most of the time for good reasons).

HTH


Alex pointed out to me that Array.Copy is actually faster according to some people. And since I really don't know what has changed over the years, I decided that the only proper course of action was to create a fresh benchmark and put it to the test.

If you're just interested in the conclusion: in most cases the call to Buffer.BlockCopy clearly outperforms Array.Copy. Tested on an Intel Skylake with 16 GB memory (>10 GB free) on .NET 4.5.2.

Code:

using System;
using System.Diagnostics;

static void TestNonOverlapped1(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K];
    byte[] tmp2 = new byte[K];
    for (long i = 0; i < iter; ++i)
    {
        Array.Copy(tmp, tmp2, K);
    }
}

static void TestNonOverlapped2(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K];
    byte[] tmp2 = new byte[K];
    for (long i = 0; i < iter; ++i)
    {
        Buffer.BlockCopy(tmp, 0, tmp2, 0, K);
    }
}

static void TestOverlapped1(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K + 16];
    for (long i = 0; i < iter; ++i)
    {
        Array.Copy(tmp, 0, tmp, 16, K);
    }
}

static void TestOverlapped2(int K)
{
    long total = 1000000000;
    long iter = total / K;
    byte[] tmp = new byte[K + 16];
    for (long i = 0; i < iter; ++i)
    {
        Buffer.BlockCopy(tmp, 0, tmp, 16, K);
    }
}

static void Main(string[] args)
{
    for (int i = 0; i < 10; ++i)
    {
        int N = 16 << i;

        Console.WriteLine("Block size: {0} bytes", N);

        Stopwatch sw = Stopwatch.StartNew();

        {
            sw.Restart();
            TestNonOverlapped1(N);
            Console.WriteLine("Non-overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestNonOverlapped2(N);
            Console.WriteLine("Non-overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestOverlapped1(N);
            Console.WriteLine("Overlapped Array.Copy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        {
            sw.Restart();
            TestOverlapped2(N);
            Console.WriteLine("Overlapped Buffer.BlockCopy: {0:0.00} ms", sw.Elapsed.TotalMilliseconds);
            GC.Collect(GC.MaxGeneration);
            GC.WaitForFullGCComplete();
        }

        Console.WriteLine("-------------------------");
    }

    Console.ReadLine();
}

Results on x86 JIT:

Block size: 16 bytes
Non-overlapped Array.Copy: 4267.52 ms
Non-overlapped Buffer.BlockCopy: 2887.05 ms
Overlapped Array.Copy: 3305.01 ms
Overlapped Buffer.BlockCopy: 2670.18 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 1327.55 ms
Non-overlapped Buffer.BlockCopy: 763.89 ms
Overlapped Array.Copy: 2334.91 ms
Overlapped Buffer.BlockCopy: 2158.49 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 705.76 ms
Non-overlapped Buffer.BlockCopy: 390.63 ms
Overlapped Array.Copy: 1303.00 ms
Overlapped Buffer.BlockCopy: 1103.89 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 361.18 ms
Non-overlapped Buffer.BlockCopy: 219.77 ms
Overlapped Array.Copy: 620.21 ms
Overlapped Buffer.BlockCopy: 577.20 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 192.92 ms
Non-overlapped Buffer.BlockCopy: 108.71 ms
Overlapped Array.Copy: 347.63 ms
Overlapped Buffer.BlockCopy: 353.40 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 104.69 ms
Non-overlapped Buffer.BlockCopy: 65.65 ms
Overlapped Array.Copy: 211.77 ms
Overlapped Buffer.BlockCopy: 202.94 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 52.93 ms
Non-overlapped Buffer.BlockCopy: 38.84 ms
Overlapped Array.Copy: 144.39 ms
Overlapped Buffer.BlockCopy: 154.09 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 45.64 ms
Non-overlapped Buffer.BlockCopy: 30.11 ms
Overlapped Array.Copy: 118.33 ms
Overlapped Buffer.BlockCopy: 109.16 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 30.93 ms
Non-overlapped Buffer.BlockCopy: 30.72 ms
Overlapped Array.Copy: 119.73 ms
Overlapped Buffer.BlockCopy: 104.66 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 30.37 ms
Non-overlapped Buffer.BlockCopy: 26.63 ms
Overlapped Array.Copy: 90.46 ms
Overlapped Buffer.BlockCopy: 87.40 ms
-------------------------

Results on x64 JIT:

Block size: 16 bytes
Non-overlapped Array.Copy: 1252.71 ms
Non-overlapped Buffer.BlockCopy: 694.34 ms
Overlapped Array.Copy: 701.27 ms
Overlapped Buffer.BlockCopy: 573.34 ms
-------------------------
Block size: 32 bytes
Non-overlapped Array.Copy: 995.47 ms
Non-overlapped Buffer.BlockCopy: 654.70 ms
Overlapped Array.Copy: 398.48 ms
Overlapped Buffer.BlockCopy: 336.86 ms
-------------------------
Block size: 64 bytes
Non-overlapped Array.Copy: 498.86 ms
Non-overlapped Buffer.BlockCopy: 329.15 ms
Overlapped Array.Copy: 218.43 ms
Overlapped Buffer.BlockCopy: 179.95 ms
-------------------------
Block size: 128 bytes
Non-overlapped Array.Copy: 263.00 ms
Non-overlapped Buffer.BlockCopy: 196.71 ms
Overlapped Array.Copy: 137.21 ms
Overlapped Buffer.BlockCopy: 107.02 ms
-------------------------
Block size: 256 bytes
Non-overlapped Array.Copy: 144.31 ms
Non-overlapped Buffer.BlockCopy: 101.23 ms
Overlapped Array.Copy: 85.49 ms
Overlapped Buffer.BlockCopy: 69.30 ms
-------------------------
Block size: 512 bytes
Non-overlapped Array.Copy: 76.76 ms
Non-overlapped Buffer.BlockCopy: 55.31 ms
Overlapped Array.Copy: 61.99 ms
Overlapped Buffer.BlockCopy: 54.06 ms
-------------------------
Block size: 1024 bytes
Non-overlapped Array.Copy: 44.01 ms
Non-overlapped Buffer.BlockCopy: 33.30 ms
Overlapped Array.Copy: 53.13 ms
Overlapped Buffer.BlockCopy: 51.36 ms
-------------------------
Block size: 2048 bytes
Non-overlapped Array.Copy: 27.05 ms
Non-overlapped Buffer.BlockCopy: 25.57 ms
Overlapped Array.Copy: 46.86 ms
Overlapped Buffer.BlockCopy: 47.83 ms
-------------------------
Block size: 4096 bytes
Non-overlapped Array.Copy: 29.11 ms
Non-overlapped Buffer.BlockCopy: 25.12 ms
Overlapped Array.Copy: 45.05 ms
Overlapped Buffer.BlockCopy: 47.84 ms
-------------------------
Block size: 8192 bytes
Non-overlapped Array.Copy: 24.95 ms
Non-overlapped Buffer.BlockCopy: 21.52 ms
Overlapped Array.Copy: 43.81 ms
Overlapped Buffer.BlockCopy: 43.22 ms
-------------------------
Answered Oct 17 '22 by atlaste


You've exhausted the options added in .NET 4.5 to affect the jitted code directly. The next step is to look at the generated machine code to spot any obvious inefficiencies. Do so with the debugger, but first prevent it from disabling the optimizer: Tools + Options, Debugging, General, untick the "Suppress JIT optimization on module load" option. Set a breakpoint on the hot code, then use Debug + Disassembly to look at it.

There are not that many things to consider; the jitter's optimizer in general does an excellent job. One thing to look for is a failed attempt at eliminating an array bounds check; the fixed keyword is an unsafe workaround for that. A corner case is a failed attempt at inlining a method where the jitter doesn't use CPU registers effectively, an issue with the x86 jitter that can be worked around with MethodImplOptions.NoInlining. The optimizer is not terribly efficient at hoisting invariant code out of a loop, but that's something you'd almost always consider first when staring at the C# code looking for ways to optimize it.
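A minimal sketch of the fixed workaround mentioned above (a hypothetical method, requiring the project to allow unsafe code; note that for the plain i < data.Length loop pattern the jitter usually eliminates the bounds check on its own, so this only pays off in the cases where that analysis fails):

```csharp
using System;

class FixedDemo
{
    // Pinning the array lets us index through a raw pointer,
    // which bypasses the per-element bounds check.
    static unsafe long Sum(int[] data)
    {
        long sum = 0;
        fixed (int* p = data)
        {
            for (int i = 0; i < data.Length; ++i)
                sum += p[i];   // pointer access: no bounds check emitted
        }
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(Sum(new[] { 1, 2, 3, 4 }));  // 10
    }
}
```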

The most important thing to know is when you are done and just can't hope to make it any faster. You can only really get there by comparing apples to oranges: writing the hot code in native code using C++/CLI. Make sure that this code is compiled with #pragma unmanaged in effect so it gets the full optimizer love. There's a cost associated with switching from managed to native code execution, so do make sure the execution time of the native code is substantial enough. This is otherwise not necessarily easy to do, and you certainly won't have a guarantee of success. Albeit that knowing you are done can save you a lot of time stumbling into dead alleys.

Answered Oct 17 '22 by Hans Passant