In my experience, .NET is 2 to 3 times slower than native code. (I implemented L-BFGS for multivariate optimization).
I have traced the ads on stackoverflow to http://www.centerspace.net/products/
the speed is really amazing, the speed is close to native code. How can they do that? They said that:
Q. Is NMath "pure" .NET?
A. The answer depends somewhat on your definition of "pure .NET". NMath is written in C#, plus a small Managed C++ layer. For better performance of basic linear algebra operations, however, NMath does rely on the native Intel Math Kernel Library (included with NMath). But there are no COM components, no DLLs--just .NET assemblies. Also, all memory allocated in the Managed C++ layer and used by native code is allocated from the managed heap.
Can someone explain more to me?
How can they do that?
Like most of the numerical libraries for .NET, NMath is little more than a wrapper over an Intel MKL embedded in the .NET assembly, probably by linking with C++/CLI to create a mixed assembly. You've probably just benchmarked those bits that are not actually written in .NET.
The F#.NET Journal articles Numerical Libraries: special functions, interpolation and random numbers (16th March 2008) and Numerical Libraries: linear algebra and spectral methods (16th April 2008) tested quite a bit of functionality and NMath was actually the slowest of all the commercial libraries. Their PRNG was slower than all others and 50% slower than the free Math.NET library, some basic functionality was missing (e.g. the ability to calculate Gamma(-0.5)
) and other basic functionality (the Gamma-related functions they did provide) was broken. Both Extreme Optimization and Bluebit beat NMath at the eigensolver benchmark. NMath didn't even provide a Fourier Transform at the time.
Even more surprisingly, the performance discrepancies were sometimes huge. The most expensive commercial numerical library we tested (IMSL) was over 500× slower than the free FFTW library at the FFT benchmark and none of the libraries made any use of multiple cores at the time.
In fact, it was precisely the poor quality of these libraries that encouraged us to commercialize our own F# for Numerics library (which is 100% pure F# code).
I am one of the lead developers of ILNumerics. So I am biased, obviously ;) But we are more disclosed regarding our internals, so I will give some insights into our speed 'secrets'.
It all depends on how system resources are utilized! If you are about pure speed and need to handle large arrays, you will make sure to (ordered by importance, most important first)
Manage your memory appropriately! 'Naive' memory management will lead to bad performance, since it stresses the GC badly, causes memory fragmentation and degrades memory locality (hence cache performance). In a garbage collected environment like .NET, this boils down to preventing from frequent memory allocations. In ILNumerics, we implemented a high performance memory pool in order to archieve this goal (and deterministic disposal of temporary arrays to get a nice, comfortable syntax without clumsy function semantics).
Utilize parallelism! This targets both: thread level parallelism and data level parallelism. Multiple cores are utilized by threading computation intensive parts of the calculation. On X86/X64 CPUs SIMD/multimedia extensions like SSE.XX and AVX allow a small but effective vectorization. They are not directly addressable by current .NET languages. And this is the only reason, why MKL may is still faster than 'pure' .NET code. (But solutions are already rising.)
To archieve the speed of highly optimized languages like FORTRAN and C++, the same optimizations must get applied to your code as done for them. C# offers the option do do so.
Note, these precautions should be followed in that order! It does not make sense to care about SSE extensions or even bound check removal, if the bottleneck is the memory bandwith and the processor(s) spend most the time waiting for new data. Also, for many simple operations it does not even pays of to invest huge efforts to archieve the very last tiny scale up to peak performance! Consider the common example of the LAPACK function DAXPY. It adds the elements of a vector X to the corresponding element of another vector Y. If this is done for the first time, all the memory for X and Y will have to get fetched from the main memory. There is little to nothing you can do about it. And memory is the bottleneck! So regardless if the addition at the end is done the naive way in C#
for (int i = 0; i < C.Length; i++) {
C[i] = X[i] + Y[i];
}
or done by using vectorization strategies - it will have to wait for memory!
I know, this answer does somehow 'over answers' the question, since most of these strategies are currently not utilized from the mentioned product (yet?). By following thoses points, you would eventually end up with much better performance than every naive implementation in a 'native' language.
If you are interested, you could disclose your implementation of L-BFGS? I'll be happy to convert it to ILNumerics and post comparison results and I am sure, other libraries listed here would like to follow. (?)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With