Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I write code to hint to the JVM to use vector operations?

Tags:

Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

Preface: I am trying to do this in pure java (no JNI to C++, no GPGPU work, etc...). I have profiled and the bulk of the processing time is coming from the math operations in this method (which is probably 95% floating point math and 5% integer math). I've already reduced all Math.xxx() calls to an approximation that's good enough so most of the math is now floating point multiplies with a few adds.

I have some code that deals with audio processing. I've been making tweaks and have already come across great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (at least with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying my hand at a manual unroll of 4 (which is starting to get very complicated since I am unrolling both loops of a nested loop) I am wondering if there's anything I can do to hint to the jvm that at runtime it can use vector operations (e.g. SSE2, AVX, etc...). Each sample of the audio can be calculated completely independently of other samples, which is why I've been able to see a 25% improvement already (reducing the amount of dependencies on floating point calculations).

For example, I have 4 floats, one for each of the 4 unrolls of the loop to hold a partially computed value. Does how I declare and use these floats matter? If I make it a float[4] does that hint to the jvm that they are unrelated to each other vs having float,float,float,float or even a class of 4 public floats? Is there something I can do without meaning to that will kill my chance at code being vectorized?

I've come across articles online about writing code "normally" because the compiler/jvm knows the common patterns and how to optimize them and deviating from the patterns can mean less optimization. At least in this case however, I wouldn't have expected unrolling the loops by 2 to have improved performance by as much as it did so I'm wondering if there's anything else I can do (or at least not do) to help my chances. I know that the compiler/jvm are only going to get better so I also want to be wary of doing things that will hurt me in the future.

Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the jvm supported it (or perhaps already is using them).

Thanks!

like image 793
user450775 Avatar asked May 03 '14 22:05

user450775


People also ask

Does Java have vectorization?

In Java, currently vectorization is not done by the programmer1, but it's done automatically by the compiler. The compiler takes in standard Java bytecode and automatically determines which part can be transformed to vector instructions.

Does Java use SIMD?

While you can use SIMD intrinsics in languages closer to the metal such as C/C++, no such option exists in Java so far. Note this doesn't mean Java wouldn't take advantage of SIMD at all: the JIT compiler can auto-vectorize code in specific situations, i.e. transforming code from a loop into vectorized code.


2 Answers

How can I..audio processing..pure java (no JNI to C++, no GPGPU work, etc...)..use vector operations (e.g. SSE2, AVX, etc...)

Java is high level language (one instruction in Java generates many hardware instructions) which is by-design (e.g. garbage collector memory management) not suitable for tasks that manipulate high data volumes in real time.

There are usually special pieces of hardware optimized for particular role (e.g. image processing or speech recognition) that many times utilize parallelization through several simplified processing pipelines.

There are also special programming languages for this sort of tasks, mainly hardware description languages and assembly language.

Even C++ (considered the fast language) will not automagically use some super optimized hardware operations for you. It may just inline one of several hand-crafted assembly language methods at certain places.

So my answer is that there is "probably no way" to instruct JVM to use some hardware optimization for your code (e.g. SSE) and even if there was some then the Java language runtime would still have too many other factors that will slow-down your code.

Use a low-level language designed for this task and link it to the Java for high-level logic.

EDIT: adding some more info based on comments

If you are convinced that high-level "write once run anywhere" language runtime definitely should also do lots of low level optimizations for you and turn automagically your high-level code into optimized low-level code then...the way JIT compiler optimizes depends on the implementation of the Java Virtual Machine. There are many of them.

In case of Oracle JVM (HotSpot) you can start looking for your answer by downloading the source code, text SSE2 appears in following files:

  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
  • openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
  • openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
  • openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
  • openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
  • openjdk/hotspot/src/share/vm/runtime/globals.hpp

They're in C++ and assembly language so you will have to learn some low level languages to read them anyway.

I would not hunt that deep even with +500 bounty. IMHO the question is wrong based on wrong assumptions

like image 89
xmojmr Avatar answered Sep 28 '22 14:09

xmojmr


SuperWord optimizations on Hotspot are limited and quite fragile. Limited since they are generally behind what a C/C++ compiler offers, and fragile since they depend on particular loop shapes (and are only supported for certain CPUs).

I understand you want to write once run anywhere. It sounds like you already have a pure Java solution. You might want to consider an optional implementation for known popular platforms to supplement that implementation to "fast in some places" which is already true probably.

It's hard to give you more concrete feedback with some code. I suggest you take the loop in question and present it in a JMH benchmark. This makes it easy to analyze and discuss.

like image 22
Nitsan Wakart Avatar answered Sep 28 '22 14:09

Nitsan Wakart