 

Why doesn't the JVM emit prefetch instructions on Windows x86

As the title states, why doesn't the OpenJDK JVM emit prefetch instructions on Windows x86? See the OpenJDK Mercurial repository: http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/windows_x86/vm/prefetch_windows_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {}
inline void Prefetch::write(void *loc, intx interval) {}

There are no comments, and I've found no other resources besides the source code. I am asking because it does emit them for Linux x86; see http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c49dcaf78a65/src/os_cpu/linux_x86/vm/prefetch_linux_x86.inline.hpp

inline void Prefetch::read (void *loc, intx interval) {
#ifdef AMD64
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));
#endif // AMD64
}

inline void Prefetch::write(void *loc, intx interval) {
#ifdef AMD64

  // Do not use the 3dnow prefetchw instruction.  It isn't supported on em64t.
  //  __asm__ ("prefetchw (%0,%1,1)" : : "r" (loc), "r" (interval));
  __asm__ ("prefetcht0 (%0,%1,1)" : : "r" (loc), "r" (interval));

#endif // AMD64
}
roookeee asked Jun 04 '17 14:06



2 Answers

As JDK-4453409 indicates, prefetching was implemented in the HotSpot JVM in JDK 1.4 to speed up GC. That was more than 15 years ago, and no one will remember now why it was not implemented on Windows. My guess is that Visual Studio (which has always been used to build HotSpot on Windows) simply did not understand the prefetch instruction at that time. Looks like a place for improvement.

Anyway, the code you've asked about is used internally by the JVM garbage collector. This is not what the JIT generates. The C2 JIT code generator rules are in the architecture definition file x86_64.ad, and there are rules to translate PrefetchRead, PrefetchWrite and PrefetchAllocation nodes to the corresponding x64 instructions.

An interesting fact is that PrefetchRead and PrefetchWrite nodes are not created anywhere in the code. They exist only to support the Unsafe.prefetchX intrinsics; however, these were removed in JDK 9.

The only case where the JIT generates a prefetch instruction is the PrefetchAllocation node. You can verify with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly that PREFETCHNTA is indeed generated after object allocation, on both Linux and Windows.

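import java.util.Arrays;

class Test {
    public static void main(String[] args) {
        byte[] b = new byte[0];
        // Allocate a slightly larger array on every iteration so the
        // allocation path (and its prefetches) gets JIT-compiled.
        for (;;) {
            b = Arrays.copyOf(b, b.length + 1);
        }
    }
}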

java.exe -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test

# {method} {0x00000000176124e0} 'main' '([Ljava/lang/String;)V' in 'Test'
  ...
  0x000000000340e512: cmp    $0x100000,%r11d
  0x000000000340e519: ja     0x000000000340e60f
  0x000000000340e51f: movslq 0x24(%rsp),%r10
  0x000000000340e524: add    $0x1,%r10
  0x000000000340e528: add    $0x17,%r10
  0x000000000340e52c: mov    %r10,%r8
  0x000000000340e52f: and    $0xfffffffffffffff8,%r8
  0x000000000340e533: cmp    $0x100000,%r11d
  0x000000000340e53a: ja     0x000000000340e496
  0x000000000340e540: mov    0x60(%r15),%rbp
  0x000000000340e544: mov    %rbp,%r9
  0x000000000340e547: add    %r8,%r9
  0x000000000340e54a: cmp    0x70(%r15),%r9
  0x000000000340e54e: jae    0x000000000340e496
  0x000000000340e554: mov    %r9,0x60(%r15)
  0x000000000340e558: prefetchnta 0xc0(%r9)
  0x000000000340e560: movq   $0x1,0x0(%rbp)
  0x000000000340e568: prefetchnta 0x100(%r9)
  0x000000000340e570: movl   $0x200000f5,0x8(%rbp)  ;   {metadata({type array byte})}
  0x000000000340e577: mov    %r11d,0xc(%rbp)
  0x000000000340e57b: prefetchnta 0x140(%r9)
  0x000000000340e583: prefetchnta 0x180(%r9)    ;*newarray
                                                ; - java.util.Arrays::copyOf@1 (line 3236)
                                                ; - Test::main@9 (line 9)
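Reading the displacements off the listing above, the four PREFETCHNTA instructions target 0xc0, 0x100, 0x140 and 0x180 bytes past the new allocation top in %r9, i.e. they start 192 bytes ahead and step by 64 bytes. A minimal sketch of that arithmetic follows. The function name and parameters are hypothetical; HotSpot does have flags named AllocatePrefetchDistance, AllocatePrefetchStepSize and AllocatePrefetchLines, but that they map one-to-one onto these three parameters is an assumption, not something the listing confirms.

```cpp
#include <vector>

// Hypothetical helper: computes the displacements of a run of prefetches
// that start `distance` bytes ahead and advance by `step` bytes per line.
std::vector<int> prefetch_offsets(int distance, int step, int lines) {
  std::vector<int> offsets;
  for (int i = 0; i < lines; ++i) {
    offsets.push_back(distance + i * step);
  }
  return offsets;
}
```

With distance 0xc0, step 0x40 and 4 lines this reproduces the displacements seen in the disassembly.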
apangin answered Oct 05 '22 22:10



The files you cited all contain inline assembler fragments, which HotSpot uses in its own C/C++ code (as apangin, the JVM expert, pointed out, mostly in GC code). And there is a real difference: the Linux, Solaris and BSD variants of x86_64 HotSpot have these prefetches, while on Windows they are disabled/unimplemented. That is partly strange and partly unexplainable, and it may make the JVM a bit slower on Windows (a few percent; more on platforms without hardware prefetch), though it still would not help Sun/Oracle sell more Solaris/Solaris paid support contracts. Ross also guessed that the MS C++ compiler may not support the inline asm syntax, but the _mm_prefetch intrinsic should work (who will open a JDK bug to add it to the file?).

HotSpot is a JIT, and JIT-compiled code is emitted (generated) by the JIT as bytes. (A JIT could copy code from its own functions into the generated code, or emit calls to support functions, but in HotSpot prefetches are emitted as bytes.) How can we find where they are emitted? A simple way is to use an online searchable copy of jdk8u (or better, a cross-reference such as metager), for example on GitHub: https://github.com/JetBrains/jdk8u_hotspot, and search for prefetch, prefetch emit, prefetchr or lir_prefetchr. There are some relevant results:

The actual bytes emitted by the JVM's C1 compiler / LIR are in jdk8u_hotspot/src/cpu/x86/vm/assembler_x86.cpp:
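To illustrate the _mm_prefetch idea, here is a minimal sketch of what a compiler-neutral version of the two functions could look like. It is not HotSpot code: PrefetchSketch, the intx typedef and the SKETCH_HAS_MM_PREFETCH macro are made up for this example, and the choice of _MM_HINT_T0 simply mirrors the prefetcht0 used in the Linux file. Like the Linux version, it only does anything on x86_64 and is a no-op elsewhere.

```cpp
#include <cstdint>

#if defined(_M_X64) || defined(__x86_64__)
  #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
  #define SKETCH_HAS_MM_PREFETCH 1
#else
  #define SKETCH_HAS_MM_PREFETCH 0
#endif

typedef intptr_t intx;  // stand-in for HotSpot's intx type (assumption)

// Hypothetical replacement for the empty Windows functions.
struct PrefetchSketch {
  static inline void read(void* loc, intx interval) {
#if SKETCH_HAS_MM_PREFETCH
    // Same address computation as "prefetcht0 (%0,%1,1)": loc + interval.
    _mm_prefetch(static_cast<const char*>(loc) + interval, _MM_HINT_T0);
#else
    (void)loc; (void)interval;  // no-op on other targets
#endif
  }

  static inline void write(void* loc, intx interval) {
#if SKETCH_HAS_MM_PREFETCH
    // The Linux file also falls back to prefetcht0 for writes.
    _mm_prefetch(static_cast<const char*>(loc) + interval, _MM_HINT_T0);
#else
    (void)loc; (void)interval;
#endif
  }
};
```

A prefetch is only a hint, so calling PrefetchSketch::read(p, 64) never changes the contents of memory; whether it actually helps GC throughput on Windows would have to be measured.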

void Assembler::prefetch_prefix(Address src) {
  prefix(src);
  emit_int8(0x0F);
}

void Assembler::prefetchnta(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetchr(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rax, src); // 0, src
}

void Assembler::prefetcht0(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rcx, src); // 1, src
}

void Assembler::prefetcht1(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rdx, src); // 2, src
}

void Assembler::prefetcht2(Address src) {
  NOT_LP64(assert(VM_Version::supports_sse(), "must support"));
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x18);
  emit_operand(rbx, src); // 3, src
}

void Assembler::prefetchw(Address src) {
  assert(VM_Version::supports_3dnow_prefetch(), "must support");
  InstructionMark im(this);
  prefetch_prefix(src);
  emit_int8(0x0D);
  emit_operand(rcx, src); // 1, src
}
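In the functions above, the rax/rcx/rdx/rbx arguments to emit_operand are not real register operands. As the "// 0, src" style comments hint, they only supply the number 0..3 that goes into the reg field of the ModRM byte and selects the prefetch variant after the 0x0F 0x18 opcode. A minimal standalone sketch of that encoding, for the simple base + disp32 addressing form only; encode_prefetch and PrefetchHint are hypothetical names, and this is not how HotSpot's emit_operand is written:

```cpp
#include <cstdint>
#include <vector>

// ModRM reg-field values, matching the "// N, src" comments in the listing.
enum PrefetchHint : uint8_t { NTA = 0, T0 = 1, T1 = 2, T2 = 3 };

// Encodes PREFETCHh [base + disp32] for 64-bit mode.
// base is the hardware register number 0..15 (rax=0, rcx=1, ..., r9=9).
// Returns an empty vector for cases this sketch does not handle:
// rsp/r12 as base (these need a SIB byte) or an out-of-range register.
std::vector<uint8_t> encode_prefetch(PrefetchHint hint, unsigned base, int32_t disp) {
  std::vector<uint8_t> out;
  if (base > 15 || (base & 7u) == 4) return out;

  if (base >= 8) out.push_back(0x41);  // REX.B extends the base register
  out.push_back(0x0F);                 // two-byte opcode escape
  out.push_back(0x18);                 // PREFETCHh opcode
  // ModRM: mod=10 (disp32 follows), reg=hint, rm=low 3 bits of base
  out.push_back(static_cast<uint8_t>(0x80 | (hint << 3) | (base & 7u)));

  uint32_t u = static_cast<uint32_t>(disp);
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<uint8_t>((u >> (8 * i)) & 0xFF));  // little-endian
  }
  return out;
}
```

For example, encode_prefetch(NTA, 9, 0xC0) yields the eight bytes 41 0F 18 81 C0 00 00 00, which is consistent with the 8-byte spacing of the prefetchnta 0xc0(%r9) instruction in apangin's disassembly.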

Usage in c1 LIR: src/share/vm/c1/c1_LIRAssembler.cpp

void LIR_Assembler::emit_op1(LIR_Op1* op) {
  switch (op->code()) { 
...
    case lir_prefetchr:
      prefetchr(op->in_opr());
      break;

    case lir_prefetchw:
      prefetchw(op->in_opr());
      break;

Now we know the opcodes lir_prefetchr and lir_prefetchw and can search for them (for example in an OpenGrok xref) to find the only use, in src/share/vm/c1/c1_LIR.cpp:

void LIR_List::prefetch(LIR_Address* addr, bool is_store) {
  append(new LIR_Op1(
            is_store ? lir_prefetchw : lir_prefetchr,
            LIR_OprFact::address(addr)));
}

Prefetch instructions are also defined in another place (for C2, as noted by apangin), src/cpu/x86/vm/x86_64.ad:

// Prefetch instructions. ...
instruct prefetchr( memory mem ) %{
  predicate(ReadPrefetchInstr==3);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHR $mem\t# Prefetch into level 1 cache" %}
  ins_encode %{
    __ prefetchr($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrNTA( memory mem ) %{
  predicate(ReadPrefetchInstr==0);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch into non-temporal cache for read" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT0( memory mem ) %{
  predicate(ReadPrefetchInstr==1);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# prefetch into L1 and L2 caches for read" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchrT2( memory mem ) %{
  predicate(ReadPrefetchInstr==2);
  match(PrefetchRead mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# prefetch into L2 caches for read" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchwNTA( memory mem ) %{
  match(PrefetchWrite mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

// Prefetch instructions for allocation.

instruct prefetchAlloc( memory mem ) %{
  predicate(AllocatePrefetchInstr==3);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHW $mem\t# Prefetch allocation into level 1 cache and mark modified" %}
  ins_encode %{
    __ prefetchw($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocNTA( memory mem ) %{
  predicate(AllocatePrefetchInstr==0);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHNTA $mem\t# Prefetch allocation to non-temporal cache for write" %}
  ins_encode %{
    __ prefetchnta($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT0( memory mem ) %{
  predicate(AllocatePrefetchInstr==1);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT0 $mem\t# Prefetch allocation to level 1 and 2 caches for write" %}
  ins_encode %{
    __ prefetcht0($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}

instruct prefetchAllocT2( memory mem ) %{
  predicate(AllocatePrefetchInstr==2);
  match(PrefetchAllocation mem);
  ins_cost(125);

  format %{ "PREFETCHT2 $mem\t# Prefetch allocation to level 2 cache for write" %}
  ins_encode %{
    __ prefetcht2($mem$$Address);
  %}
  ins_pipe(ialu_mem);
%}
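Taken together, the four allocation rules above pick the instruction purely from the value of AllocatePrefetchInstr in their predicates. A minimal sketch of that selection as a lookup; the function name is hypothetical and the mapping is read directly off the predicate and format lines of the rules, nothing more:

```cpp
#include <string>

// Hypothetical helper: returns the mnemonic that the matching
// prefetchAlloc* rule would print for a given AllocatePrefetchInstr value.
// Returns an empty string for values that no rule in the listing matches.
std::string alloc_prefetch_mnemonic(int allocate_prefetch_instr) {
  switch (allocate_prefetch_instr) {
    case 0:  return "PREFETCHNTA";  // prefetchAllocNTA
    case 1:  return "PREFETCHT0";   // prefetchAllocT0
    case 2:  return "PREFETCHT2";   // prefetchAllocT2
    case 3:  return "PREFETCHW";    // prefetchAlloc
    default: return "";
  }
}
```

The PREFETCHNTA seen in the PrintAssembly output earlier corresponds to the value 0 case.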
osgx answered Oct 05 '22 23:10
