
Program exceeding theoretical memory transfer rate

I have a laptop with an Intel Core 2 Duo 2.4 GHz CPU and two 4 GB DDR3 modules at 1066 MHz.

I expected this memory to operate at a speed of 1067 MiB/sec, and since there are two channels, at a maximum of 2134 MiB/sec (provided the OS memory dispatcher allows it).

I made a tiny Java app to test that:

import java.util.Arrays;
import java.util.Random;

public class MemoryTest {

    private static final int size = 256 * 1024 * 1024; // 256 MiB
    private static final byte[] storage = new byte[size];

    private static final int s = 1024;      // 1 KiB per copy
    private static final int duration = 10; // 10 sec per test

    public static void main(String[] args) {
        // Test 1: copy a 1 KiB buffer to random positions inside storage
        long start = System.currentTimeMillis();
        Random rnd = new Random();
        byte[] buf1 = new byte[s];
        rnd.nextBytes(buf1);
        long count = 0;
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(buf1, 0, storage, (int) begin, s);
            ++count;
        }
        double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        double speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

        // Test 2: copy random 1 KiB blocks from storage into a local buffer
        byte[] buf2 = new byte[s];
        count = 0;
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(storage, (int) begin, buf2, 0, s);
            Arrays.fill(buf2, (byte) 0);
            ++count;
        }
        totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
    }
}

I expected the result to be under 2134 MiB/sec; however, I got the following:

17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)

How is it possible that the speed was almost 3 GiB/sec?

DDR3 module photo

Antonio asked Jul 03 '15


2 Answers

There are multiple things at work here.

First of all: the formula for memory transfer rate of DDR3 is

memory clock rate
× 4  (for bus clock multiplier)
× 2  (for data rate)
× 64 (number of bits transferred)
/ 8  (number of bits/byte)
=    memory clock rate × 64 (in MB/s)

For DDR3-1066 (which is clocked at 133⅓ MHz), we obtain a theoretical memory bandwidth of 8533⅓ MB/s, or 8138.02083333... MiB/s, for single-channel, and 17066⅔ MB/s, or 16276.0416666... MiB/s, for dual-channel.
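
For reference, here is a small sketch of my own (not part of the original answer) that plugs the numbers into this formula; it reproduces the figures above for single- and dual-channel DDR3-1066:

public class Ddr3Bandwidth {
  public static void main(String[] args) {
    // Theoretical DDR3 bandwidth: memory clock × 4 (bus multiplier) × 2 (DDR)
    // × 64 bit bus width / 8 bits per byte, times the number of channels.
    double memoryClockMHz = 400.0 / 3.0; // 133⅓ MHz for DDR3-1066
    for (int channels = 1; channels <= 2; ++channels) {
      double mbPerSec = memoryClockMHz * 4 * 2 * 64 / 8 * channels;
      double mibPerSec = mbPerSec * 1_000_000 / (1024.0 * 1024.0);
      System.out.printf("DDR3-1066, %d channel(s): %.2f MB/s = %.2f MiB/s%n",
          channels, mbPerSec, mibPerSec);
    }
  }
}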

Second: transferring one big chunk of data is faster than transferring many small chunks.

Third: the test ignores caching effects, which can occur.

Fourth: for time measurements, one should use System.nanoTime(), which is more precise.

Here is a rewritten version of the test program.¹

import java.util.Random;

public class Main {

  public static void main(String... args) {
    final int SIZE = 1024 * 1024 * 1024;
    final int RUNS = 8;
    final int THREADS = 8;
    final int TSIZE = SIZE / THREADS;
    assert (TSIZE * THREADS == SIZE) : "THREADS must divide SIZE!";
    byte[] src = new byte[SIZE];
    byte[] dest = new byte[SIZE];
    Random r = new Random();
    long timeNano = 0;

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < RUNS; ++i) {
      System.out.print("Initializing src... ");
      for (int idx = 0; idx < SIZE; ++idx) {
        src[idx] = ((byte) r.nextInt(256));
      }
      System.out.println("done!");
      System.out.print("Starting test... ");
      for (int idx = 0; idx < THREADS; ++idx) {
        final int from = TSIZE * idx;
        threads[idx]
            = new Thread(() -> {
          System.arraycopy(src, from, dest, from, TSIZE); // each thread copies its own disjoint slice
        });
      }
      long start = System.nanoTime();
      for (int idx = 0; idx < THREADS; ++idx) {
        threads[idx].start();
      }
      for (int idx = 0; idx < THREADS; ++idx) {
        try {
          threads[idx].join();
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
      }
      timeNano += System.nanoTime() - start;
      System.out.println("done!");
    }
    double timeSecs = timeNano / 1_000_000_000d;

    System.out.println("Transfered " + (long) SIZE * RUNS
        + " bytes in " + timeSecs + " seconds.");

    System.out.println("-> "
        + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
        + " GiB/s");
  }
}

This way, as much "other computation" as possible is kept out of the measurement, so that (almost) only the memory copy rate via System.arraycopy(...) is measured. This algorithm may still have issues with regard to caching.

For my system (dual-channel DDR3-1600), I get around 6 GiB/s, whereas the theoretical limit is around 25 GiB/s (dual-channel included).

As Nick Mertin pointed out, the JVM introduces some overhead, so it is expected that you will not reach the theoretical limit.


¹ Sidenote: To run the program, one must give the JVM more heap space (via -Xmx). In my case, 4096 MB were sufficient.

Turing85 answered Sep 17 '22


Your testing method is ill-designed in many respects, as is your interpretation of the RAM rating.

Let's start with the rating. Since the introduction of SDRAM, marketing has named modules after their bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case, and in practice it cannot be sustained continuously.

Parameters omitted from that label are the actual access time (a.k.a. latency) and the total cycle time (a.k.a. precharge time). These can be figured out by looking at the "timing" specs (the 2-3-3 stuff). Look up an article that explains those timings in detail. Also note that the CPU does not normally transfer single bytes, but entire cache lines (e.g. 8 transfers of 8 bytes = 64 bytes).

Your testing code is ill-designed, as you are doing random accesses with relatively tiny blocks that are not aligned to actual data boundaries. This random access also incurs frequent page misses in the MMU (look up what the TLB is and does). So you are measuring a wild mixture of different system aspects.
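
To see how much the access pattern alone matters, here is a minimal sketch of my own (an assumed micro-benchmark, not code from this answer) that copies the same total volume twice: once as many small copies to random, unaligned offsets, and once as a single sequential block. It is a single pass without JIT warm-up, so it only shows the trend:

import java.util.Random;

public class AccessPatternDemo {
    public static void main(String[] args) {
        final int SIZE = 256 * 1024 * 1024; // 256 MiB; run with e.g. -Xmx1g
        final int CHUNK = 1024;             // 1 KiB, as in the question
        byte[] src = new byte[SIZE];
        byte[] dst = new byte[SIZE];
        Random rnd = new Random(42);

        // Many small copies between random, unaligned offsets:
        // poor spatial locality, frequent cache and TLB misses.
        long t0 = System.nanoTime();
        for (int i = 0; i < SIZE / CHUNK; ++i) {
            int from = rnd.nextInt(SIZE - CHUNK);
            int to = rnd.nextInt(SIZE - CHUNK);
            System.arraycopy(src, from, dst, to, CHUNK);
        }
        long randomNanos = System.nanoTime() - t0;

        // One sequential copy of the same total volume:
        // cache-line and prefetcher friendly.
        t0 = System.nanoTime();
        System.arraycopy(src, 0, dst, 0, SIZE);
        long seqNanos = System.nanoTime() - t0;

        System.out.printf("random 1 KiB copies: %.1f ms%n", randomNanos / 1e6);
        System.out.printf("sequential copy:     %.1f ms%n", seqNanos / 1e6);
    }
}

On a typical machine the sequential copy comes out noticeably faster; the exact ratio depends on the cache hierarchy, the TLB and the hardware prefetcher.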

Durandal answered Sep 21 '22