
Program exceeding theoretical memory transfer rate

I have a laptop with an Intel Core 2 Duo 2.4 GHz CPU and two 4 GB DDR3 modules at 1066 MHz.

I expected this memory to operate at a speed of 1067 MiB/sec, and since there are two channels, at a maximum of 2134 MiB/sec (provided the OS memory dispatcher allows it).

I made a tiny Java app to test that:

import java.util.Arrays;
import java.util.Random;

public class MemoryTest {

    private static final int size = 256 * 1024 * 1024; // 256 MiB
    private static final byte[] storage = new byte[size];

    private static final int s = 1024;      // 1 KiB per copy
    private static final int duration = 10; // 10 sec per test

    public static void main(String[] args) {
        // Test 1: copy a 1 KiB buffer to random positions inside storage
        long start = System.currentTimeMillis();
        Random rnd = new Random();
        byte[] buf1 = new byte[s];
        rnd.nextBytes(buf1);
        long count = 0;
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(buf1, 0, storage, (int) begin, s);
            ++count;
        }
        double totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        double speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");

        // Test 2: copy random 1 KiB blocks from storage into a local buffer
        byte[] buf2 = new byte[s];
        count = 0;
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < duration * 1000) {
            long begin = (long) (rnd.nextDouble() * (size - s));
            System.arraycopy(storage, (int) begin, buf2, 0, s);
            Arrays.fill(buf2, (byte) 0);
            ++count;
        }
        totalSeconds = (System.currentTimeMillis() - start) / 1000.0;
        speed = count * s / totalSeconds / 1024 / 1024;
        System.out.println(count * s + " bytes transferred in " + totalSeconds + " secs (" + speed + " MiB/sec)");
    }
}

I expected the result to be under 2134 MiB/sec; however, I got the following:

17530212352 bytes transferred in 10.0 secs (1671.811328125 MiB/sec)
31237926912 bytes transferred in 10.0 secs (2979.080859375 MiB/sec)

How is it possible that the speed was almost 3 GiB/sec?

DDR3 module photo

Antonio asked Jul 03 '15


2 Answers

There are multiple things at work here.

First of all: the formula for memory transfer rate of DDR3 is

memory clock rate
× 4  (for bus clock multiplier)
× 2  (for data rate)
× 64 (number of bits transferred)
/ 8  (number of bits/byte)
=    memory clock rate × 64 (in MB/s)

For DDR3-1066 (which is clocked at 133⅓ MHz), we obtain a theoretical memory bandwidth of 8533⅓ MB/s, or 8138.02083333... MiB/s, for single-channel, and 17066⅔ MB/s, or 16276.0416666... MiB/s, for dual-channel.
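
For reference, here is a small sketch of my own (not part of the original answer) that plugs the numbers into this formula; it reproduces the figures above for single- and dual-channel DDR3-1066:

public class Ddr3Bandwidth {
  public static void main(String[] args) {
    // Theoretical DDR3 bandwidth: memory clock × 4 (bus multiplier) × 2 (DDR)
    // × 64 bit bus width / 8 bits per byte, times the number of channels.
    double memoryClockMHz = 400.0 / 3.0; // 133⅓ MHz for DDR3-1066
    for (int channels = 1; channels <= 2; ++channels) {
      double mbPerSec = memoryClockMHz * 4 * 2 * 64 / 8 * channels;
      double mibPerSec = mbPerSec * 1_000_000 / (1024.0 * 1024.0);
      System.out.printf("DDR3-1066, %d channel(s): %.2f MB/s = %.2f MiB/s%n",
          channels, mbPerSec, mibPerSec);
    }
  }
}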

Second: transferring one big chunk of data is faster than transferring many small chunks.

Third: the test ignores caching effects, which can occur.

Fourth: for time measurements, one should use System.nanoTime(), which is more precise.

Here is a rewritten version of the test program.¹

import java.util.Random;

public class Main {

  public static void main(String... args) {
    final int SIZE = 1024 * 1024 * 1024;
    final int RUNS = 8;
    final int THREADS = 8;
    final int TSIZE = SIZE / THREADS;
    assert (TSIZE * THREADS == SIZE) : "THREADS must divide SIZE!";
    byte[] src = new byte[SIZE];
    byte[] dest = new byte[SIZE];
    Random r = new Random();
    long timeNano = 0;

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < RUNS; ++i) {
      System.out.print("Initializing src... ");
      for (int idx = 0; idx < SIZE; ++idx) {
        src[idx] = ((byte) r.nextInt(256));
      }
      System.out.println("done!");
      System.out.print("Starting test... ");
      for (int idx = 0; idx < THREADS; ++idx) {
        final int from = TSIZE * idx;
        threads[idx]
            = new Thread(() -> {
          System.arraycopy(src, from, dest, from, TSIZE); // each thread copies its own disjoint slice
        });
      }
      long start = System.nanoTime();
      for (int idx = 0; idx < THREADS; ++idx) {
        threads[idx].start();
      }
      for (int idx = 0; idx < THREADS; ++idx) {
        try {
          threads[idx].join();
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
      }
      timeNano += System.nanoTime() - start;
      System.out.println("done!");
    }
    double timeSecs = timeNano / 1_000_000_000d;

    System.out.println("Transfered " + (long) SIZE * RUNS
        + " bytes in " + timeSecs + " seconds.");

    System.out.println("-> "
        + ((long) SIZE * RUNS / timeSecs / 1024 / 1024 / 1024)
        + " GiB/s");
  }
}

This way, as much "other computation" as possible is kept out of the measurement, so that (almost) only the memory copy rate via System.arraycopy(...) is measured. This algorithm may still have issues with regard to caching.

For my system (dual-channel DDR3-1600), I get around 6 GiB/s, whereas the theoretical limit is around 25 GiB/s (dual-channel included).

As Nick Mertin pointed out, the JVM introduces some overhead, so it is expected that you will not reach the theoretical limit.


¹ Sidenote: To run the program, one must give the JVM more heap space (via -Xmx). In my case, 4096 MB were sufficient.

Turing85 answered Sep 17 '22


Your testing method is ill-designed in many respects, as is your interpretation of the RAM rating.

Let's start with the rating. Since the introduction of SDRAM, marketing has named modules after their bus specification, that is, the bus clock frequency paired with the burst transfer rate. That is the best case, and in practice it cannot be sustained continuously.

Parameters omitted from that label are the actual access time (a.k.a. latency) and the total cycle time (a.k.a. precharge time). These can be figured out by looking at the "timing" specs (the 2-3-3 stuff). Look up an article that explains those timings in detail. Also note that the CPU does not normally transfer single bytes, but entire cache lines (e.g. 8 transfers of 8 bytes = 64 bytes).

Your testing code is ill-designed, as you are doing random accesses with relatively tiny blocks that are not aligned to actual data boundaries. This random access also incurs frequent page misses in the MMU (look up what the TLB is and does). So you are measuring a wild mixture of different system aspects.
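
To see how much the access pattern alone matters, here is a minimal sketch of my own (an assumed micro-benchmark, not code from this answer) that copies the same total volume twice: once as many small copies to random, unaligned offsets, and once as a single sequential block. It is a single pass without JIT warm-up, so it only shows the trend:

import java.util.Random;

public class AccessPatternDemo {
    public static void main(String[] args) {
        final int SIZE = 256 * 1024 * 1024; // 256 MiB; run with e.g. -Xmx1g
        final int CHUNK = 1024;             // 1 KiB, as in the question
        byte[] src = new byte[SIZE];
        byte[] dst = new byte[SIZE];
        Random rnd = new Random(42);

        // Many small copies between random, unaligned offsets:
        // poor spatial locality, frequent cache and TLB misses.
        long t0 = System.nanoTime();
        for (int i = 0; i < SIZE / CHUNK; ++i) {
            int from = rnd.nextInt(SIZE - CHUNK);
            int to = rnd.nextInt(SIZE - CHUNK);
            System.arraycopy(src, from, dst, to, CHUNK);
        }
        long randomNanos = System.nanoTime() - t0;

        // One sequential copy of the same total volume:
        // cache-line and prefetcher friendly.
        t0 = System.nanoTime();
        System.arraycopy(src, 0, dst, 0, SIZE);
        long seqNanos = System.nanoTime() - t0;

        System.out.printf("random 1 KiB copies: %.1f ms%n", randomNanos / 1e6);
        System.out.printf("sequential copy:     %.1f ms%n", seqNanos / 1e6);
    }
}

On a typical machine the sequential copy comes out noticeably faster; the exact ratio depends on the cache hierarchy, the TLB and the hardware prefetcher.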

Durandal answered Sep 21 '22