memcpy is usually naive - certainly not the slowest way to copy memory around, but usually quite easy to beat with some loop unrolling, and you can go even further with assembler.
[I would make this a comment, but do not have enough reputation to do so.]
I have a similar system and see similar results, but can add a few data points:
memcpy
(i.e. convert to *p_dest-- = *p_src--
), then you may get much worse performance than for the forward direction (~637 ms for me). There was a change in memcpy()
in glibc 2.12 that exposed several bugs for calling memcpy
on overlapping buffers (http://lwn.net/Articles/414467/) and I believe the issue was caused by switching to a version of memcpy
that operates backwards. So, backward versus forward copies may explain the memcpy()
/memmove()
disparity.memcpy()
implementations switch to non-temporal stores (which are not cached) for large buffers (i.e. larger than the last level cache). I tested Agner Fog's version of memcpy (http://www.agner.org/optimize/#asmlib) and found that it was approximately the same speed as the version in glibc
. However, asmlib
has a function (SetMemcpyCacheLimit
) that allows setting the threshold above which non-temporal stores are used. Setting that limit to 8GiB (or just larger than the 1 GiB buffer) to avoid the non-temporal stores doubled performance in my case (time down to 176ms). Of course, that only matched the forward-direction naive performance, so it is not stellar.memcpy
(104ms). The RAM on the Haswell system is DDR3-1600 (same as the other systems).UPDATES
/proc/cpuinfo
, the cores were then clocked at 3 GHz. However, this oddly decreased memory performance by around 10%.This looks normal to me.
Managing 8x16GB ECC memory sticks with two CPUs is a much tougher job than a single CPU with 2x2GB. Your 16GB sticks are Double sided memory + they may have buffers + ECC (even disabled on motherboard level)... all that make data path to RAM much longer. You also have 2 CPUs sharing the ram, and even if you do nothing on the other CPU there is always little memory access. Switching this data require some additional time. Just look at the enormous performance lost on PCs that share some ram with graphic card.
Still your severs are really powerfull datapumps. I'm not sure duplicating 1GB happends very often in real life software, but I'm sure that your 128GBs are much faster than any hard drive, even best SSD and this is where you can take advantage of your servers. Doing the same test with 3GB will set your laptop on fire.
This looks like the perfect example of how an architecture based on commodity hardware could be much more efficient than big servers. How many consumer PCs could one afford with the money spent on these big servers ?
Thank you for your very detailed question.
EDIT : (took me so long to write this answer that I missed the graph part.)
I think the problem is about where the data is stored. Can you please compare this :
This way you'll see how memory controller handle memory blocks far away from each other. I think that your data is put on different zones of memory and it requires a switching operation at some point on the data path to talk with one zone then the other (there's such issue with double sided memory).
Also, are you ensuring that the thread is bound to one CPU ?
EDIT 2:
There are several kind of "zones" delimiter for memory. NUMA is one, but that's not the only one. For example two sided sticks require a flag to address one side or the other. Look on your graph how the performance degrade with big chunk of memory even on the laptop (wich has no NUMA). I'm not sure of this, but memcpy may use a hardware function to copy ram (a kind of DMA) and this chip must have less cache than your CPU, this could explain why dumb copy with CPU is faster than memcpy.
It's possible that some CPU improvements in your IvyBridge-based laptop contribute to this gain over the SandyBridge-based servers.
Page-crossing Prefetch - your laptop CPU would prefetch ahead the next linear page whenever you reach the end of the current one, saving you a nasty TLB miss every time. To try and mitigate that, try building your server code for 2M / 1G pages.
Cache replacement schemes also seem to have been improved (see an interesting reverse engineering here). If indeed this CPU uses a dynamic insertion policy, it would easily prevent your copied data from trying to thrash your Last-Level-Cache (which it can't use effectively anyway due to the size), and save the room for other useful caching like code, stack, page table data, etc..). To test this, you could try rebuilding your naive implementation using streaming loads/stores (movntdq
or similar ones, you can also use gcc builtin for that). This possibility may explain the sudden drop in large data-set sizes.
I believe some improvements were also made with string-copy as well (here), it may or may not apply here, depending on how your assembly code looks like. You could try benchmarking with Dhrystone to test if there's an inherent difference. This may also explain the difference between memcpy and memmove.
If you could get hold of an IvyBridge based server or a Sandy-Bridge laptop it would be simplest to test all these together.
I modified the benchmark to use the nsec timer in Linux and found similar variation on different processors, all with similar memory. All running RHEL 6. Numbers are consistent across multiple runs.
Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, L2/L3 256K/20M, 16 GB ECC
malloc for 1073741824 took 47us
memset for 1073741824 took 643841us
memcpy for 1073741824 took 486591us
Westmere E5645 @2.40 GHz, HT not enabled, dual 6-core, L2/L3 256K/12M, 12 GB ECC
malloc for 1073741824 took 54us
memset for 1073741824 took 789656us
memcpy for 1073741824 took 339707us
Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, L2 256K/8M, 12 GB ECC
malloc for 1073741824 took 126us
memset for 1073741824 took 280107us
memcpy for 1073741824 took 272370us
Here are results with inline C code -O3
Sandy Bridge E5-2648L v2 @ 1.90GHz, HT enabled, 256K/20M, 16 GB
malloc for 1 GB took 46 us
memset for 1 GB took 478722 us
memcpy for 1 GB took 262547 us
Westmere E5645 @2.40 GHz, HT not enabled, dual 6-core, 256K/12M, 12 GB
malloc for 1 GB took 53 us
memset for 1 GB took 681733 us
memcpy for 1 GB took 258147 us
Jasper Forest C5549 @ 2.53GHz, HT enabled, dual quad-core, 256K/8M, 12 GB
malloc for 1 GB took 67 us
memset for 1 GB took 254544 us
memcpy for 1 GB took 255658 us
For the heck of it, I also tried making the inline memcpy do 8 bytes at a time. On these Intel processors it made no noticeable difference. Cache merges all of the byte operations into the minimum number of memory operations. I suspect the gcc library code is trying to be too clever.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With