Poor memcpy performance in user space for mmap'ed physical memory in Linux

Tags: linux, memory, mmap

Of the 192GB RAM installed in my computer, I have the 188GB above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel module accumulates data into this large area, used as a ring buffer, via DMA. A user space application mmap's the ring buffer into its address space and then copies blocks from the current ring buffer position for processing once they are ready.
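For reference, the reservation above comes from kernel boot parameters. A minimal sketch of how this might be configured with GRUB2 (an assumption, not from the original post; note that the $ would need to be escaped as \$ in /etc/default/grub, since that file is shell-sourced):

# /etc/default/grub (assumed excerpt); run update-grub and reboot afterwards.
# mem=4G caps the kernel's memory; memmap=188G$4G reserves 188GB at 0x100000000.
GRUB_CMDLINE_LINUX="mem=4G memmap=188G\$4G"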

Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which implements the mmap file operation:

module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* Map the reserved physical range straight into the caller's VMA. */
    if (remap_pfn_range(vma, vma->vm_start,
                        resmem_hwaddr >> PAGE_SHIFT,
                        resmem_length, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}
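For context, a handler like this would be exposed through the driver's file_operations; a minimal sketch of that wiring (an assumption, not part of the excerpt above):

static const struct file_operations resmem_fops = {
    .owner = THIS_MODULE,
    .mmap  = resmem_mmap,
    /* open, release and the RESMEM_IOC_LENGTH ioctl handler omitted */
};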

and a test application, which in essence does the following (with the checks removed):

#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
// Map the reserved area, starting at file offset 4096 (one page in).
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE,
                          MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
// Time a single 16MB copy out of the mapped ring buffer.
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f
           + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time
          << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;

I have carried out memcpy tests of a 16MB data block for different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:

| resmem_length | 1GB                   | 4GB                    | 16GB                   | 64GB                  | 128GB                 | 188GB                 |
|---------------|-----------------------|------------------------|------------------------|-----------------------|-----------------------|-----------------------|
| run 1         | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) | 9.326ms (1798.97MB/s) | 213.892ms (78.43MB/s) | 206.476ms (81.25MB/s) |
| run 2         | 4.255ms (3942.94MB/s) | 4.249ms (3948.51MB/s)  | 4.257ms (3941.09MB/s)  | 4.298ms (3903.49MB/s) | 208.269ms (80.55MB/s) | 200.627ms (83.62MB/s) |

My observations are:

  1. From the first to the second run, memcpy from the mmap'ed to the malloc'ed buffer seems to benefit from the contents already being cached somewhere (see the sketch after this list).

  2. There is a significant performance degradation when more than 64GB is reserved, which shows up in the memcpy timings.
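Regarding observation 1: remap_pfn_range installs the page table entries at mmap time, so the source side should not fault during the copy; the run-to-run difference is therefore more likely first-touch faults on the freshly allocated destination buffer plus CPU caching. A minimal sketch of how one might factor that out (an assumption, not part of the original test):

// Warm the destination and the caches before the timed copy, so that
// run 1 measures only the copy itself (assumes the run-1 penalty is
// first-touch faults on 'destination' and cold caches).
memset(destination, 0, BLOCKSIZE);      // fault in every destination page
memcpy(destination, source, BLOCKSIZE); // untimed warm-up copy
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE); // timed copy
gettimeofday(&end, NULL);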

I would like to understand why that is so. Perhaps somebody in the Linux kernel developers group thought: 64GB should be enough for anybody (does this ring a bell?)

Kind regards, peter

asked Apr 19 '12 by PeterW



2 Answers

Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
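One way to check from user space which NUMA node backs a given page is get_mempolicy with MPOL_F_NODE | MPOL_F_ADDR (declared in numaif.h, from the numactl development package; link with -lnuma). A minimal sketch, assuming an ordinary mapping such as the destination buffer; whether it succeeds on a remap_pfn_range region depends on the kernel:

#include <numaif.h>
#include <stdio.h>

// Hypothetical helper: print the NUMA node backing the page at addr.
static void print_node_of(void* addr)
{
    int node = -1;
    if (get_mempolicy(&node, NULL, 0, addr, MPOL_F_NODE | MPOL_F_ADDR) == 0)
        printf("%p is on NUMA node %d\n", addr, node);
}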

If I lock my application to CPU0, I can observe different memcpy speeds depending on which memory area was reserved and consequently mmap'ed. If the reserved memory area is attached to the other CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to or from the "remote" area takes considerably longer (data block size = 16MB):

resmem=64G$4G   (inside CPU0 realm):   3949MB/s  
resmem=64G$96G  (outside CPU0 realm):    82MB/s  
resmem=64G$128G (outside CPU0 realm):  3948MB/s
resmem=92G$4G   (inside CPU0 realm):   3966MB/s            
resmem=92G$100G (outside CPU0 realm):    57MB/s   

It nearly makes sense. Only the third case, 64G$128G, i.e. the uppermost 64GB, also yields good results, which somewhat contradicts the theory.
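For reference, the CPU0 pinning used for these measurements can be done with taskset from the command line or from within the program; a minimal sketch with sched_setaffinity (an assumption, the original post does not say which mechanism was used):

#define _GNU_SOURCE
#include <sched.h>

// Pin the calling process (pid 0 = self) to CPU 0 before benchmarking.
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set);
sched_setaffinity(0, sizeof(cpu_set_t), &set);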

Regards, peter

answered Sep 29 '22 by PeterW


Your CPU probably doesn't have enough cache to deal with it efficiently. Either use lower memory, or get a CPU with a bigger cache.

answered Sep 29 '22 by Ignacio Vazquez-Abrams