Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C - fastest method to swap two memory blocks of equal size?

What is the fastest way to swap two non-overlapping memory areas of equal size? Say, I need to swap (t_Some *a) with (t_Some *b). Considering space-time trade-off, will increased temporary space improve the speed? For example, (char *tmp) vs (int *tmp)? I am looking for a portable solution.

Prototype:

void swap_elements_of_array(void* base, size_t size_of_element, int a, int b);
like image 358
psihodelia Avatar asked Nov 17 '11 11:11

psihodelia


2 Answers

The fastest way to move a block of memory is going to be memcpy() from <string.h>. If you memcpy() from a to temp, memmove() from b to a, then memcpy() from temp to b, you’ll have a swap that uses the optimized library routines, which the compiler probably inlines. You wouldn’t want to copy the entire block at once, but in vector-sized chunks.

In practice, if you write a tight loop, the compiler can probably tell that you’re swapping every element of the arrays and optimize accordingly. On most modern CPUs, you want to generate vector instructions. It might be able to generate faster code if you make sure all three buffers are aligned.

However, what you really want to do is make things easier for the optimizer. Take this program:

#include <stddef.h>

void swap_blocks_with_loop( void* const a, void* const b, const size_t n )
{
  unsigned char* p;
  unsigned char* q;
  unsigned char* const sentry = (unsigned char*)a + n;

  for ( p = a, q = b; p < sentry; ++p, ++q ) {
     const unsigned char t = *p;
     *p = *q;
     *q = t;
  }
}

If you translate that into machine code as literally written, it’s a terrible algorithm, copying one byte at a time, doing two increments per iteration, and so on. In practice, though, the compiler sees what you’re really trying to do.

In clang 5.0.1 with -std=c11 -O3, it produces (in part) the following inner loop on x86_64:

.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movups  (%rcx,%rax), %xmm0
        movups  16(%rcx,%rax), %xmm1
        movups  (%rdx,%rax), %xmm2
        movups  16(%rdx,%rax), %xmm3
        movups  %xmm2, (%rcx,%rax)
        movups  %xmm3, 16(%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        movups  %xmm1, 16(%rdx,%rax)
        movups  32(%rcx,%rax), %xmm0
        movups  48(%rcx,%rax), %xmm1
        movups  32(%rdx,%rax), %xmm2
        movups  48(%rdx,%rax), %xmm3
        movups  %xmm2, 32(%rcx,%rax)
        movups  %xmm3, 48(%rcx,%rax)
        movups  %xmm0, 32(%rdx,%rax)
        movups  %xmm1, 48(%rdx,%rax)
        addq    $64, %rax
        addq    $2, %rsi
        jne     .LBB0_7

Whereas gcc 7.2.0 with the same flags also vectorizes, unrolling the loop less:

.L7:
        movdqa  (%rcx,%rax), %xmm0
        addq    $1, %r9
        movdqu  (%rdx,%rax), %xmm1
        movaps  %xmm1, (%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %r9, %rbx
        ja      .L7

Convincing the compiler to produce instructions that work on a single word at a time, instead of vectorizing the loop, is the opposite of what you want!

like image 147
Davislor Avatar answered Sep 18 '22 07:09

Davislor


If the 2 memory areas are large and fit in integer number of memory pages then you can swap their Page Table Entries in order to swap their contents without using memcpy() or XORs.

In theory, with two large 2MiB pages, you need to write only 16 bytes of paging structures to swap their mapping in the virtual address space ...and hence their contents, too.

1GiB pages are possible on x86-64 CPUs in 64-bit mode and content of 2 such 1GiB memory blocks can also be swapped with writing only several bytes of paging structures.

The caveat of this method is that the access to paging structures requires Kernel Mode privileges or using shared memory mapping functions from User Mode.

With recent Meltdown patches (KPTI), transitioning to Kernel Mode from User Mode has become much more expensive. Probably too expensive to make 4kiB memory page swapps competitive with memcpy()...but if you have 2MB or larger memory blocks to swap, then swapping their Paging Structures is faster.

like image 24
George Robinson Avatar answered Sep 18 '22 07:09

George Robinson