
memcpy moving 128 bits in Linux

I'm writing a device driver in Linux for a PCIe device. This driver performs several reads and writes to test the throughput. When I use memcpy, the maximum payload for a TLP is 8 bytes (on 64-bit architectures). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this, but the code doesn't compile (AT&T/Intel syntax issue).

  • Is there a way to use that code inside Linux?
  • Does anyone know where I can find an implementation of memcpy that moves 128 bits at a time?
asked by haster8558


2 Answers

First of all, you're probably using GCC as the compiler, and it uses the asm statement for inline assembler. When using that, you have to put the assembler code in a string literal (which is copied verbatim into the generated assembly before being sent to the assembler, so the string should contain newline characters to separate instructions).

Second, you will probably have to use AT&T syntax for the assembler.

Third, GCC uses extended asm operands to pass variables between assembler and C.
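
Putting those three points together, a minimal (user-space-style) sketch of a 16-byte copy might look like this; it is not from the original answer, and the caveat about SSE registers in the kernel, mentioned further down, still applies:

/* Sketch only: the code is a string literal with explicit newlines,
 * written in AT&T syntax, with extended-asm operands passing the
 * C pointers in. */
static inline void copy16(void *dst, const void *src)
{
    asm volatile("movdqu (%0), %%xmm0\n\t"   /* 128-bit load into xmm0  */
                 "movdqu %%xmm0, (%1)"       /* 128-bit store from xmm0 */
                 :
                 : "r"(src), "r"(dst)
                 : "xmm0", "memory");
}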

Fourth, you should probably avoid inline assembler when possible anyway, as the compiler can't schedule instructions past an asm statement (at least this used to be true). Instead you could make use of GCC extensions like the vector_size attribute:

typedef float v4sf __attribute__((vector_size(16)));   /* a 16-byte (128-bit) vector type */

/* Copies 64 bytes as four 16-byte vector loads followed by four 16-byte stores. */
void fubar( v4sf *p, v4sf* q )
{
  v4sf p0 = *p++;
  v4sf p1 = *p++;
  v4sf p2 = *p++;
  v4sf p3 = *p++;

  *q++ = p0;
  *q++ = p1;
  *q++ = p2;
  *q++ = p3;
}

This has the advantage that the compiler will still produce code even if you compile for a processor that doesn't have the SSE (xmm) registers, but perhaps some other 128-bit registers (or doesn't have vector registers at all).

Fifth, you should check whether the provided memcpy actually isn't fast enough; memcpy is often heavily optimized.

Sixth, take precautions if you're using special registers in the Linux kernel: there are registers the kernel doesn't save and restore for you, and the SSE registers are among them.

Seventh, as you're using this to test throughput, you should consider whether the processor is a significant bottleneck in the equation. Compare the actual execution of the code with the reads from/writes to RAM (do you hit or miss the cache?) and the reads from/writes to the peripheral.

Eighth, when moving data you should avoid moving big chunks from RAM to RAM, and if it's to/from a peripheral with limited bandwidth you should definitely consider using DMA. Remember that if access time is what limits performance, the CPU will still be considered busy (even though it can't run at 100% speed).

answered by skyking


Leaving this answer here for now, even though it's now clear the OP just wants a single 16B transfer. On Linux, the OP's code is causing two 8B transfers over the PCIe bus.

For writing to MMIO space, it's worth trying the movnti write-combining (non-temporal) store instruction. The source operand for movnti is a general-purpose register, not a vector register.

You can probably generate that with intrinsics, if you #include <immintrin.h> in your driver code. That should be fine in the kernel, as long as you're careful about which intrinsics you use; the header doesn't define any globals.
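
For example, a sketch of what that might look like (the function name mmio_write_16 is just for illustration; _mm_stream_si64 is the intrinsic that compiles to movnti, and the destination is assumed to be an 8-byte-aligned pointer into a write-combining mapping of the BAR):

#include <stdint.h>
#include <immintrin.h>   /* _mm_stream_si64 -> movnti (x86-64, SSE2) */

/* Two non-temporal 8-byte stores into the MMIO mapping; whether the WC
 * buffer merges them into a single 16-byte TLP is up to the hardware. */
static void mmio_write_16(void *dst, const uint64_t *src)
{
    _mm_stream_si64((long long *)dst,     (long long)src[0]);
    _mm_stream_si64((long long *)dst + 1, (long long)src[1]);
    _mm_sfence();   /* drain/order the WC buffer before notifying the device */
}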


So most of the rest of this answer isn't very relevant to that.

On most CPUs (where rep movs is good), Linux's memcpy uses it. It only falls back to an explicit loop on CPUs where rep movsq or rep movsb are not good choices.

When the size is a compile-time constant, memcpy has an inline implementation using rep movsl (AT&T syntax for rep movsd), then non-rep movsw and movsb for cleanup if needed. (Actually kinda clunky, IMO, since the size is a compile-time constant. It also doesn't take advantage of fast rep movsb on CPUs that have it.)
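
For comparison, the general rep-movs pattern for a non-constant size can be expressed in GNU C extended asm roughly like this (a simplified sketch, not the kernel's actual code):

#include <stddef.h>

/* Copy len/8 qwords with rep movsq, then the remaining 0-7 bytes with
 * rep movsb.  The "D", "S" and "c" constraints pin the operands to
 * rdi, rsi and rcx, which is what the rep movs instructions require. */
static inline void *rep_movs_copy(void *dst, const void *src, size_t len)
{
    void *ret = dst;
    size_t qwords = len >> 3;
    size_t tail   = len & 7;

    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)
                 :
                 : "memory");
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(tail)
                 :
                 : "memory");
    return ret;
}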

Intel CPUs since P6 have had at least fairly good rep movs implementations. See Andy Glew's comments on it.

But still, you're wrong about memcpy only moving in 64-bit blocks, unless I'm misreading the code or you're on a platform where it decides to use the fallback loop.

Anyway, I don't think you're missing out on much perf by using the normal Linux memcpy, unless you've actually single-stepped your code and seen it doing something silly.

For large copies, you'll want to set up DMA anyway. CPU usage by your driver is important, not just the max throughput you can obtain on an otherwise-idle system. (Be careful of trusting microbenchmarks too much.)


Using SSE in the kernel means saving/restoring the vector registers. It's worth it for the RAID5/RAID6 code. That code may only run from a dedicated thread, rather than from contexts where the vector/FPU registers still have another process's data.

Linux's memcpy can be used from any context, so it avoids using anything but the usual integer registers. I did find an article about an SSE kernel memcpy patch, where Andi Kleen and Ingo Molnar both say it wouldn't be good to always use SSE for memcpy. Maybe there could be a special bulk-memcpy for big copies where it's worth saving the vector regs.

You can use SSE in the kernel, but you have to wrap it in kernel_fpu_begin() and kernel_fpu_end(). On Linux 3.7 and later, kernel_fpu_end() actually does the work of restoring FPU state, so don't use a lot of fpu_begin/fpu_end pairs in one function. Also note that kernel_fpu_begin() disables preemption, and you must not "do anything that might fault or sleep".
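
A hedged sketch of that pattern (the function name is made up, and the 16-byte copy is just a stand-in for "some SSE work" between the begin/end calls):

#include <asm/fpu/api.h>   /* kernel_fpu_begin()/kernel_fpu_end(); <asm/i387.h> on older kernels */
#include <immintrin.h>     /* SSE2 intrinsics, as mentioned above */

static void copy_128bit_in_kernel(void *dst, const void *src)
{
    kernel_fpu_begin();                                 /* save FPU/SSE state, disable preemption */
    __m128i v = _mm_loadu_si128((const __m128i *)src);  /* one 16-byte load  */
    _mm_storeu_si128((__m128i *)dst, v);                /* one 16-byte store */
    kernel_fpu_end();                                   /* restore FPU/SSE state */
}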

In theory, saving just one vector reg, like xmm0, would be good. You'd have to make sure you used SSE, not AVX instructions, because you need to avoid zeroing the upper part of ymm0 / zmm0. You might cause an AVX+SSE stall when you return to code that was using ymm regs. Unless you want to do a full save of the vector regs, you can't run vzeroupper. And even to do that, you'd need to detect AVX support...

However, doing even this one-reg save/restore would require you to take the same precautions as kernel_fpu_begin, and disable pre-emption. Since you'd be storing to your own private save slot (prob. on the stack), rather than to task_struct.thread.fpu, I'm not sure that even disabling pre-emption is enough to guarantee that user-space FPU state won't be corrupted. Maybe it is, but maybe it isn't, and I'm not a kernel hacker. Disabling interrupts to guard against this, too, is probably worse than just using kernel_fpu_begin()/kernel_fpu_end() to trigger a full FPU state save using XSAVE/XRSTOR.

answered by Peter Cordes