Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

gcc inline assembly using modifier "P" and constraint "p" over "m" in Linux kernel

I'm reading Linux kernel source code (3.12.5 x86_64) to understand how process descriptor is handled.

I found to get current process descriptor I could use current_thread_info() function, which is implemented as follows:

static inline struct thread_info *current_thread_info(void)
{
    struct thread_info *ti;
    ti = (void *)(this_cpu_read_stable(kernel_stack) +
         KERNEL_STACK_OFFSET - THREAD_SIZE);
    return ti;
}

Then I looked into this_cpu_read_stable():

#define this_cpu_read_stable(var)       percpu_from_op("mov", var, "p" (&(var)))

#define percpu_from_op(op, var, constraint) \
({ \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
...
case 8: \
    asm(op "q "__percpu_arg(1)",%0" \
    : "=r" (pfo_ret__) \
    : constraint); \
    break; \
default: __bad_percpu_size(); \
} \
pfo_ret__; \
})

#define __percpu_arg(x)         __percpu_prefix "%P" #x

#ifdef CONFIG_SMP
#define __percpu_prefix "%%"__stringify(__percpu_seg)":"
#else
#define __percpu_prefix ""
#endif

#ifdef CONFIG_X86_64
#define __percpu_seg gs
#else
#define __percpu_seg fs
#endif

The expanded macro should be inline asm code like this:

asm("movq %%gs:%P1,%0" : "=r" (pfo_ret__) : "p"(&(kernel_stack))); 

According to this post the input constraint used to be "m"(kernel_stack), which makes sense to me. But obviously to improve performance Linus changed the constraint to "p" and passed the address of variable:

It uses a "p" (&var) constraint instead of a "m" (var) one, to make gcc 
think there is no actual "load" from memory. This obviously _only_ works 
for percpu variables that are stable within a thread, but 'current' and 
'kernel_stack' should be that way.

Also in post Tejun Heo made this comments:

Added the magical undocumented "P" modifier to UP __percpu_arg()
to force gcc to dereference the pointer value passed in via the
"p" input constraint.  Without this, percpu_read_stable() returns
the address of the percpu variable.  Also added comment explaining
the difference between percpu_read() and percpu_read_stable().

But my experiments with combining modifier "P" modifier and constraint "p(&var)" did not work. If segment register is not specified, "%P1" always returns the address of the variable. The pointer was not dereferenced. I have to use a bracket to dereference it, like "(%P1)". If segment register is specified, without bracket gcc won't even compile. My test code is as follows:

#include <stdio.h>

#define current(var) ({\
        typeof(var) pfo_ret__;\
        asm(\
                "movq %%es:%P1, %0\n"\
                : "=r"(pfo_ret__)\
                : "p" (&(var))\
        );\
        pfo_ret__;\
        })

int main () {
        struct foo {
                int field1;
                int field2;
        } a = {
                .field1 = 100,
                .field2 = 200,
        };
        struct foo *var = &a;

        printf ("field1: %d\n", current(var)->field1);
        printf ("field2: %d\n", current(var)->field2);

        return 0;
}

Is there anything wrong with my code? Or do I need to append some options for gcc? Also when I used gcc -S to generate assembly code I didn't see optimization by using "p" over "m". Any answer or comments is much appreciated.

like image 464
silmerusse Avatar asked Jan 07 '14 06:01

silmerusse


1 Answers

The reason why your example code doesn't work is because the "p" constraint is only of a very limited use in inline assembly. All inline assembly operands have the requirement that they be representable as an operand in assembly language. If the operand isn't representable than compiler makes it so by moving it to a register first and substituting that as the operand. The "p" constraint places an additional restriction: the operand must be a valid address. The problem is that a register isn't a valid address. A register can contain an address but a register is not itself an valid address.

That means the operand of the "p" constraint must be have a valid assembly representation as is and be a valid address. You're trying to use the address of a variable on the stack as the operand. While this is a valid address, it's not a valid operand. The stack variable itself has a valid representation (something like 8(%rbp)), but the address of the stack variable doesn't. (If it were representable it would be something like 8 + %rbp, but this isn't a legal operand.)

One of the few things that you can take the address of and use as an operand with the "p" constraint is a statically allocated variable. In this case it's a valid assembly operand, as it can be represented as an immediate value (eg. &kernel_stack can be represented as $kernel_stack). It's also a valid address and so satisfies the constraint.

So that's why Linux kernel macro works and you macro doesn't. You're trying to use it with stack variables, while the kernel only uses it with statically allocated variables.

Or at least what looks like a statically allocated variabvle to the compiler. In fact kernel_stack is actually allocated in a special section used for per CPU data. This section doesn't actually exist, instead it's used as a template to create a separate region of memory for each CPU. The offset of kernel_stack in this special section is used as the offset in each per CPU data region to store a separate kernel stack value for each CPU. The FS or GS segment register is used as the base of this region, each CPU using a different address as the base.

So that's why the Linux kernel use inline assembly to access what otherwise looks like a static variable. The macro is used to turn the static variable into a per CPU variable. If you're not trying to do something like this then you probably don't have anything to gain by copying from the kernel macro. You should probably be considering a different way to do what you're trying accomplish.

Now if you're thinking since Linus Torvalds has come with this optimization in the kernel to replace an "m" constraint with a "p" it must be a good idea to do this generally, you should be very aware how fragile this optimization is. What its trying to do is fool GCC into thinking that reference to kernel_stack doesn't actually access memory, so that it won't keep reloading the value every time it changes memory. The danger here is that if kernel_stack does change then the compiler will be fooled, and continue to use the old value. Linus knows when and how the per CPU variables are changed, and so can be confident that the macro is safe when used for its intended purpose in the kernel.

If you want eliminate redundant loads in your own code, I suggest using -fstrict-aliasing and/or the restrict keyword. That way you're not dependant on a fragile and non-portable inline assembly macros.

like image 97
Ross Ridge Avatar answered Oct 17 '22 21:10

Ross Ridge