
Lua coroutines -- setjmp longjmp clobbering?

In a blog post from not too long ago, Scott Vokes describes a technical problem with Lua's implementation of coroutines, which uses the C functions setjmp and longjmp:

The main limitation of Lua coroutines is that, since they are implemented with setjmp(3) and longjmp(3), you cannot use them to call from Lua into C code that calls back into Lua that calls back into C, because the nested longjmp will clobber the C function’s stack frames. (This is detected at runtime, rather than failing silently.)

I haven’t found this to be a problem in practice, and I’m not aware of any way to fix it without damaging Lua’s portability, one of my favorite things about Lua — it will run on literally anything with an ANSI C compiler and a modest amount of space. Using Lua means I can travel light. :)

I have used coroutines a fair amount and I thought I understood broadly what was going on and what setjmp and longjmp do, however I read this at some point and realized that I didn't really understand it. To try to figure it out, I tried to make a program that I thought should cause a problem based on the description, and instead it seems to work fine.

However, I've seen a few other places where people allege that there are problems:

  • http://coco.luajit.org/
  • http://lua-users.org/lists/lua-l/2005-03/msg00179.html

The question is:

  • Under what circumstances do lua coroutines fail to work because of C function stack frames getting clobbered?
  • What exactly is the result? Does "detected at runtime" mean, lua panic? Or something else?
  • Does this still affect the most recent versions of lua (5.3) or is this actually a 5.1 issue or something?

Here is the code I produced. In my test, it is linked against Lua 5.3.1 (compiled as C code), and the test itself is compiled as C++ code under the C++11 standard.

extern "C" {
#include <lauxlib.h>
#include <lua.h>
}

#include <cassert>
#include <iostream>

#define CODE(C) \
case C: { \
  std::cout << "When returning to " << where << " got code '" #C "'" << std::endl; \
  break; \
}

void handle_resume_code(int code, const char * where) {
  switch (code) {
    CODE(LUA_OK)
    CODE(LUA_YIELD)
    CODE(LUA_ERRRUN)
    CODE(LUA_ERRMEM)
    CODE(LUA_ERRERR)
    default:
      std::cout << "An unknown error code in " << where << std::endl;
  }
}

int trivial(lua_State *, int, lua_KContext) {
  std::cout << "Called continuation function" << std::endl;
  return 0;
}

int f(lua_State * L) {
  std::cout << "Called function 'f'" << std::endl;
  return 0;
}

int g(lua_State * L) {
  std::cout << "Called function 'g'" << std::endl;

  lua_State * T = lua_newthread(L);
  lua_getglobal(T, "f");

  handle_resume_code(lua_resume(T, L, 0), __func__);
  return lua_yieldk(L, 0, 0, trivial);
}

int h(lua_State * L) {
  std::cout << "Called function 'h'" << std::endl;

  lua_State * T = lua_newthread(L);
  lua_getglobal(T, "g");

  handle_resume_code(lua_resume(T, L, 0), __func__);
  return lua_yieldk(L, 0, 0, trivial);
}

int main () {
  std::cout << "Starting:" << std::endl;

  lua_State * L = luaL_newstate();

  // init
  {
    lua_pushcfunction(L, f);
    lua_setglobal(L, "f");

    lua_pushcfunction(L, g);
    lua_setglobal(L, "g");

    lua_pushcfunction(L, h);
    lua_setglobal(L, "h");
  }

  assert(lua_gettop(L) == 0);

  // Some action
  {
    lua_State * T = lua_newthread(L);
    lua_getglobal(T, "h");

    handle_resume_code(lua_resume(T, nullptr, 0), __func__);
  }

  lua_close(L); 

  std::cout << "Bye! :-)" << std::endl;
}

The output I get is:

Starting:
Called function 'h'
Called function 'g'
Called function 'f'
When returning to g got code 'LUA_OK'
When returning to h got code 'LUA_YIELD'
When returning to main got code 'LUA_YIELD'
Bye! :-)

Many thanks to @Nicol Bolas for the very detailed answer!
After reading his answer, reading the official docs, reading some emails and playing around with it some more, I want to refine the question / ask a specific follow-up question, however you want to look at it.

I think this term 'clobbering' is not good for describing this issue and this was part of what confused me -- nothing is being "clobbered" in the sense of being written to twice and the first value being lost, the issue is solely, as @Nicol Bolas points out, that longjmp tosses part of the C stack, and if you are hoping to restore the stack later, too bad.

The issue is actually described very nicely in section 4.7 of lua 5.2 manual, in a link provided by @Nicol Bolas.

Curiously, there is no equivalent section in the lua 5.1 documentation. However, lua 5.2 has this to say about lua_yieldk:

Yields a coroutine.

This function should only be called as the return expression of a C function, as follows:

return lua_yieldk (L, n, i, k);

Lua 5.1 manual says something similar, about lua_yield instead:

Yields a coroutine.

This function should only be called as the return expression of a C function, as follows:

return lua_yield (L, nresults);

Some natural questions then:

  • Why does it matter if I use return here or not? If lua_yieldk will call longjmp then the lua_yieldk will never return anyways, so it shouldn't matter if I return then? So that cannot be what is happening, right?
  • Supposing instead that lua_yieldk just makes a note within the lua state that the current C api call has stated that it wants to yield, and then when it finally does return, lua will figure out what happens next. Then this solves the problem of saving C stack frames, no? Since after we return to lua normally, those stack frames have expired anyways -- so the complications described in @Nicol Bolas picture are skirted around? And second of all, in 5.2 at least the semantics are never that we should restore C stack frames, it seems -- lua_yieldk resumes to a continuation function, not to the lua_yieldk caller, and lua_yield apparently resumes to the caller of the current api call, not to the lua_yield caller itself.

And, the most important question:

If I consistently use lua_yieldk in the form return lua_yieldk(...) specified in the docs, returning from a lua_CFunction that was passed to lua, is it still possible to trigger the attempt to yield across a C-call boundary error?

Finally, (but this is less important), I would like to see a concrete example of what it looks like when a naive programmer "isn't careful" and triggers the attempt to yield across a C-call boundary error. I get the idea that there could be problem associated to setjmp and longjmp tossing stack frames that we later need, but I want to see some real lua / lua c api code that I can point to and say "for instance, don't do that", and this is surprisingly elusive.

I found this email where someone reported this error with some lua 5.1 code, and I attempted to reproduce it in lua 5.3. However what I found was that, this looks like just poor error reporting from the lua implementation -- the actual bug is being caused because the user is not setting up their coroutine properly. The proper way to load the coroutine is, create the thread, push a function onto the thread stack, and then call lua_resume on the thread state. Instead the user was using dofile on the thread stack, which executes the function there after loading it, rather than resuming it. So it is effectively yield outside of a coroutine iiuc, and when I patch this, his code works fine, using both lua_yield and lua_yieldk in lua 5.3.

Here is the listing I produced:

#include <cassert>
#include <cstdio>

extern "C" {
#include "lua.h"
#include "lauxlib.h"
}

//#define USE_YIELDK

bool running = true;

int lua_print(lua_State * L) {
  if (lua_gettop(L)) {
    printf("lua: %s\n", lua_tostring(L, -1));
  }
  return 0;
}

int lua_finish(lua_State *L) {
  running = false;
  printf("%s called\n", __func__);
  return 0;
}

int trivial(lua_State *, int, lua_KContext) {
  printf("%s called\n", __func__);
  return 0;
}

int lua_sleep(lua_State *L) {
  printf("%s called\n", __func__);
#ifdef USE_YIELDK
  printf("Calling lua_yieldk\n");
  return lua_yieldk(L, 0, 0, trivial);
#else
  printf("Calling lua_yield\n");
  return lua_yield(L, 0);
#endif
}

const char * loop_lua =
"print(\"loop.lua\")\n"
"\n"
"local i = 0\n"
"while true do\n"
"  print(\"lua_loop iteration\")\n"
"  sleep()\n"
"\n"
"  i = i + 1\n"
"  if i == 4 then\n"
"    break\n"
"  end\n"
"end\n"
"\n"
"finish()\n";

int main() {
  lua_State * L = luaL_newstate();

  lua_pushcfunction(L, lua_print);
  lua_setglobal(L, "print");

  lua_pushcfunction(L, lua_sleep);
  lua_setglobal(L, "sleep");

  lua_pushcfunction(L, lua_finish);
  lua_setglobal(L, "finish");

  lua_State* cL = lua_newthread(L);
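  // Keep the call outside assert(): with NDEBUG defined, assert() compiles
  // away together with anything inside it.
  int load_status = luaL_loadstring(cL, loop_lua);
  assert(load_status == LUA_OK);
  (void)load_status;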
  /*{
    int result = lua_pcall(cL, 0, 0, 0);
    if (result != LUA_OK) {
      printf("%s error: %s\n", result == LUA_ERRRUN ? "Runtime" : "Unknown", lua_tostring(cL, -1));
      return 1;
    }
  }*/
  // ^ This pcall (predictably) causes an error -- if we try to execute the
  // script, it is going to call things that attempt to yield, but we did not
  // start the script with lua_resume, we started it with pcall, so it's not
  // okay to yield.
  // The reported error is "attempt to yield across a C-call boundary", but what
  // is really happening is just "yield from outside a coroutine" I suppose...

  while (running) {
    int status;
    printf("Waking up coroutine\n");
    status = lua_resume(cL, L, 0);
    if (status == LUA_YIELD) {
      printf("coroutine yielding\n");
    } else {
      running = false; // you can't try to resume if it didn't yield

      if (status == LUA_ERRRUN) {
        printf("Runtime error: %s\n", lua_isstring(cL, -1) ? lua_tostring(cL, -1) : "(unknown)" );
        lua_pop(cL, 1); // lua_pop takes a count of elements, not a stack index
        break;
      } else if (status == LUA_OK) {
        printf("coroutine finished\n");
      } else {
        printf("Unknown error\n");
      }
    }
  }

  lua_close(L);
  printf("Bye! :-)\n");
  return 0;
}

Here is the output when USE_YIELDK is commented out:

Waking up coroutine
lua: loop.lua
lua: lua_loop iteration
lua_sleep called
Calling lua_yield
coroutine yielding
Waking up coroutine
lua: lua_loop iteration
lua_sleep called
Calling lua_yield
coroutine yielding
Waking up coroutine
lua: lua_loop iteration
lua_sleep called
Calling lua_yield
coroutine yielding
Waking up coroutine
lua: lua_loop iteration
lua_sleep called
Calling lua_yield
coroutine yielding
Waking up coroutine
lua_finish called
coroutine finished
Bye! :-)

Here is the output when USE_YIELDK is defined:

Waking up coroutine
lua: loop.lua
lua: lua_loop iteration
lua_sleep called
Calling lua_yieldk
coroutine yielding
Waking up coroutine
trivial called
lua: lua_loop iteration
lua_sleep called
Calling lua_yieldk
coroutine yielding
Waking up coroutine
trivial called
lua: lua_loop iteration
lua_sleep called
Calling lua_yieldk
coroutine yielding
Waking up coroutine
trivial called
lua: lua_loop iteration
lua_sleep called
Calling lua_yieldk
coroutine yielding
Waking up coroutine
trivial called
lua_finish called
coroutine finished
Bye! :-)

asked Dec 16 '15 03:12 by Chris Beck


2 Answers

Think about what happens when a coroutine does a yield. It stops executing, and processing returns to whoever called resume on that coroutine, correct?

Well, let's say you have this code:

function top()
    coroutine.yield()
end

function middle()
    top()
end

function bottom()
    middle()
end

local co = coroutine.create(bottom);

coroutine.resume(co);

At the moment of the call to yield, the Lua stack looks like this:

-- top
-- middle
-- bottom
-- yield point

When you call yield, the Lua call stack that is part of the coroutine is preserved. When you do resume, the preserved call stack is executed again, starting where it left off before.

OK, now let's say that middle was in fact not a Lua function. Instead, it was a C function, and that C function calls the Lua function top. So conceptually, your stack looks like this:

-- Lua - top
-- C   - middle
-- Lua - bottom
-- Lua - yield point

Now, please note what I said before: this is what your stack looks like conceptually.

Because your actual call stack looks nothing like this.

In reality, there are really two stacks. There is Lua's internal stack, defined by a lua_State. And there's C's stack. Lua's internal stack, at the time when yield is about to be called, looks something like this:

-- top
-- Some C stuff
-- bottom
-- yield point

So what does the stack look like to C? Well, it looks like this:

-- arbitrary Lua interpreter stuff
-- middle
-- arbitrary Lua interpreter stuff
-- setjmp

And that right there is the problem. See, when Lua does a yield, it's going to call longjmp. That function is based on the behavior of the C stack. Namely, it's going to return to where setjmp was.

The Lua stack will be preserved because the Lua stack is separate from the C stack. But the C stack? Everything between the longjmp and the setjmp? Gone. Kaput. Lost forever.

Now you may go, "wait, doesn't the Lua stack know that it went into C and back into Lua"? A bit. But the Lua stack is incapable of doing something that C is incapable of. And C is simply not capable of preserving a stack (well, not without special libraries). So while the Lua stack is vaguely aware that some kind of C process happened in the middle of its stack, it has no way to reconstitute what was there.
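
To see the C-stack half of this in isolation, here is a minimal sketch without Lua. The names resume_point, top, middle and run_once are made up for the illustration, and the longjmp only plays the role that a yield plays inside the interpreter; this is not how Lua itself is laid out.

```cpp
#include <csetjmp>
#include <string>

// Hypothetical names for illustration; none of this is Lua's actual code.
static std::jmp_buf resume_point;
static std::string trace;  // global, so longjmp skips no automatic objects

static void top() {
  trace += "top;";
  std::longjmp(resume_point, 1);  // stands in for the yield
}

static void middle() {
  trace += "middle-before;";
  top();
  trace += "middle-after;";  // never runs: this frame is discarded by longjmp
}

// Stands in for resume: records where control ends up after the jump.
std::string run_once() {
  trace.clear();
  if (setjmp(resume_point) == 0) {
    middle();
    trace += "returned-normally;";
  } else {
    trace += "jumped-back;";  // control lands here, with middle's frame gone
  }
  return trace;
}
```

Calling run_once() yields "middle-before;top;jumped-back;". The "middle-after;" step is simply lost, and nothing in standard C or C++ can bring middle's frame back so that it could run later.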

So what happens if you resume this yielded coroutine?

Nasal demons. And nobody likes those. Fortunately, Lua 5.1 and above (at least) will error whenever you attempt to yield across C.

Note that Lua 5.2+ does have ways of fixing this. But it's not automatic; it requires explicit coding on your part.

When Lua code that is in a coroutine calls your C code, and your C code calls Lua code that may yield, you can use lua_callk or lua_pcallk to call the possibly-yielding Lua functions. These calling functions take an extra parameter: a "continuation" function.

If the Lua code you call does yield, then the lua_*callk function won't ever actually return (since your C stack will have been destroyed). Instead, it will call the continuation function you provided in your lua_*callk function. As you can guess by the name, the continuation function's job is to continue where your previous function left off.

Now, Lua does preserve the stack for your continuation function, so it gets the stack in the same state that your original C function was in. Well, except that the function+arguments that you called (with lua_*callk) are removed, and the return values from that function are pushed onto your stack. Outside of that, the stack is all the same.
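
The shape of that mechanism can be modelled in a few lines of plain C++. This is a toy under loud assumptions: ToyCoroutine, toy_callk, toy_start and toy_resume are hypothetical names, an exception stands in for the longjmp, and a vector of strings stands in for the preserved Lua stack. It is not the Lua API, only the control flow.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: every name here is hypothetical and none of it is Lua API.
struct ToyYield {};  // thrown to unwind the C++ stack, standing in for longjmp

struct ToyCoroutine;
using ToyFn = std::function<void(ToyCoroutine&)>;

struct ToyCoroutine {
  std::vector<std::string> log;  // stands in for the preserved Lua stack
  ToyFn pending_k;               // continuation recorded by toy_callk
};

// In the spirit of lua_callk: record the continuation, then make the call.
void toy_callk(ToyCoroutine& co, const ToyFn& callee, ToyFn k) {
  co.pending_k = std::move(k);
  callee(co);              // may throw ToyYield, discarding the caller's frame
  co.pending_k = nullptr;  // callee returned normally, continuation not needed
}

void finish_middle(ToyCoroutine& co) { co.log.push_back("middle: after call"); }

void top(ToyCoroutine& co) {
  co.log.push_back("top: yielding");
  throw ToyYield{};
}

void middle(ToyCoroutine& co) {
  co.log.push_back("middle: before call");
  toy_callk(co, top, finish_middle);
  finish_middle(co);  // reached only when the callee did not yield
}

// First resume: returns true if the body yielded.
bool toy_start(ToyCoroutine& co) {
  try {
    middle(co);
    return false;
  } catch (const ToyYield&) {
    return true;
  }
}

// Later resume: middle's frame is gone, so run the saved continuation instead.
void toy_resume(ToyCoroutine& co) {
  if (co.pending_k) {
    ToyFn k = std::move(co.pending_k);
    co.pending_k = nullptr;
    k(co);
  }
}
```

After toy_start and then toy_resume, the log holds "middle: before call", "top: yielding", "middle: after call". The last entry was written by finish_middle, not by middle, which is the whole point: anything the continuation needs has to live in the preserved state, because middle's local variables no longer exist.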

There is also lua_yieldk. This allows your C function to yield back to Lua, such that when the coroutine is resumed, it calls the provided continuation function.

Note that Coco gives Lua 5.1 the ability to resolve this problem. It is capable (through OS/assembly/etc. magic) of preserving the C stack during a yield operation. LuaJIT versions before 2.0 also provided this feature.


C++ note

You marked your question with the C++ tag, so I'll assume that's involved here.

Among the many differences between C and C++ is the fact that C++ depends far more on the nature of its call stack than C does. In C, if you discard a stack, you might lose resources that weren't cleaned up. C++, however, is required to call the destructors of objects declared on the stack at some point. The standard does not allow you to just throw them away.

So continuations only work in C++ if there is nothing on the stack which needs to have a destructor call. Or more specifically, only types that are trivially destructible can be sitting on the stack if you call any of the continuation function Lua APIs.

Of course, Coco handles C++ just fine, since it's actually preserving the C++ stack.
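
One way to make the "trivially destructible only" rule checkable is a compile-time assertion on the types you keep as locals in such functions. The helper frame_safe_to_discard and the two structs below are hypothetical names for illustration; only std::is_trivially_destructible comes from the standard library.

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper: true when discarding a stack frame that holds a T
// cannot skip any destructor work.
template <typename T>
constexpr bool frame_safe_to_discard() {
  return std::is_trivially_destructible<T>::value;
}

// Example local-variable types, made up for this illustration.
struct PlainCounters {
  int calls;
  double seconds;
};

struct NamedCounters {
  std::string name;  // owns heap memory, so its destructor must run
  int calls;
};

static_assert(frame_safe_to_discard<PlainCounters>(),
              "plain data may sit on the stack across a yield");
static_assert(!frame_safe_to_discard<NamedCounters>(),
              "a std::string member needs its destructor called");
```

A type like NamedCounters would have to be kept somewhere that survives the yield (a userdata, for instance) rather than in a local variable of the C++ function making the call.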

answered Oct 04 '22 17:10 by Nicol Bolas


Posting this as an answer which complements @Nicol Bolas' answer, and so that I can have space to write down what it took for me to understand the original question, and the answers to the secondary questions / a code listing.

If you read Nicol Bolas' answer but still have questions like I did, here are some additional hints:

  • The three layers on the call stack, Lua, C, Lua, are essential to the problem. If you only have two layers, Lua and C, you don't get the problem.
  • In imagining how the coroutine call is supposed to work -- the lua stack looks a certain way, the C stack looks a certain way, the call yields (longjmp) and later is resumed... the problem does not happen immediately when it is resumed.
    The problem happens when the resumed function later tries to return, to your C function.
    Because, for the coroutine semantics to work out, it is supposed to return into a C function call, but the stack frames for that are gone, and cannot be restored.
  • The workaround for this lack of ability to restore those stack frames is to use lua_callk, lua_pcallk, which allow you to provide a substitute function which can be called in place of that C function whose frames were wiped out.
  • The issue about return lua_yieldk(...) appears to have nothing to do with any of this. From skimming the implementation of lua_yieldk it appears that it does indeed always longjmp, and it may only return in some obscure case involving lua debugging hooks (?).
  • Lua internally (in the current version) tracks when yielding is not allowed with a counter nny (number of non-yieldable calls) stored in the Lua state. When you call lua_call or lua_pcall from a C API function (a lua_CFunction that you earlier pushed to Lua), nny is incremented, and it is decremented only when that call or pcall returns. While nny is nonzero it is not safe to yield, and you get the "attempt to yield across a C-call boundary" error if you try to yield anyway.
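
The counter in the last bullet can be pictured with a small model. This is a sketch of the bookkeeping idea only: ToyState, toy_yield, toy_call_no_yield and toy_call_yieldable are hypothetical names, and the real interpreter does considerably more than this.

```cpp
#include <functional>
#include <stdexcept>

// Toy model only: ToyState and the toy_* functions are hypothetical.
struct ToyState {
  unsigned nny = 0;      // number of non-yieldable calls currently active
  bool yielded = false;
};

using ToyCallee = std::function<void(ToyState&)>;

// Yielding is refused while any non-yieldable call is in progress.
void toy_yield(ToyState& s) {
  if (s.nny > 0) {
    throw std::runtime_error("attempt to yield across a C-call boundary");
  }
  s.yielded = true;
}

// Raises nny for the duration of a call and lowers it again afterwards,
// including when the callee exits with an error.
struct NonYieldableScope {
  ToyState& s;
  explicit NonYieldableScope(ToyState& st) : s(st) { ++s.nny; }
  ~NonYieldableScope() { --s.nny; }
};

// Models lua_call / lua_pcall made from a C function: the callee may not yield.
void toy_call_no_yield(ToyState& s, const ToyCallee& callee) {
  NonYieldableScope scope(s);
  callee(s);
}

// Models lua_callk / lua_pcallk with a continuation: the counter is untouched.
void toy_call_yieldable(ToyState& s, const ToyCallee& callee) {
  callee(s);
}
```

With a fresh ToyState, toy_call_no_yield(s, toy_yield) throws with the familiar message and leaves nny back at zero, while toy_call_yieldable(s, toy_yield) sets s.yielded. That mirrors the difference between the call, pcall and pcallk runs of the listing.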

Here is a simple listing that produces the problem and reports the errors, if you are like me and like to have concrete code examples. It demonstrates some of the differences between using lua_call, lua_pcall, and lua_pcallk within a function called by a coroutine.

extern "C" {
#include <lauxlib.h>
#include <lua.h>
}

#include <cassert>
#include <iostream>

//#define USE_PCALL
//#define USE_PCALLK

#define CODE(C) \
case C: { \
  std::cout << "When returning to " << where << " got code '" #C "'" << std::endl; \
  break; \
}

#define ERRCODE(C) \
case C: { \
  std::cout << "When returning to " << where << " got code '" #C "': " << lua_tostring(L, -1) << std::endl; \
  break; \
}

int report_resume_code(int code, const char * where, lua_State * L) {
  switch (code) {
    CODE(LUA_OK)
    CODE(LUA_YIELD)
    ERRCODE(LUA_ERRRUN)
    ERRCODE(LUA_ERRMEM)
    ERRCODE(LUA_ERRERR)
    default:
      std::cout << "An unknown error code in " << where << ": " << lua_tostring(L, -1) << std::endl;
  }
  return code;
}

int report_pcall_code(int code, const char * where, lua_State * L) {
  switch(code) {
    CODE(LUA_OK)
    ERRCODE(LUA_ERRRUN)
    ERRCODE(LUA_ERRMEM)
    ERRCODE(LUA_ERRERR)
    default:
      std::cout << "An unknown error code in " << where << ": " << lua_tostring(L, -1) << std::endl;
  }
  return code;
}

int trivial(lua_State *, int, lua_KContext) {
  std::cout << "Called continuation function" << std::endl;
  return 0;
}

int f(lua_State * L) {
  std::cout << "Called function 'f', yielding" << std::endl;
  return lua_yield(L, 0);
}

int g(lua_State * L) {
  std::cout << "Called function 'g'" << std::endl;

  lua_getglobal(L, "f");
#ifdef USE_PCALL
  std::cout  << "pcall..." << std::endl;
  report_pcall_code(lua_pcall(L, 0, 0, 0), __func__, L);
  // ^ yield across pcall!
  // If we yield, there is no way ever to return normally from this pcall,
  // so it is an error.
#elif defined(USE_PCALLK)
  std::cout  << "pcallk..." << std::endl;
  report_pcall_code(lua_pcallk(L, 0, 0, 0, 0, trivial), __func__, L);
#else
  std::cout << "call..." << std::endl;
  lua_call(L, 0, 0);
  // ^ yield across call!
  // This results in an error being reported in lua_resume, rather than at
  // the pcall
#endif
  return 0;
}

int main () {
  std::cout << "Starting:" << std::endl;

  lua_State * L = luaL_newstate();

  // init
  {
    lua_pushcfunction(L, f);
    lua_setglobal(L, "f");

    lua_pushcfunction(L, g);
    lua_setglobal(L, "g");
  }

  assert(lua_gettop(L) == 0);

  // Some action
  {
    lua_State * T = lua_newthread(L);
    lua_getglobal(T, "g");

    while (LUA_YIELD == report_resume_code(lua_resume(T, L, 0), __func__, T)) {}
  }

  lua_close(L); 

  std::cout << "Bye! :-)" << std::endl;
}

Example output:

call

Starting:
Called function 'g'
call...
Called function 'f', yielding
When returning to main got code 'LUA_ERRRUN': attempt to yield across a C-call boundary
Bye! :-)

pcall

Starting:
Called function 'g'
pcall...
Called function 'f', yielding
When returning to g got code 'LUA_ERRRUN': attempt to yield across a C-call boundary
When returning to main got code 'LUA_OK'
Bye! :-)

pcallk

Starting:
Called function 'g'
pcallk...
Called function 'f', yielding
When returning to main got code 'LUA_YIELD'
Called continuation function
When returning to main got code 'LUA_OK'
Bye! :-)

answered Oct 04 '22 19:10 by Chris Beck