LuaJIT FFI callback performance

Question

The LuaJIT FFI docs mention that calling from C back into Lua code is relatively slow and recommend avoiding it where possible:

Do not use callbacks for performance-sensitive work: e.g. consider a numerical integration routine which takes a user-defined function to integrate over. It's a bad idea to call a user-defined Lua function from C code millions of times. The callback overhead will be absolutely detrimental for performance.

For new designs avoid push-style APIs (C function repeatedly calling a callback for each result). Instead use pull-style APIs (call a C function repeatedly to get a new result). Calls from Lua to C via the FFI are much faster than the other way round. Most well-designed libraries already use pull-style APIs (read/write, get/put).

However, they don't give any sense of how much slower callbacks from C are. If I have some code that I want to speed up that uses callbacks, roughly how much of a speedup could I expect if I rewrote it to use a pull-style API? Does anyone have any benchmarks comparing implementations of equivalent functionality using each style of API?
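For readers unfamiliar with the two API styles the docs contrast, here is a minimal C sketch of the same functionality exposed both ways. All names (range_for_each, range_iter, add_to_sum) are hypothetical and invented for illustration; they are not from LuaJIT or any real library.

```c
/* Push style: the library owns the loop and calls back once per element.
 * If `visit` were a Lua function passed through the FFI, every element
 * would cost one C -> Lua callback. */
typedef void (*visit_fn)(int value, void *user_data);

void range_for_each(int start, int stop, visit_fn visit, void *user_data) {
    for (int i = start; i < stop; i++) {
        visit(i, user_data);
    }
}

/* Pull style: the caller owns the loop and asks for one element at a time.
 * From LuaJIT, every crossing would be a Lua -> C call. */
typedef struct {
    int next; /* next value to hand out */
    int stop; /* exclusive upper bound */
} range_iter;

void range_iter_init(range_iter *it, int start, int stop) {
    it->next = start;
    it->stop = stop;
}

/* Returns 1 and stores the next value in *out, or returns 0 when exhausted. */
int range_iter_next(range_iter *it, int *out) {
    if (it->next >= it->stop) {
        return 0;
    }
    *out = it->next++;
    return 1;
}

/* Example push-style callback: accumulates values into a long. */
void add_to_sum(int value, void *user_data) {
    *(long *)user_data += value;
}
```

With the pull-style pair, a Lua program would declare range_iter_init and range_iter_next via ffi.cdef and drive the loop itself, so no Lua function pointer ever has to be handed to C. Whether that is actually faster in a given application is exactly what the question asks, and it still needs to be measured.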

Josh Parnell · Accepted Answer

Since this issue (and LJ in general) has been the source of great pain for me, I'd like to toss some extra information into the ring, in hopes that it may assist someone out there in the future.

'Callbacks' are Not Always Slow

The LuaJIT FFI documentation, when it says 'callbacks are slow,' is referring very specifically to the case of a callback created by LuaJIT and passed through the FFI to a C function that expects a function pointer. This is completely different from other callback mechanisms; in particular, it has entirely different performance characteristics from calling a standard lua_CFunction that uses the Lua C API to invoke a callback.

With that said, the real question is then: when do we use the Lua C API to implement logic that involves pcall et al, vs. keeping everything in Lua? As always with performance, but especially in the case of a tracing JIT, one must profile (-jp) to know the answer. Period.

I have seen situations that looked similar yet fell on opposite ends of the performance spectrum; that is, I have encountered code (not toy code, but rather production code in the context of writing a high-perf game engine) that performs better when structured as Lua-only, as well as code (that seems structurally-similar) that performs better upon introducing a language boundary via calling a lua_CFunction that uses luaL_ref to maintain handles to callbacks and callback arguments.

Optimizing for LuaJIT without Measurement is a Fool's Errand

Tracing JITs are already hard to reason about, even if you're an expert in static language perf analysis. They take everything you thought you knew about performance and shatter it to pieces. If the concept of compiling recorded IR rather than compiling functions doesn't already annihilate one's ability to reason about LuaJIT performance, then the fact that calling into C via the FFI is more-or-less free when successfully JITed, yet potentially an order-of-magnitude more expensive than an equivalent lua_CFunction call when interpreted...well, this for sure pushes the situation over the edge.

Concretely, a system that you wrote last week that vastly out-performed a C equivalent may tank this week because you introduced an NYI in trace-proximity to said system, which may well have come from a seemingly-orthogonal region of code, and now your system is falling back and obliterating performance. Even worse, perhaps you're well-aware of what is and isn't an NYI, but you added just enough code to the trace proximity that it exceeded the JIT's max recorded IR instructions, max virtual registers, call depth, unroll factor, side trace limit...etc.

Also, note that, while 'empty' benchmarks can sometimes give a very general insight, it is even more important with LJ (for the aforementioned reasons) that code be profiled in context. It is very, very difficult to write representative performance benchmarks for LuaJIT, since traces are, by their nature, non-local. When using LJ in a large application, these non-local interactions become tremendously impactful.

TL;DR

There is exactly one person on this planet who really and truly understands the behavior of LuaJIT. His name is Mike Pall.

If you are not Mike Pall, do not assume anything about LJ behavior and performance. Use -jv (verbose; watch for NYIs and fallbacks), -jp (profiler! Combine with jit.zone for custom annotations; use -jp=vf to see what % of your time is being spent in the interpreter due to fallbacks), and, when you really need to know what's going on, -jdump (trace IR & ASM). Measure, measure, measure. Take generalizations about LJ performance characteristics with a grain of salt unless they come from the man himself or you've measured them in your specific use case (in which case, after all, it's not a generalization). And remember, the right solution might be all in Lua, it might be all in C, it might be Lua -> C through FFI, it might be Lua -> lua_CFunction -> Lua, ...you get the idea.

Coming from someone who has been fooled time and time again into thinking that he has understood LuaJIT, only to be proven wrong the following week, I sincerely hope this information helps someone out there :) Personally, I simply no longer make 'educated guesses' about LuaJIT. My engine outputs jv and jp logs for every run, and they are the 'word of God' for me with respect to optimization.

Miles · Answer

On my computer, a function call from LuaJIT into C has an overhead of 5 clock cycles (notably, just as fast as calling a function via a function pointer in plain C), whereas calling from C back into Lua has a 135-cycle overhead, 27x slower. That being said, a program that required a million calls from C into Lua would only add on the order of 100 ms of overhead to its runtime (about 75 ms at the 75 ns/call measured below). While it might be worth avoiding FFI callbacks in a tight loop that operates on mostly in-cache data, the overhead of callbacks that are invoked, say, once per I/O operation is probably not going to be noticeable compared to the overhead of the I/O itself.

$ luajit-2.0.0-beta10 callback-bench.lua   
C into C          3.344 nsec/call
Lua into C        3.345 nsec/call
C into Lua       75.386 nsec/call
Lua into Lua      0.557 nsec/call
C empty loop      0.557 nsec/call
Lua empty loop    0.557 nsec/call

$ sysctl -n machdep.cpu.brand_string         
Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz

Benchmark code: https://gist.github.com/3726661
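The cycle figures quoted in the answer appear to follow from the table if the empty-loop time is subtracted from each per-call time and the result is multiplied by the CPU's nominal 1.8 GHz clock. That derivation is an assumption on my part (the answer does not state its method, and turbo boost would change the effective clock), and the helper names below are hypothetical. A small C sketch of the arithmetic:

```c
/* Approximate per-call overhead in clock cycles.
 * Assumes the empty-loop time is pure loop overhead and that the CPU runs
 * at its nominal clock (ghz is cycles per nanosecond). */
double overhead_cycles(double ns_per_call, double ns_empty_loop, double ghz) {
    return (ns_per_call - ns_empty_loop) * ghz;
}

/* Total time in milliseconds for `calls` calls at `ns_per_call` each. */
double total_ms(double ns_per_call, double calls) {
    return ns_per_call * calls / 1e6;
}
```

Plugging in the measured numbers: overhead_cycles(3.345, 0.557, 1.8) is about 5.0 cycles for Lua into C, overhead_cycles(75.386, 0.557, 1.8) is about 134.7 cycles for C into Lua, a ratio of roughly 27, and total_ms(75.386, 1e6) is about 75.4 ms for a million callbacks.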
