Whilst learning the "assembler language" (in linux on a x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary as your program runs in user-space.
However system calls are rather expensive in terms of performance as they require an interrupt (and of course a system call) which means that a context switch must be made from your current active program in user-space to the system running in kernel-space.
The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
my question is therefore twofold (with an extra bonus question underneath it):
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficient as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP
) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is you can do multithreaded programming without calling expensive OS task management primitives. Simply ignore the OS for thread scheduling operations. This means you have to write your own thread scheduler, and simply never pass control back to the OS. (And you have to be cleverer somehow about your thread overhead than the pretty smart OS guys). We chose this approach precisely because windows process/thread/ fiber calls were all too expensive to support computation grains of a few hundred instructions.
Our PARLANSE programming langauge is a parallel programming language: See http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism construct, and schedules such grains by a combination of a highly tuned hand-written scheduler and scheduling code generated by the PARLANSE compiler that takes into account the context of grain to minimimze scheduling overhead. For instance, the compiler ensures that the registers of a grain contain no information at the point where scheduling (e.g., "wait") might be required, and thus the scheduler code only has to save the PC and SP. In fact, quite often the scheduler code doesnt get control at all; a forked grain simply stores the forking PC and SP, switches to compiler-preallocated stack and jumps to the grain code. Completion of the grain will restart the forker.
Normally there's an interlock to synchronize grains, implemented by the compiler using native LOCK DEC instructions that implement what amounts to counting semaphores. Applications can fork logically millions of grains; the scheduler limits parent grains from generating more work if the work queues are long enough so more work won't be helpful. The scheduler implements work-stealing to allow work-starved CPUs to grab ready grains form neighboring CPU work queues. This has been implemented to handle up to 32 CPUs; but we're a bit worried that the x86 vendors may actually swamp use with more than that in the next few years!
PARLANSE is a mature langauge; we've been using it since 1997, and have implemented a several-million line parallel application in it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With