TCL performance: One billion loop

Question

I was following the 1 billion iteration challenge launched on Twitter (https://github.com/bddicken/languages), so I decided to see how TCL would behave. However, I had a very unpleasant surprise. While the python code took 91.16s on my machine, the equivalent TCL code took 778.70s. Below are the two codes. The python code was taken from the challenge repository. I'm not a big TCL expert, so the question is: Is there any optimization possible?

TCL code.

set u [expr {int([lindex $argv 0])}]
set r [expr {int(rand()*10000)}]
array set a {}

for {set i 0} {$i < 10000} {incr i} {
    set a($i) 0
    for {set j 0} {$j < 100000} {incr j} {
      set a($i) [expr {$a($i) + $j % $u}]
    }
    set a($i) [expr {$a($i) + $r}]
}

puts $a($r)

Python code

import sys
import random

def main():
    u = int(sys.argv[1])         # Get an input number from the command line
    r = random.randint(0, 10000) # Get a random number 0 <= r < 10k
    a  = [0] * 10000             # Array of 10k elements initialized to 0
    for i in range(10000):       # 10k outer loop iterations
      for j in range(100000):    # 100k inner loop iterations, per outer loop iteration
        a[i] += j%u              # Simple sum
      a[i] += r                  # Add a random value to each element in array
    print(a[r])                  # Print out a single element from the arra
main()

Python 3.12.7
TCL 8.6 (little changes between tcl 8.6 and 9)
My machine is an Aspire 3 notebook, Ryzen 7 with 24G of RAM
O.S Arch Linux

EDIT:

I tried some code changes without major performance changes, such as:

Use of list
Use of array
Use of for
Use of while
Switch TCL version.

I don't want them to "optimize my code", because this code is somewhat useless and won't change the way I use TCL in my tasks. But as I already read somewhere that TCL and Python have similar performance, I found it strange that such basic code would show such a difference. I think it may be some details in my implementation.

So I would like to know what those more experienced in TCL can teach about this.

Cyan Ogilvie · Accepted Answer

I love Tcl, but this type of language benchmark is testing a use case that Tcl shouldn't be used for. That's not to say that most of your system shouldn't be Tcl (if that makes sense), just that tight inner loops like this that boil down to a few integer instructions are completely unsuitable for this sort of language.

If you find yourself snowed in one day with nothing to do - make a lot of coffee (in a thermos, this will take a while) and step through the c code in a debugger as the Tcl interp runs just one iteration of each of the nested loops (set the iteration count to 1 and 1). I've done this (or at least for similar cases), and I think you'll leave with a deep appreciation of the cost of the abstraction layer that Tcl is presenting. What will also become painfully apparent is that most of the work that is actually done in an example like this is managing the bytecode execution - a vanishingly small portion of the work is the actual computation that the code performs.

The reason that this kind of benchmark is useless to evaluate Tcl performance is that the results you get from it don't extrapolate to meaningful performance insights into real work that you'd actually do in Tcl, even if you, like me, reach for Tcl first, and in more situations than may be strictly justifiable.

Also note that if you run Tcl code like this in a docker environment, the performance is MUCH worse - around a factor of 2 in my testing. That comes from docker enabling some speculative execution vulnerability mitigations that aren't enabled by default on Linux (because their performance impact is large). Any language with the kind of conditional-jumps-based bytecode dispatch that Tcl uses is affected by this (python too), but Tcl is especially hard hit.

Rant aside, if the computation in question was somehow needed in the context of some bigger Tcl codebase, I'd drop to C just for that bit. C expresses that code as clearly and tersely as Tcl, so there is no complexity or comprehensibility advantage to implementing it in the higher level language. From the very beginnings of the language it was designed to play nicely with C, so there is very little friction to doing that. Mainly the friction comes from having to pre-compile the C extension for each platform you use, but I built a package to take care of all that (jitc). There are others like it (tcctcl, critcl). Doing that, and comparing the performance yields some interesting insights:

package require jitc    ;# Just-In-Time C: https://github.com/cyanogilvie/jitc

proc tclbusy {n r} {
    set a {}

    for {set i 0} {$i < 10000} {incr i} {
        set x 0
        for {set j 0} {$j < 100000} {incr j} {
            incr x [expr {$j % $n}]
        }
        lappend a [incr x $r]
    }

    lindex $a $r
}

set compile_start   [clock microseconds]
jitc::bind busybusybusy {
    code {
        OBJCMD(busywork) {
            int         code = TCL_OK;
            Tcl_WideInt n, r;
            Tcl_Obj*    values = NULL;

            enum {A_cmd, A_N, A_R, A_objc};
            CHECK_ARGS_LABEL(finally, code, "u r");
            TEST_OK_LABEL(finally, code, Tcl_GetWideIntFromObj(interp, objv[A_N], &n));
            TEST_OK_LABEL(finally, code, Tcl_GetWideIntFromObj(interp, objv[A_R], &r));

            replace_tclobj(&values, Tcl_NewListObj(10000, NULL));
            for (int i=0; i<10000; i++) {
                Tcl_WideInt x = 0;
                for (int j=0; j<100000; j++) x += j % n;
                x += r;
                TEST_OK_LABEL(finally, code, Tcl_ListObjAppendElement(interp, values, Tcl_NewWideIntObj(x)));
            }

            Tcl_Obj*    chosen = NULL;
            TEST_OK_LABEL(finally, code, Tcl_ListObjIndex(interp, values, r, &chosen));
            Tcl_SetObjResult(interp, chosen);

        finally:
            replace_tclobj(&values, NULL);
            return code;
        }
    }
} busywork

jitc::bind busybusybusy2 {
    code {
        OBJCMD(busywork) {
            int             code = TCL_OK;
            Tcl_WideInt     n, r;
            Tcl_WideInt*    values = NULL;

            enum {A_cmd, A_N, A_R, A_objc};
            CHECK_ARGS_LABEL(finally, code, "u r");
            TEST_OK_LABEL(finally, code, Tcl_GetWideIntFromObj(interp, objv[A_N], &n));
            TEST_OK_LABEL(finally, code, Tcl_GetWideIntFromObj(interp, objv[A_R], &r));

            values = ckalloc(10000 * sizeof Tcl_WideInt);
            for (int i=0; i<10000; i++) {
                Tcl_WideInt x = 0;
                for (int j=0; j<100000; j++) x += j % n;
                x += r;
                values[i] = x;
            }

            Tcl_SetObjResult(interp, Tcl_NewWideIntObj(values[r]));

        finally:
            if (values) {
                ckfree(values);
                values = NULL;
            }
            return code;
        }
    }
} busywork
puts "Compile 2 c versions: [format %.3f [expr {([clock microseconds]-$compile_start)/1e3}]] ms"

set r   [expr {int(rand()*10000)}]

puts "
Tcl version: accumulate state in a list with lappend:"
puts [timerate {
    puts [tclbusy [lindex $argv 0] $r]
}]

puts "
Accumulate state in a Tcl list:"
puts [timerate {
    puts [busybusybusy [lindex $argv 0] $r]
}]

puts "
Accumulate state in a c array:"
puts [timerate {
    puts [busybusybusy2 [lindex $argv 0] $r]
}]

Produces (in an optimised Tcl 8.7 interp on my i7-12800H laptop, Linux, not containerised):

Compile 2 c versions: 9.373 ms

Tcl version: accumulate state in a list with lappend:
2050558
39840865 µs/# 1 # 0.025 #/sec 39840.865 net-ms

Accumulate state in a Tcl list:
2050558
2093536 µs/# 1 # 0.478 #/sec 2093.536 net-ms

Accumulate state in a c array:
2050558
2092008 µs/# 1 # 0.478 #/sec 2092.008 net-ms

Note how similar the times are for both c implementations - the first one most closely mirrors the Tcl implementation in that it accumulates the state in a Tcl list, and creates a Tcl_Obj representing each of the 10,000 results.

The second c implementation uses a plain c array with unboxed integer values, and only boxes the picked out ($r) result.

The performance is pretty much indistinguishable. But since the work done in that depth is washed out by a factor of 100,000 by the innermost loop, it shouldn't really be surprising. This is also why the Tcl implementations using associative arrays and lists show no meaningful performance difference (unless access into that structure is in the inner loop).

TCL performance: One billion loop

Tags:

tcl

Ricardo

1 Answers

Cyan Ogilvie

Recent Activity

Donate For Us

TCL performance: One billion loop

Tags:

tcl

Ricardo

1 Answers

Cyan Ogilvie

Related questions

Recent Activity

Donate For Us