 

How is APL optimized to have great performance at array processing? What are some example tricks and optimizations it performs?

I am interested in how APL is so efficient at what it does, to the point of sometimes being benchmarked as outperforming C. So I'm curious, what are some of the optimizations done by the APL compiler to make the language so efficient?

Asked Jul 03 '19 by Dan Harmon


People also ask

Is APL fast?

While a language such as APL cannot inherently be fast or slow, it is often described as being suitable to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance.

What was APL used for?

APL is an array-oriented programming language that will change the way you think about problems and data. With a powerful, concise syntax, it lets you develop shorter programs that enable you to think more about the problem you're trying to solve than how to express it to a computer.

What does APL stand for programming?

APL (named after the book A Programming Language) is a programming language developed in the 1960s by Kenneth E. Iverson. Its central datatype is the multidimensional array.

Is APL useful?

APL is useful as a language in its own right, but with a proliferation of parallel domain-specific languages, including things like ArBB and Copperhead, APL presents a number of advantages to domain-specific languages. Its simple syntax makes it easy to map problems in one DSL to APL and back.


2 Answers

You cannot compare two languages (like C vs. APL) as such in terms of performance because the performance depends considerably on the implementations of the languages and the libraries used. The same C program can be slow on one platform (read: Windows) and fast on another. The key point is that performance is almost entirely a property of a given implementation of a language and not a property of the language itself.

In the case of APL, one can split the CPU cycles needed for a given operation into two parts: the interpreter overhead (processing the tokens that make up an APL program) and the primitives themselves (addition, reduce, etc.). In most APL interpreters the interpreter overhead is rather small, which implies that optimizing that part cannot gain much performance (Amdahl's law). In early APLs (say, around 1970) that was different. The processing of the APL primitives in current interpreters is implemented in C/C++, so that part of the CPU cycles is, performance-wise, the same as for C (again keeping in mind that the implementation can make a difference).
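
As an illustration (a minimal sketch, not the actual source of any interpreter), the per-element work of a primitive such as dyadic + on two integer vectors boils down to an ordinary C loop once the tokens have been dispatched:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of how an interpreter might implement the APL
       primitive A + B for two conforming int64 vectors.  After dispatch,
       the per-element work is plain C, which is why the primitive itself
       runs at "C speed" (and is typically auto-vectorized by the compiler). */
    static void plus_int64(const int64_t *a, const int64_t *b,
                           int64_t *result, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            result[i] = a[i] + b[i];
    }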

I have performed some benchmarks at the level of APL primitives: different scalar functions, from simple (integer addition) to not so simple (complex arc cosine), and outer products of them. The somewhat surprising result was that the performance of different scalar functions was dominated not by the complexity of the computed function but by the memory access time (including caches) and by the CPU's branch prediction. For example, if you run the same APL operation in a loop, the second iteration is typically twice as fast as the first, and subsequent iterations stabilize after about the fourth (on an i5-4570 CPU).
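
This effect can be reproduced with a plain C micro-benchmark (illustrative only, not the benchmark used above, and relying on POSIX clock_gettime): repeat the same pass over an array and the first iteration is slower because the caches and branch predictors are still cold.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Illustration of the cache warm-up effect described above: time the
       same reduction over a vector several times; on typical hardware the
       first pass is the slowest and later passes stabilize. */
    int main(void)
    {
        size_t n = 1u << 20;                     /* 1M doubles, about 8 MB */
        double *x = malloc(n * sizeof *x);
        if (!x) return 1;
        for (size_t i = 0; i < n; i++) x[i] = (double)i;

        for (int iter = 0; iter < 6; iter++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            double sum = 0.0;
            for (size_t i = 0; i < n; i++) sum += x[i];  /* the repeated operation */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                      + (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("iteration %d: %.3f ms (sum=%g)\n", iter, ms, sum);
        }
        free(x);
        return 0;
    }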

The measured values fluctuated a lot, which makes old-fashioned performance comparisons (like "interpreter X is twice as fast as interpreter Y") rather meaningless.

As a rule of thumb, if the average vector size (i.e. ⍴,X) of your APL program is 20 or more, then you can entirely ignore the interpreter overhead and the APL program has roughly the same performance as a comparable C program.
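
A back-of-the-envelope model of this rule of thumb (the two constants below are purely hypothetical, not measured values): the total time per primitive is a fixed dispatch cost plus a per-element cost, so the overhead share shrinks quickly as the vector grows.

    #include <stdio.h>

    /* Illustrative amortization model: a hypothetical dispatch overhead per
       primitive plus a hypothetical per-element cost.  The point is only
       that the fixed overhead becomes negligible for longer vectors. */
    int main(void)
    {
        const double dispatch_ns = 100.0;   /* hypothetical per-primitive overhead */
        const double per_elem_ns = 1.0;     /* hypothetical cost per element */

        for (int n = 1; n <= 1024; n *= 2) {
            double total = dispatch_ns + n * per_elem_ns;
            printf("n = %4d   overhead share = %5.1f%%\n",
                   n, 100.0 * dispatch_ns / total);
        }
        return 0;
    }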

The cases where APL is faster than C (which, in theory, is impossible) can frequently be traced back to the use of different algorithms in the C and the APL code. A typical real-life example is sorting: heapsort in one implementation and quicksort in the other. This is again a difference between implementations and not between the languages themselves.

Answered Oct 19 '22 by Jürgen


Here are some examples of how it is done:

  • Stackless Traversal: blog

  • Packed Bit Booleans: blog, video (see the C sketch after this list)

  • Vector Instructions: video

  • Hashed arrays: video, documentation

  • Special code for certain phrases: video

  • Pipelines and CRCs: video
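
For instance, the packed-bit-Boolean idea from the list above can be sketched in C (an illustration of the technique only, not any interpreter's actual code): storing one Boolean per bit lets a single 64-bit AND process 64 array elements at once.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of packed bit Booleans: Boolean arrays are stored one bit per
       element, so each 64-bit word operation performs 64 element-wise ANDs. */
    static void and_bool(const uint64_t *a, const uint64_t *b,
                         uint64_t *result, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i++)
            result[i] = a[i] & b[i];
    }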

Relatedly, these discuss the principles behind the above:

  • Choosing algorithms depending on data patterns seen at runtime and how a lazy/thunked APL could skip some code entirely: video

  • Less seek latency by reading entire simple arrays, and avoiding branch-prediction failures through branchless code: video (APL aligns with these ideas, and encourages these styles, more easily than many other languages; a small C sketch of the branchless style follows below)
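
A small C illustration of the branchless style mentioned in the last point (again only a sketch): clamping negative values to zero, as in the APL expression 0⌈x, without a data-dependent branch per element.

    #include <stddef.h>
    #include <stdint.h>

    /* Branchless clamp-to-zero over an integer vector.  A per-element
       if/else on random data would suffer branch mispredictions; turning
       the comparison into an all-ones/all-zeros mask avoids the branch. */
    static void max0(const int64_t *x, int64_t *result, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int64_t v = x[i];
            int64_t keep = -(int64_t)(v >= 0);  /* 0 or all ones, no branch */
            result[i] = v & keep;               /* negative values become 0 */
        }
    }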

Answered Oct 19 '22 by Adám