 

How is APL optimized to have great performance at array processing? What are some example tricks and optimizations it performs?

I am interested in how APL is so efficient at what it does, to the point of sometimes being benchmarked as outperforming C. So I'm curious, what are some of the optimizations done by the APL compiler to make the language so efficient?

Asked Jul 03 '19 by Dan Harmon


People also ask

Is APL fast?

While a language such as APL cannot inherently be fast or slow, it is often described as being suitable to high-performance implementation, and there are many APL implementations focused partially or exclusively on performance.

What was APL used for?

APL is an array-oriented programming language that will change the way you think about problems and data. With a powerful, concise syntax, it lets you develop shorter programs that enable you to think more about the problem you're trying to solve than how to express it to a computer.

What does APL stand for programming?

APL (named after the book A Programming Language) is a programming language developed in the 1960s by Kenneth E. Iverson. Its central datatype is the multidimensional array.

Is APL useful?

APL is useful as a language in its own right, but with a proliferation of parallel domain-specific languages, including things like ArBB and Copperhead, APL presents a number of advantages to domain-specific languages. Its simple syntax makes it easy to map problems in one DSL to APL and back.


2 Answers

You cannot compare two languages (like C vs. APL) as such in terms of performance because the performance depends considerably on the implementations of the languages and the libraries used. The same C program can be slow on one platform (read: Windows) and fast on another. The key point is that performance is almost entirely a property of a given implementation of a language and not a property of the language itself.

In the case of APL, one can split the CPU cycles needed for a given operation into two parts: the interpreter overhead (processing the tokens that make up an APL program) and the primitives themselves (addition, reduce, etc.). In most APL interpreters the interpreter overhead is rather small, which implies that optimizing that part cannot gain much performance (Amdahl's law). In early APLs (say, around 1970) that was different. The processing of the APL primitives in current interpreters is implemented in C/C++, so that part of the CPU cycles is, performance-wise, the same as for C (again keeping in mind that the implementation can make a difference).
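
As an illustration (a minimal sketch, not the actual source of any interpreter), the per-element work of a primitive such as dyadic + on two integer vectors boils down to an ordinary C loop once the tokens have been dispatched:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of how an interpreter might implement the APL
       primitive A + B for two conforming int64 vectors.  After dispatch,
       the per-element work is plain C, which is why the primitive itself
       runs at "C speed" (and is typically auto-vectorized by the compiler). */
    static void plus_int64(const int64_t *a, const int64_t *b,
                           int64_t *result, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            result[i] = a[i] + b[i];
    }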

I have performed some benchmarks at the level of APL primitives: different scalar functions, from simple (integer addition) to not so simple (complex arc cosine), and outer products of them. The somewhat surprising result was that the performance of different scalar functions was dominated not by the complexity of the computed function but by the memory access time (including caches) and by the CPU's branch prediction. For example, if you run the same APL operation in a loop, the second iteration is typically twice as fast as the first, and subsequent iterations stabilize after about the fourth (on an i5-4570 CPU).
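
This effect can be reproduced with a plain C micro-benchmark (illustrative only, not the benchmark used above, and relying on POSIX clock_gettime): repeat the same pass over an array and the first iteration is slower because the caches and branch predictors are still cold.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Illustration of the cache warm-up effect described above: time the
       same reduction over a vector several times; on typical hardware the
       first pass is the slowest and later passes stabilize. */
    int main(void)
    {
        size_t n = 1u << 20;                     /* 1M doubles, about 8 MB */
        double *x = malloc(n * sizeof *x);
        if (!x) return 1;
        for (size_t i = 0; i < n; i++) x[i] = (double)i;

        for (int iter = 0; iter < 6; iter++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            double sum = 0.0;
            for (size_t i = 0; i < n; i++) sum += x[i];  /* the repeated operation */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ms = (t1.tv_sec - t0.tv_sec) * 1e3
                      + (t1.tv_nsec - t0.tv_nsec) / 1e6;
            printf("iteration %d: %.3f ms (sum=%g)\n", iter, ms, sum);
        }
        free(x);
        return 0;
    }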

The measured values fluctuated a lot, which makes old-fashioned performance comparisons (like "interpreter X is twice as fast as interpreter Y") rather meaningless.

As a rule of thumb, if the average vector size (i.e. ⍴,X) of your APL program is 20 or more, then you can entirely ignore the interpreter overhead and the APL program has roughly the same performance as a comparable C program.
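
A back-of-the-envelope model of this rule of thumb (the two constants below are purely hypothetical, not measured values): the total time per primitive is a fixed dispatch cost plus a per-element cost, so the overhead share shrinks quickly as the vector grows.

    #include <stdio.h>

    /* Illustrative amortization model: a hypothetical dispatch overhead per
       primitive plus a hypothetical per-element cost.  The point is only
       that the fixed overhead becomes negligible for longer vectors. */
    int main(void)
    {
        const double dispatch_ns = 100.0;   /* hypothetical per-primitive overhead */
        const double per_elem_ns = 1.0;     /* hypothetical cost per element */

        for (int n = 1; n <= 1024; n *= 2) {
            double total = dispatch_ns + n * per_elem_ns;
            printf("n = %4d   overhead share = %5.1f%%\n",
                   n, 100.0 * dispatch_ns / total);
        }
        return 0;
    }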

The cases where APL is faster than C (which, in theory, is impossible) can frequently be traced back to the use of different algorithms in the C and the APL code. A typical real-life example is sorting: heapsort in one implementation and quicksort in the other. This is again a difference between implementations and not between the languages themselves.

Answered Oct 19 '22 by Jürgen


Here are some examples of how it is done:

  • Stackless Traversal: blog

  • Packed Bit Booleans: blog, video (see the C sketch after this list)

  • Vector Instructions: video

  • Hashed arrays: video, documentation

  • Special code for certain phrases: video

  • Pipelines and CRCs: video
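
For instance, the packed-bit-Boolean idea from the list above can be sketched in C (an illustration of the technique only, not any interpreter's actual code): storing one Boolean per bit lets a single 64-bit AND process 64 array elements at once.

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of packed bit Booleans: Boolean arrays are stored one bit per
       element, so each 64-bit word operation performs 64 element-wise ANDs. */
    static void and_bool(const uint64_t *a, const uint64_t *b,
                         uint64_t *result, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i++)
            result[i] = a[i] & b[i];
    }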

Relatedly, these discuss the principles behind the above:

  • Choosing algorithms depending on data patterns seen at runtime and how a lazy/thunked APL could skip some code entirely: video

  • Less seek latency by reading entire simple arrays, and avoiding branch-prediction failures through branchless code: video (APL aligns with these ideas, and encourages these styles, more easily than many other languages; a small C sketch of the branchless style follows below)
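
A small C illustration of the branchless style mentioned in the last point (again only a sketch): clamping negative values to zero, as in the APL expression 0⌈x, without a data-dependent branch per element.

    #include <stddef.h>
    #include <stdint.h>

    /* Branchless clamp-to-zero over an integer vector.  A per-element
       if/else on random data would suffer branch mispredictions; turning
       the comparison into an all-ones/all-zeros mask avoids the branch. */
    static void max0(const int64_t *x, int64_t *result, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int64_t v = x[i];
            int64_t keep = -(int64_t)(v >= 0);  /* 0 or all ones, no branch */
            result[i] = v & keep;               /* negative values become 0 */
        }
    }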

Answered Oct 19 '22 by Adám