Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can non-persistent data structures be used in a purely functional way?

This is a general question about functional programming but I'm also interested what answer it has in specific languages.

I only have beginner's knowledge about functional languages, so bear with me.

It is my understanding that functional languages put a focus on different data structures than imperative languages, as they like immutability: The persistent data structures.

For example, they all have an immutable list concept where you can form a new lists x :: l and y :: l from an existing list l and two new items x and y without all elements of l needing to be copied. This is likely implemented by the new list object internally pointing to the old one as the tail.

In imperative languages, such a data structure is rarely used, as they don't provide as good locality of reference as c-style arrays do.

In general, finding data structures that support the functional style is an endeavor of it's own, so it would be great if one wouldn't always have to do that.

Now here's an idea how one could use all classical data structures in functional programming if there's the right language support for it.

In general, a data structure in an imperative language has modifying operations defined on it (in pseudo-code):

data.modify(someArgument)

The functional way of writing this is

newData = modified(data, someArgument)

The general problem is that this normally requires copying the data structure - except if the language could know that data will in fact not be used by anything else: Then, the modification could be done in the form of mutating the original and no one could tell the difference.

There is a large class of cases where the language could infer that property of "never used elsewhere": When the first argument to modified is an unbound value, as in this example:

newData = modified(modified(data, someArgument))

Here, data may be used elsewhere, but modified(data, someArgument) clearly isn't.

This is what in C++ is called an "rvalue", and in the latest incarnation of C++, which ironically isn't functional at all otherwise, one can overload on such rvalues.

For example, one can write:

Data modified(Data const& data) { // returns a modified copy }
Data modified(Data && data) { // returns the modified original }

That means that in C++, one can actually take any mutable efficient data structure and convert it to having an immutable api that can be used in a purely functional way just as efficiently as the imperative version would be.

(There's the caveat that in C++ still sometimes casting is necessary to force the rvalue overload. And of course care need to be taken on implementing such data structures, ie. when using the rvalue overloads. That could probably be improved on though.)

Now my question:

Do actual functional languages have a similar mechanism? Or is this not necessary for some other reason?

(I tagged some specific languages I'm particularly interested in.)

like image 904
John Avatar asked Oct 23 '16 12:10

John


2 Answers

It is true that persistent data structures are slower than their mutable counterparts. There is no dispute about that. Sometimes the difference is negligible (iterating over a linked list versus an array), other times it can be big (iterating in reverse), but that's not the point. The choice to use immutable data is (or should be) a conscious one: we're trading performance for stability.

Consider this point: for most (not all) modern programs, local performance is not a concern. For today's programs, the real performance bottleneck is in parallelization - both on local machine with shared memory, as well as across different machines. With the amounts of data we're processing these days, squeezing every last bit out of memory locality and branch prediction isn't going to cut it. We need scale. And guess the number one source of bugs in parallel programs? That's right - mutation.

Another big concern for modern programs is stability. Long gone are the days when a program could crash on you, and you just restarted it and kept working. Today's programs need to work on headless servers, without human intervention at all, for months or years. Today's program can't just throw up its digital arms and expect a human to figure out what went wrong. In this setting, local performance is so much less important than stability and parallelization: it's much cheaper to buy (or rent) another ten servers than to hire a human for restarting the program every now and then.

It is true that one can make a parallelizable and stable program using mutation. It is theoretically possible. It's just way harder. With immutable data, you have to actually aim at your foot first.

And then, here's some perspective: we've already been there. How often do you use the goto instruction in your code? Have you considered why that is? You can do all kinds of neat performance tricks with goto, and yet we choose not to. At some point in programming history, we have decided that goto was more trouble than it was worth. Same thing happened with raw pointers: many languages don't have them at all, others have them closely guarded, and even in those languages that have unrestricted access to raw pointers, it's now considered kind of a bad form to use them. Today, we're in the middle of the next stage: first we gave up goto, then we gave up raw pointers, now we're slowly giving up mutation.

However, if you really find yourself pushing the envelope of local performance for a legitimate reason, and you have determined that the immutable data is indeed the bottleneck (remember: first measure, then optimize), then most functional languages (except Haskell and Elm) will let you get away with mutation, albeit reluctantly. Just like raw pointers in C#, you can have mutation, you just have to be explicit (and careful) about it. In F#, for example, you can have mutable variables, raw arrays, mutable records, classes, interfaces, etc. It is possible, just not recommended. And the general consensus so far is that it's OK to employ mutation as long as it's localized (i.e. does not leak to outside), and you really know what you're doing, and you've documented it, and tested it to the death.

A common case for this is "value construction", where your have a function that ultimately produces an immutable value, but does all kinds of messy stuff while doing it. One example is how F# core library implements List.map: normally, because lists are iterated front-to-back, but constructed back-to-front, one would need to first construct the transformed list by iterating, and then reverse it. So the F# compiler cheats here and mutates the list as it's being constructed in order to avoid the extra reversal.

And another note on the "locality" concern. Remember how I mentioned that you could do all kinds of neat performance tricks with goto? Well, that's not quite true anymore. Since the programmers started writing programs without goto, the binary code became more predictable, because jumps are now generated by compilers, not coded by humans. This made it possible for CPUs to, well, predict them, and optimize processing based on these predictions. The end result is that now you're actually likely to get worse performance by indiscriminate use of goto than by using accepted higher-level tools like loops. Back in the day, CPUs couldn't afford to be that smart, so that the choice to not use goto was purely a stability measure. But now it turned out to be actually helpful with performance, who would have thought?

I submit that the same thing will happen with immutability as well. I'm not sure exactly how it will happen, but I'm sure that it will. Even today, without special hardware, it is still possible to do some optimization at compile time: for example, if the compiler knows that a variable is immutable, it may decide to cache it in a register for a long period, or even promote it to a constant altogether. It is true that most actual compilers today don't perform all of these possible optimizations (though they do perform some), but they will. We're only just beginning. :-)

like image 143
Fyodor Soikin Avatar answered Sep 18 '22 23:09

Fyodor Soikin


I'm pretty sure that a feature like alias analysis (checking if data is used elsewhere) is not part of the Scala compiler (nor part of other FP languages like Haskell and Clojure). The collections API in Scala (for example) is explicitly split into immutable and mutable packages. The immutable data structures use the concept of structural sharing to negate the need to copy data and therefore reduce the overhead (in terms of the amount of temporal data) of working with immutable structures.

As you already pointed out, methods like cons :: create a new immutable structure which, under the hood, contains a reference to any existing immutable data (rather than making a copy of it).

Conversions between mutable and immutable types (in Scala for example) make a copy of the mutable data (usually in a lazy fashion), rather than using any mechanisms such as checking whether the mutable structure is not referred to anywhere else and allowing it to be mutated.

When I first moved over from Java to Scala I initially thought that the (often) great deal of temporal data that must get created when working with immutable structures might be a performance constraint and involve some clever techniques to actually allow mutation where it was safe to do so but this is not so because the idea is that immutable data never points to younger values. Since younger values do not exist at the time when an old value is created, it cannot be pointed to at the time of creation, and since values are never modified, neither can it be pointed to later. The outcome is that FP languages like Scala/Haskell are able to generate all this temporal data since the garbage collectors are able to remove it in a very short time frame.

In a nutshell, Scala/Haskell (I'm not sure about F#) do not allow mutation of immutable structures because the state of runtime environments like the current JVM's have very efficient garbage collection and therefore temporal data can be removed very quickly. Of course, as I'm sure you are aware, an immutable structure containing mutable elements is perfectly possible in FP languages like Scala but although the mutable elements can be changed, the immutable container cannot ie. elements can neither be added/removed.

like image 36
jacks Avatar answered Sep 20 '22 23:09

jacks