Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell: Lists vs Streams

Tags:

I've noticed streams seem to act a lot like lists, except with constant time append. Of course, adding constant time append to lists isn't too complicated, and DList does exactly that.

Lets assume for the rest of the discussion that either lists have constant time append, or that we're simply not interested in it.

My thought is that Haskell lists should simply be implemented as streams. For this not to be the case, I assume that the following would need to hold:

  1. There are cases where lists are better than streams AND
  2. There are cases where streams are better than lists.

My question is: what are examples of the two above cases?

Note: For the purpose of this question, please ignore easily fixable omissions in the particular implementations I've discussed. I'm looking more for core structural differences here.

Additional info:

I guess part of what I'm getting at here is say if we write [1..1000000], does a Haskell compiler (say GHC) do:

  1. Make a list OR
  2. Make an object with two ints: 1 and 1000000 which fully describes the list.

If it's case (1), why do this, as creating intermediate lists seems to be an unnecessary performance penalty?

Or if it's case (2), then why do we need streams?

like image 468
Clinton Avatar asked Jun 08 '12 02:06

Clinton


People also ask

Are Haskell lists linked lists?

Thus, Haskell lists are singly-linked.

Can Haskell lists have different types?

Recall earlier that Haskell has many different kinds of types, such as * for value-containing types, [*] for type-level lists, etc. Haskell also has a special kind called Symbol from the module GHC.

What is stream fusion?

This paper presents an automatic deforestation system, stream fusion, based on equational transformations, that fuses a wider range of functions than existing short-cut fusion systems. In particular, stream fusion is able to fuse zips, left folds and functions over nested lists, including list comprehensions.


2 Answers

When you write [1..1000000], what GHC really does is create an object that contains 1 and 1000000 that describes how to build the list of interest; that object is called a "thunk". The list is only built as necessary to satisfy case scrutinees; for example, you might write:

printList [] = putStrLn "" printList (x:xs) = putStrLn (show x) >> printList xs  main = printList [1..1000000] 

Which evaluates like this:

main = { definition of main } printList [1..1000000] = { list syntax sugar } printList (enumFromTo 1 1000000) = { definition of printList } case enumFromTo 1 1000000 of     [] -> putStrLn ""     x:xs -> putStrLn (show x) >> printList xs = { we have a case, so must start evaluating enumFromTo;     I'm going to skip a few steps here involving unfolding     the definition of enumFromTo and doing some pattern     matching } case 1 : enumFromTo 2 1000000 of     [] -> putStrLn ""     x:xs -> putStrLn (show x) >> printList xs = { now we know which pattern to choose } putStrLn (show 1) >> printList (enumFromTo 2 1000000) 

Then you'd find that 1 was printed to the console, and we'd start from near the top with enumFromTo 2 1000000 instead of enumFromTo 1 1000000. Eventually, you'd get all the numbers printed and it would come time to evaluate

printList (enumFromTo 1000000 1000000) = { definition of printList } case enumFromTo 1000000 1000000 of     [] -> putStrLn ""     x:xs -> putStrLn (show x) >> printList xs = { skipping steps again to evaluate enumFromTo } case [] of     [] -> putStrLn ""     x:xs -> putStrLn (show x) >> printList xs = { now we know which pattern to pick } putStrLn "" 

and evaluation would be finished.

The reason we need streams is a bit subtle. The original paper, Stream fusion: From lists to streams to nothing at all, probably has the most complete explanation. The short version is that when you have a long pipeline:

concatMap foo . map bar . filter pred . break isSpecial 

...it's not so obvious how to get the compiler to compile away all the intermediate lists. You might notice that we can think of the lists as having a sort of "state" that's being iterated, and that each of these functions, rather than traversing a list, is just changing the way the state gets modified on each iteration. The Stream type tries to make this explicit, and the result is stream fusion. Here's how it looks: we first convert all these functions into stream versions:

(toList . S.concatMap foo . fromList) . (toList . S.map bar . fromList) . (toList . S.filter pred . fromList) . (toList . S.break isSpecial . fromList) 

then observe that we can always annihilate fromList . toList:

toList . S.concatMap foo . S.map bar . S.filter pred . S.break . fromList 

...and then the magic happens because the chain S.concatMap foo . S.map bar . S.filter pred . S.break builds up an iterator explicitly rather than building it implicitly by internally building and then immediately annihilating actual lists.

like image 155
Daniel Wagner Avatar answered Oct 24 '22 03:10

Daniel Wagner


The advantage of streams is they are more powerful. The interface:

data Stream m a = forall s . Stream (s -> m (Step s a)) s Size    

lets you do many things that normal lists cannot. Eg:

  • Track the size (eg Unknown, Max 34, Exact 12)
  • Perform monadic actions to get the next element. Lists can partly do this with lazy IO, but that technique has proved to be error prone, and normally is only used by beginners, or for simple small scripts.

However, they have a big downside as compared to lists - complexity! For a beginner programmer, to understand streams you have to be on top of existential types and monadic actions. It would be very hard to learn haskell if to use the basic list type you had to learn those two complex subjects.

Compare that to lists, which have the interface:

data [] a = a : [a] | [] 

This is very simple, and something that can be taught easily to a new programmer.

Another advantage of lists is you can pattern match them simply. For example:

getTwo (a : b : _) = Just (a,b) getTwo _ = Nothing 

This is both useful to experienced programmers (I still use list pattern matching in many methods), and for beginner programmers who haven't yet learnt the standard higher order functions that can be used to manipulate lists.

Efficiency is also another potential advantage of lists, since ghc has spent a lot of time working on list fusion. In a lot of code, intermediate lists are never generated. That could be a lot harder to optimize with streams.

So I think it would be a poor choice to swap lists with Streams. The current situation is better, where you can bring them in if you need them, but beginners aren't stuck with their complexity and skilled users don't have to lose pattern matching.

EDIT: about [1..1000000]:

This is equivalent to enumFromTo 1 1000000, which is lazily evaluated, and subject to fusion (which makes it very efficient). Eg sum [1..1000000] would not generate any lists (and use constant memory) with optimisation turned on. So case (2) is correct, this situation isn't an advantage for streams due to lazy evaluation. As noted above though, streams have other advantages over lists.

like image 37
David Miani Avatar answered Oct 24 '22 03:10

David Miani