Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does one sort with Data.Vector.Generic.Mutable?

How can you sort a long list of data (strings, floats etc) that is read from a large file (say a few million lines) using a Data.Vector.Generic.Mutable object and a sort algorithm from Data.Vector.Algorithms?

like image 425
Max Avatar asked Sep 07 '10 02:09

Max


1 Answers

Here's how to do it in the general case.

First, you need a mutable vector. You can build this incrementally as you scan the file; allocate a vector that's about as big as you need, and increase the size and copy when you run out of space. Or you can read the entire file, count the record separators, and allocate the right amount of space all at once. This is easier but probably not acceptable In Real Life. (The expand-as-needed strategy is pretty common; if you ever use a language like Perl and push lines you read from a file onto the end of an array, this is what's happening. Perl allocates some space for an array, when you fill it, it increases the amount of space, allocates new space, and copies.)

Anyway, I'm too lazy to do this, so I am just going to create a vector with some random numbers in it.

We need a bunch of libraries for this:

import Control.Monad
import System.Random
import qualified Data.Vector as IV
import qualified Data.Vector.Mutable as MV
import qualified Data.Vector.Generic as V
import qualified Data.Vector.Algorithms.Intro as VA

We don't need this all at once, but we'll need it eventually, so I thought I'd get it out of the way.

Anyway, our mutable vector is going to be a "normal" mutable vector, here MV.MVector.

The idea of a mutable vector is that you create it and modify it in a number of steps. In Haskell, there are a few ways to make this look pure to the calling code; one is to do it all inside the ST monad. Your ST action is creating the vector, modifying it, and "freezing" it into an immutable vector. Internally, you are using fast modify-this-memory-location-a-bunch-of-times operations, but externally, you have something that is pure. (Read the paper about ST if you want an argument as to why this is safe.)

Another way to deal with mutable data is to just do it inside the Everything, erm, IO, monad. That's what we're going to do here, as it's most convenient.

(Data.Vector.Mutable has two predefined vector types for you, IOVector and STVector. We're using IOVector, which puts all the vector operations into IO.)

So like 8 paragraphs ago, we were going to create a mutable vector to sort. And here we are:

randVector :: IO (MV.IOVector Int)
randVector = do
  v <- MV.new 10
  forM [0..9] $ \x -> do
    r <- randomIO :: IO Int
    MV.write v x r
  return v

This is an IO action that returns a new mutable vector with 10 random numbers inside it. (Random number generation can also be conveniently lifted into the IO monad, so we did that too, for convenience! It's like we're writing C, but with nicer syntax and more type safety.)

That's actually the hard part. To do the sorting, I imported Data.Vector.Algorithms.Intro which is basically an in-place quicksort. A function called sort does the actual sorting (in whichever monad the mutable vector is in).

An action that creates the random mutable vector and sorts it in place looks like:

sort = VA.sort =<< randVector

Now, to print that out, all we need to do is "freeze" the vector into an immutable vector, and print out the .toList. Or you can just iterate over every element and print it.

Here's what I came up with as an example:

main = do
  v <- randVector
  VA.sort v
  iv <- V.unsafeFreeze v :: IO (IV.Vector Int)
  print . V.toList $ iv

V.unsafeFreeze is from Data.Vector.Generic (how you interact with all vector types with the same API), as is V.toList.

Anyway, it's worth noting that IO is purely for convenience here. Since you're building a vector from file data, it's appropriate. 99% of the time, though, you should use ST. In your ST action, create the vector, sort it, freeze it, and return the frozen version.

A similar example using STVector:

randVector :: ST s (Vector Int)
randVector = do
  vec <- new 10
  rand <- newSTRef 17
  forM_ [0..9] $ \index -> do
    randN <- readSTRef rand
    let randN' = (fst . next . mkStdGen) randN
    writeSTRef rand randN'
    write vec index randN'
  unsafeFreeze vec

Then run with:

*Main> runST randVector
fromList [679560,1422110406,306332632,1905242129,692062628,393451229,355476175,1240028023,873588529,1181443777] :: Data.Vector.Vector
like image 166
jrockway Avatar answered Nov 15 '22 09:11

jrockway