Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell vector C++ push_back analogue

I've discovered that Haskell Data.Vector.* miss C++ std::vector::push_back's functionality. There is grow/unsafeGrow, but they seem to have O(n) complexity.

Is there a way to grow vectors in O(1) amortized time for an element?

like image 601
Aleksandr Pakhomov Avatar asked Jul 23 '15 21:07

Aleksandr Pakhomov


People also ask

Does Push_back make a copy C++?

Yes, std::vector<T>::push_back() creates a copy of the argument and stores it in the vector.

What does Push_back mean?

The push_back() function is used to insert an element at the end of a vector.

What is a vector in Haskell?

Vector is a Haskell library for working with arrays. It has an emphasis on very high performance through loop fusion, whilst retaining a rich interface. The main data types are boxed and unboxed arrays, and arrays may be immutable (pure), or mutable.

What is a vector in Haskell?

Numeric Haskell: A Vector Tutorial. Vector is a Haskell library for working with arrays. It has an emphasis on very high performance through loop fusion, whilst retaining a rich interface. The main data types are boxed and unboxed arrays, and arrays may be immutable (pure), or mutable.

How to push an element into a vector from the back?

vector::push_back () push_back () function is used to push elements into a vector from the back. The new value is inserted into the vector at the end, after the current last element and the container size is increased by 1.

What is the difference between vector push_back and pop_back in C++ STL?

vector::push_back() and vector::pop_back() in C++ STL. Vectors are same as dynamic arrays with the ability to resize itself automatically when an element is inserted or deleted, with their storage being handled automatically by the container. vector::push_back() push_back() function is used to push elements into a vector from the back.

How to create a multidimensional array in Haskell?

You can create the arrays in many ways, for example, from a regular Haskell list: GHCi will print the contents of the vector as executable code. To create a multidimensional array, you can use a nested list generator to fill it: -- XXX TODO need a better printing function for multidimensional arrays.


1 Answers

No there really is no such facility in Data.Vector. It isn't too difficult to implement this from scratch using MutableArray like Data.Vector.Mutable does (see my implementation below), but there are some significant drawbacks. In particular, all of its operations end up happening inside some state context usually ST or IO. This has the downsides that

  1. Any code that manipulates such a data structure ends up having to be monadic
  2. The compiler is much less likely to be able to optimize. For example, libraries like vector use something really clever called fusion to optimize away intermediate allocations. This sort of thing is not possible in a state context.
  3. Parallelism is going to be a lot tougher: in ST I can't even have two threads and in IO I will have race conditions all over the place. The nasty bit here is that any sharing is going to have to happen in IO.

As if all this wasn't enough, garbage collection also performs better inside pure code.

What do I do then?

It isn't particularly often that you have a need for exactly this behaviour - usually you are better off using an immutable data structure (thereby avoiding all of the aforementioned problems) which does something similar. Just limiting ourselves to containers which comes with GHC, some alternatives include:

  • if you are almost always just using push_back, maybe you just want a stack (a plain old [a]).
  • if you anticipate doing more push_back than lookups, Data.Sequence gives you O(1) appending to either end and O(log n) lookup.
  • if you are interested in a lot of operations especially hashmap-like, Data.IntMap is pretty optimized. Even if the theoretical cost of those operations is O(log n), you will need a pretty big IntMap to start feeling those costs.

Making something like C++ vector

Of course, if one doesn't care about the restrictions mentioned initially, there is no reason not to have a C++ like vector. Just for fun, I went ahead and implemented this from scratch (needs packages data-default and primitive).

The reason this code is probably not already in some library is that it goes against much of the spirit of Haskell (I do this with the intent of conforming to a C++ style vector).

  • The only operation that actually makes a new vector is newVector - everything else "modifies" an existing vector. Since pushBack doesn't return a new GrowVector, it has to modify the existing one (including its length and/or capacity), so length and capacity have to be "pointers". In turn, that means that even getting the length is a monadic operation.
  • While this isn't unboxed, it would not be too difficult to replicate vectors data family approach - it is just tedious1.

With that said:

module GrowVector (
  GrowVector, newEmpty, size, read, write, pushBack, popBack
) where 

import Data.Primitive.Array
import Data.Primitive.MutVar
import Data.Default
import Control.Monad
import Control.Monad.Primitive (PrimState, PrimMonad)
import Prelude hiding (length, read)

data GrowVector s a = GrowVector
  { underlying :: MutVar s (MutableArray s a) -- ^ underlying array
  , length :: MutVar s Int                    -- ^ perceived length of vector
  , capacity :: MutVar s Int                  -- ^ actual capacity
  }

type GrowVectorIO = GrowVector (PrimState IO)

-- | Make a new empty vector with the given capacity. O(n)
newEmpty :: (Default a, PrimMonad m) => Int -> m (GrowVector (PrimState m) a)
newEmpty cap = do
  arr <- newArray cap def
  GrowVector <$> newMutVar arr <*> newMutVar 0 <*> newMutVar cap

-- | Read an element in the vector (unchecked). O(1)
read :: PrimMonad m => GrowVector (PrimState m) a -> Int -> m a
g `read` i = do arr <- readMutVar (underlying g); arr `readArray` i

-- | Find the size of the vector. O(1)
size :: PrimMonad m => GrowVector (PrimState m) a -> m Int
size g = readMutVar (length g)

-- | Double the vector capacity. O(n)
resize :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> m ()
resize g = do
  curCap <- readMutVar (capacity g)         -- read current capacity
  curArr <- readMutVar (underlying g)       -- read current array
  curLen <- readMutVar (length g)           -- read current length
  newArr <- newArray (2 * curCap) def       -- allocate a new array twice as big
  copyMutableArray newArr 1 curArr 1 curLen -- copy the old array over
  underlying g `writeMutVar` newArr         -- use the new array in the vector
  capacity g `modifyMutVar'` (*2)           -- update the capacity in the vector

-- | Write an element to the array (unchecked). O(1)
write :: PrimMonad m => GrowVector (PrimState m) a -> Int -> a  -> m ()
write g i x = do arr <- readMutVar (underlying g); writeArray arr i x

-- | Pop an element of the vector, mutating it (unchecked). O(1)
popBack :: PrimMonad m => GrowVector (PrimState m) a -> m a
popBack g = do
  s <- size g;
  x <- g `read` (s - 1)
  length g `modifyMutVar'` (+ negate 1)
  pure x

-- | Push an element. (Amortized) O(1)
pushBack :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> a -> m ()
pushBack g x = do
  s <- readMutVar (length g)                -- read current size
  c <- readMutVar (capacity g)              -- read current capacity
  when (s+1 == c) (resize g)                -- if need be, resize
  write g (s+1) x                           -- write to the back of the array
  length g `modifyMutVar'` (+1)             -- increase te length

Current semantics of grow

I think the github issue does a pretty good job of explaining the semantics:

I think the intended semantics are that it may do a realloc, but not guaranteed to, and all the current implementations do the simpler copying semantics because for on heap allocations the cost should be roughly the same.

Basically you should use grow when you want a new mutable vector of an increased size, starting with the elements of the old vector (and no longer care about the old vector). This is quite useful - for example one could implement GrowVector using MVector and grow.


1 the approach is that for every new type of unboxed vector you want to have, you make a data instance that "expands" your type into a fixed number of unboxed arrays (or other unboxed vectors). This is the point of data family - to allow different instantiations of a type to have totally different runtime representations, and to also be extensible (you can add your own data instance if you want).

like image 75
Alec Avatar answered Sep 19 '22 04:09

Alec