Map from one data structure to another

Tags:

haskell

Apologies for what seems like a basic question. I'm trying to figure out how to convert a Data.Text into a vector of Char. The best I could come up with is this:

textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
   append c vec = V.snoc vec c

This seems somewhat complicated (and computationally inefficient?). Is there a better way?

More generally, as a Haskell newbie, I'd appreciate any advice on how to go about figuring out this sort of thing for myself. I wasn't able to make much progress from either Google or searching through the documentation.

Asked by happydave on Dec 29 '16

1 Answer

Better is kind of subjective. What makes something better? Less code? Less memory consumption? Less time needed?

That being said, yes, your algorithm is computationally inefficient, since snoc is O(n), where n is the length of the (intermediate) vector: every snoc copies the whole vector just to append one element. You have therefore created an O(n²) algorithm (n snoc operations on vectors of increasing length, i.e. 1 + 2 + 3 + ... + (n-1) = n*(n-1)/2 copy steps in total).

Depending on the constants hidden away in the big-O notation, one would expect your function to be slow on large inputs. But before we benchmark it, let's think about the alternatives:

  • we can go via an intermediate data structure that both types can convert to and from, e.g. [Char] via T.unpack and V.fromList
  • we can unfold the vector with V.unfoldr, splitting single characters off the T.Text with T.uncons (both alternatives are sketched below).
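As a rough sketch (the names viaList and viaUnfoldr are mine, not part of the original answer), the two alternatives look like this:

import qualified Data.Text as T
import qualified Data.Vector as V

-- Alternative 1: bridge through an ordinary list.
viaList :: T.Text -> V.Vector Char
viaList = V.fromList . T.unpack

-- Alternative 2: unfold the vector directly, peeling off one Char at a time.
viaUnfoldr :: T.Text -> V.Vector Char
viaUnfoldr = V.unfoldr T.uncons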

So let's use Criterion to create some benchmarks:

module Main where

import Criterion
import Criterion.Main
import Control.Monad (replicateM)
import qualified Data.Text as T
import qualified Data.Vector as V
import System.Random (randomRIO)

-- Generate n random printable ASCII characters and pack them into a Text.
setupEnv :: Int -> IO T.Text
setupEnv n = fmap T.pack $ replicateM n (randomRIO (' ','~'))

textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
   append c vec = V.snoc vec c

example :: T.Text
example = T.pack "Hello, World"

-- One benchmark per implementation, each reducing its result to WHNF on the same input.
benchOn :: T.Text -> [Benchmark]
benchOn ~e = map (\(s,f) -> bench s $ whnf f e)
     [("textToVec", textToVec)
     ,("T.foldr (flip V.snoc) V.empty", T.foldr (flip V.snoc) V.empty)
     ,("V.fromList . T.unpack", V.fromList . T.unpack)
     ,("V.unfoldr T.uncons", V.unfoldr T.uncons)
     ]

main = defaultMain [
    bgroup "example" $ benchOn example
  , env (setupEnv $  1000) $ \ ~small -> bgroup  "1000" $ benchOn small 
  , env (setupEnv $ 10000) $ \ ~small -> bgroup "10000" $ benchOn small 
  ]
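As an aside (not spelled out in the original answer): Criterion's env evaluates the generated Text to normal form before the benchmarks run, so the setup cost is not measured, and the environment has to be matched lazily (hence the ~small patterns), because Criterion needs to build the benchmark tree without forcing the environment.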

I've used lts-5 to compile it:

stack exec --package random\
           --package vector\
           --package text\
           --resolver=lts-5\
      -- ghc -O2 Benchmark.hs
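After compiling, you run the resulting Benchmark binary to get the timings below. If you prefer a graphical view, Criterion's defaultMain also understands an --output flag that writes an HTML report (an optional extra, not something the invocation above relies on).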

Benchmarking results

We first run all variants on "Hello, World", then on a randomly generated string of length 1000, and then on one of length 10000.

"Hello, World"

benchmarking example/textToVec
time                 1.106 us   (1.098 us .. 1.117 us)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.102 us   (1.099 us .. 1.107 us)
std dev              12.67 ns   (6.277 ns .. 21.00 ns)

benchmarking example/T.foldr (flip V.snoc) V.empty
time                 1.225 us   (1.222 us .. 1.229 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.229 us   (1.226 us .. 1.232 us)
std dev              10.48 ns   (7.634 ns .. 16.00 ns)

benchmarking example/V.fromList . T.unpack
time                 643.6 ns   (640.9 ns .. 646.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 643.6 ns   (641.7 ns .. 645.7 ns)
std dev              6.525 ns   (5.573 ns .. 7.860 ns)

benchmarking example/V.unfoldr T.uncons
time                 518.1 ns   (516.1 ns .. 520.4 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 518.0 ns   (516.1 ns .. 520.1 ns)
std dev              6.541 ns   (4.943 ns .. 8.479 ns)
variance introduced by outliers: 11% (moderately inflated)

As you can see, at this point unfolding or going via a list takes about half the time of your function: both the unfolding approach and fromList . unpack finish in roughly half a microsecond.

Random text with 1000 characters

benchmarking 1000/textToVec
time                 1.311 ms   (1.302 ms .. 1.320 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.342 ms   (1.331 ms .. 1.366 ms)
std dev              55.31 us   (27.75 us .. 96.80 us)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 1000/T.foldr (flip V.snoc) V.empty
time                 6.013 ms   (5.953 ms .. 6.081 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 6.054 ms   (6.028 ms .. 6.097 ms)
std dev              100.8 us   (62.45 us .. 180.7 us)

benchmarking 1000/V.fromList . T.unpack
time                 26.83 us   (26.77 us .. 26.92 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 26.90 us   (26.83 us .. 27.00 us)
std dev              264.1 ns   (209.7 ns .. 348.2 ns)

benchmarking 1000/V.unfoldr T.uncons
time                 15.05 us   (14.93 us .. 15.19 us)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.02 us   (14.95 us .. 15.15 us)
std dev              303.0 ns   (206.6 ns .. 428.3 ns)
variance introduced by outliers: 19% (moderately inflated)

This is where it gets interesting: your function is actually faster than the point-free T.foldr (flip V.snoc) V.empty variant, even though they look equivalent. One would have to look into GHC's Core to see where the bottleneck is.
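If you want to dig into that yourself, compiling with GHC's -ddump-simpl flag (optionally adding -dsuppress-all to reduce the noise) prints the optimized Core; I mention these flags only as a starting point, they are not part of the benchmark setup above.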

That being said, for a string that's about 100 times longer than the original one (length "Hello, World" == 12), our two contestants V.unfoldr T.uncons and V.fromList . T.unpack take only about 30 and 45 times longer, respectively, than on the short example.

Your function, on the other hand, needs about 1 ms compared to the previous 1 µs: 100 times the data, 1000 times the time. Uh oh…

Random text with 10000 characters

benchmarking 10000/textToVec
time                 190.9 ms   (188.7 ms .. 192.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 192.5 ms   (191.8 ms .. 193.6 ms)
std dev              1.096 ms   (684.4 us .. 1.426 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 10000/T.foldr (flip V.snoc) V.empty
time                 859.3 ms   (856.5 ms .. 860.9 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 861.9 ms   (860.7 ms .. 862.6 ms)
std dev              1.042 ms   (0.0 s .. 1.173 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 10000/V.fromList . T.unpack
time                 567.8 us   (565.5 us .. 570.6 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 571.7 us   (569.8 us .. 574.0 us)
std dev              7.132 us   (5.674 us .. 9.545 us)

benchmarking 10000/V.unfoldr T.uncons
time                 200.6 us   (199.4 us .. 201.8 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 200.2 us   (199.4 us .. 201.2 us)
std dev              2.948 us   (2.240 us .. 3.837 us)

We increased the data size by a factor of 10. Your function now takes roughly 150 times as long as in the previous run, whereas unfoldr and fromList take only about 13 and 21 times as long, respectively.

The winner

According to the benchmarks, V.unfoldr T.uncons takes the cake. So how would you, as a Haskell newbie, find that function?

Well, it's not that obvious at first. The "obvious" approach is usually to go via a list, since lists are everywhere in Haskell, and to be honest, V.fromList . T.unpack works quite well. That trick also works for every data structure that provides a fromList (as target) or a toList (as source); in particular, it works for any Foldable instance as the source (Text isn't one, so you use T.unpack instead).
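As a small illustration (the Seq example and the name foldableToVec are mine, not from the answer), the same list bridge works for any Foldable source:

import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import qualified Data.Vector as V

-- Any Foldable container can be drained into a list and rebuilt as a Vector.
foldableToVec :: Foldable t => t Char -> V.Vector Char
foldableToVec = V.fromList . toList

-- For example, foldableToVec (Seq.fromList "Hello") yields a five-element Vector Char.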

You could have found that if you looked into the documentation and saw the type of both V.unfoldr and T.uncons at the same time:

V.unfoldr :: (b      -> Maybe (a   , b     )) -> b      -> V.Vector a
T.uncons  ::  T.Text -> Maybe (Char, T.Text)
V.unfoldr T.uncons ::                            T.Text -> V.Vector Char

But now you know that there's the unfoldr/uncons pattern, so you know how to use it in the future.
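For completeness, a minimal self-contained usage sketch of the winning combination (the module layout and the primed name are mine):

import qualified Data.Text as T
import qualified Data.Vector as V

textToVec' :: T.Text -> V.Vector Char
textToVec' = V.unfoldr T.uncons

main :: IO ()
main = print (V.length (textToVec' (T.pack "Hello, World")))  -- prints 12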

Answered by Zeta on Oct 16 '22