Apologies for what seems like a basic question. I'm trying to figure out how to convert a Data.Text into a vector of Char. The best I could come up with is this:
textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
append c vec = V.snoc vec c
This seems somewhat complicated (and computationally inefficient?) Is there a better way?
More generally, as a Haskell newbie, I'd appreciate any advice on how to go about figuring out this sort of thing for myself. I wasn't able to make much progress from either Google or searching through the documentation.
Better is kind of subjective. What makes something better? Less code? Less memory consumption? Less time needed?
That being said, yes, your algorithm is computationally inefficient, since snoc
has O(n), where n
is the length of your (intermediate) vector. Therefore, you have created an O(n*n) algorithm (n snoc
operations, with increasing length, therefore 1 + 2 + 3 + ... + (n-1) = n*(n-1)/2).
Depending on the constants hidden away in the big-O notation, one would expect your function to work slow on large lists. But before we benchmark it, let's think about the alternatives:
[Char]
via V.fromList
and T.pack
V.unfoldr
by splitting single chars from the T.Text
with T.uncons
.So let's use Criterion to create some benchmarks:
module Main where
import Criterion
import Criterion.Main
import Control.Monad (replicateM)
import qualified Data.Text as T
import qualified Data.Vector as V
import System.Random (randomRIO)
setupEnv n = fmap T.pack $ replicateM n (randomRIO (' ','~'))
textToVec :: T.Text -> V.Vector Char
textToVec t = T.foldr append V.empty t where
append c vec = V.snoc vec c
example :: T.Text
example = T.pack "Hello, World"
benchOn :: T.Text -> [Benchmark]
benchOn ~e = map (\(s,f) -> bench s $ whnf f e)
[("textToVec", textToVec)
,("T.foldr (flip V.snoc) V.empty", T.foldr (flip V.snoc) V.empty)
,("V.fromList . T.unpack", V.fromList . T.unpack)
,("V.unfoldr T.uncons", V.unfoldr T.uncons)
]
main = defaultMain [
bgroup "example" $ benchOn example
, env (setupEnv $ 1000) $ \ ~small -> bgroup "1000" $ benchOn small
, env (setupEnv $ 10000) $ \ ~small -> bgroup "10000" $ benchOn small
]
I've used lts-5 to compile it:
stack exec --package random\
--package vector\
--package text\
--resolver=lts-5\
-- ghc -O2 Benchmark.hs
We first run all variants on "Hello, world", then on a randomly generated string of length 1000, and then one of length 10000.
benchmarking example/textToVec
time 1.106 us (1.098 us .. 1.117 us)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.102 us (1.099 us .. 1.107 us)
std dev 12.67 ns (6.277 ns .. 21.00 ns)
benchmarking example/T.foldr (flip V.snoc) V.empty
time 1.225 us (1.222 us .. 1.229 us)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.229 us (1.226 us .. 1.232 us)
std dev 10.48 ns (7.634 ns .. 16.00 ns)
benchmarking example/V.fromList . T.unpack
time 643.6 ns (640.9 ns .. 646.4 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 643.6 ns (641.7 ns .. 645.7 ns)
std dev 6.525 ns (5.573 ns .. 7.860 ns)
benchmarking example/V.unfoldr T.uncons
time 518.1 ns (516.1 ns .. 520.4 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 518.0 ns (516.1 ns .. 520.1 ns)
std dev 6.541 ns (4.943 ns .. 8.479 ns)
variance introduced by outliers: 11% (moderately inflated)
As you can see, at this point, unfolding or going via list takes about half the time compared to your function. Both the unfolding approach as well the fromList . unpack
take half a microsecond.
benchmarking 1000/textToVec
time 1.311 ms (1.302 ms .. 1.320 ms)
0.999 R² (0.999 R² .. 1.000 R²)
mean 1.342 ms (1.331 ms .. 1.366 ms)
std dev 55.31 us (27.75 us .. 96.80 us)
variance introduced by outliers: 29% (moderately inflated)
benchmarking 1000/T.foldr (flip V.snoc) V.empty
time 6.013 ms (5.953 ms .. 6.081 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 6.054 ms (6.028 ms .. 6.097 ms)
std dev 100.8 us (62.45 us .. 180.7 us)
benchmarking 1000/V.fromList . T.unpack
time 26.83 us (26.77 us .. 26.92 us)
1.000 R² (1.000 R² .. 1.000 R²)
mean 26.90 us (26.83 us .. 27.00 us)
std dev 264.1 ns (209.7 ns .. 348.2 ns)
benchmarking 1000/V.unfoldr T.uncons
time 15.05 us (14.93 us .. 15.19 us)
0.999 R² (0.999 R² .. 1.000 R²)
mean 15.02 us (14.95 us .. 15.15 us)
std dev 303.0 ns (206.6 ns .. 428.3 ns)
variance introduced by outliers: 19% (moderately inflated)
This is where it gets interesting. Your function is better than the point-free one. One would have to look into GHCs core to see where the bottleneck is.
That being said, for a string that's about 100 times longer than our original one (length "Hello, World" == 12
), our two contestants V.unfoldr T.uncons
and V.fromList . T.unpack
take about 30/45 times longer on that data compared to the example one.
Your function, on the other hand is needs 1ms compared to the previous 1µs. For 100 times the data 1000 times the time. Oh oh…
benchmarking 10000/textToVec
time 190.9 ms (188.7 ms .. 192.8 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 192.5 ms (191.8 ms .. 193.6 ms)
std dev 1.096 ms (684.4 us .. 1.426 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 10000/T.foldr (flip V.snoc) V.empty For a
time 859.3 ms (856.5 ms .. 860.9 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 861.9 ms (860.7 ms .. 862.6 ms)
std dev 1.042 ms (0.0 s .. 1.173 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 10000/V.fromList . T.unpack
time 567.8 us (565.5 us .. 570.6 us)
1.000 R² (1.000 R² .. 1.000 R²)
mean 571.7 us (569.8 us .. 574.0 us)
std dev 7.132 us (5.674 us .. 9.545 us)
benchmarking 10000/V.unfoldr T.uncons
time 200.6 us (199.4 us .. 201.8 us)
1.000 R² (1.000 R² .. 1.000 R²)
mean 200.2 us (199.4 us .. 201.2 us)
std dev 2.948 us (2.240 us .. 3.837 us)
We increased the data size by factor 10. Your function takes 200 times the time, compared to the previous run, whereas both unfoldr and fromList
take only 20 times the time.
According to the benchmarks, V.unfoldr T.uncons
takes the cake. So how would you, as a Haskell newbie, find that function?
Well, it's not that obvious at first. The "obvious" approach is usually to use a list, since they're everywhere in Haskell, and to be honest, V.fromList . T.unpack
seems to work quite well. Also, that trick works for every data structure that provides a fromList
(as target) or toList (as source). Therefore, it works for any
Foldableinstance (
Text` isn't one).
You could have found that if you looked into the documentation and saw the type of both V.unfoldr
and T.uncons
at the same time:
V.unfoldr :: (b -> Maybe (a , b )) -> b -> V.Vector a
T.uncons :: T.Text -> Maybe (Char, T.Text)
V.unfoldr T.uncons :: T.Text -> V.Vector Char
But now you know that there's the unfoldr/uncons pattern, so you know how to use it in the future.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With