I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 3 GB/s (not 9 GB/s as I first wrote; sorry for the imprecision, see comments -- I'm more interested in the big-picture orders of magnitude anyway).
One would think this would be a solved problem by now, i.e. that any modern compiler would make it straightforward to write such programs and have them perform reasonably well.
#include <stdio.h>

int main(int argc, char **argv) {
    int len = 10000000;
    for (int a = 1; a <= len; a++) {
        printf("%d\n", a);
    }
    return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
A naive Haskell implementation runs at 13 MiB/s, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm; perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs even slower, at 17 MiB/s. The problem is I can't find a good way to convert Vectors into ByteStrings (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
import qualified System.IO

writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v

main = writeVector (V.generate 10000000 id)
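For reference, one route that at least avoids per-element putStrLn (a hedged sketch; this writeVector variant is my own, not a library function: it renders the whole vector to a single String via a right fold, packs it once, and writes it with one putStr):

```haskell
import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString.Char8 as B

-- Hypothetical variant: build one big String with a right fold (shows
-- glues each number and its newline on), pack it once, write it once.
writeVector :: V.Vector Int -> IO ()
writeVector = B.putStr . B.pack . V.foldr (\x acc -> shows x ('\n' : acc)) ""

main :: IO ()
main = writeVector (V.generate 10000000 id)
```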
It seems that writing ByteStrings is fast, as demonstrated by this code, which writes an equivalent number of characters:
import Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
The timings for the various versions:

1. The naive code from the question (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
2. forM_ instead of forM took 5.64s, about 13.5MB/s.
3. The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
4. The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
5. The Vector version from the question took 5.54s, about 13.7MB/s.
6. main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
7. putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
8. A manual loop,

       main = putStr $ go 1
         where
           go :: Int -> String
           go i
             | i > 10000000 = ""
             | otherwise    = shows i . showChar '\n' $ go (i+1)

   took 2.32s, about 32.75MB/s.
9. main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
10. main = C.putStrLn $ C.replicate 78888896 'a', where C is Data.ByteString.Char8, took 0.143s, about 530MB/s; roughly the same figures hold for lazy ByteStrings.

First, don't use forM or mapM unless you really want to collect the results. Performance-wise, that sucks.
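A minimal illustration of that point (the same loop as the naive version, but with forM_, which discards the results instead of accumulating a ten-million-element list of () values):

```haskell
import Control.Monad (forM_)

-- forM_ throws the () results away; forM would collect them all in a list.
main :: IO ()
main = forM_ [1 :: Int .. 10000000] (putStrLn . show)
```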
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of

    Chunk "1234567" Empty

and between any two such, a Chunk "\n" Empty is put; then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String versions can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many, slightly longer ByteStrings, which pays off.
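Putting those two fixes together, a sketch of what that glued-newline ByteString version could look like (my own reconstruction from the description above, not the author's exact code):

```haskell
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

-- Glue the newline onto each number with shows, pack each piece into a
-- strict ByteString, and rebuild one lazy ByteString from the chunks.
main :: IO ()
main = L.putStr . L.fromChunks . map (\k -> S.pack (shows k "\n")) $ [1 :: Int .. 10000000]
```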
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled into believing that is always so: if the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], which is already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.
Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString

main =
  ByteString.putStrLn .
  ByteString.intercalate (ByteString.singleton '\n') .
  map (ByteString.pack . show) $
    ([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text

main :: IO ()
main =
  Text.putStrLn .
  Text.toLazyText .
  mconcat .
  intersperse (Text.singleton '\n') .
  map Text.decimal $
    ([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?), but I imagine that one of these procedures will be the fastest trivial program you can get.
If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to a series of system calls that write chunks of memory to a file. Lazy bytestrings are a good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of system calls that write chunks of memory is what L.hPut does (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))

main :: IO ()
main = do
  h <- openFile "/dev/null" WriteMode
  L.hPut h $ toLazyByteString $
    foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings.
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E

L.hPut h $ toLazyByteString $
  E.encodeListWithB
    ((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
    [1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation (phew, stupid hyperlink limit for new users) for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution, provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), the right datastructure to use is a bytestring Builder.