I wanted to reimplement some of my ASCII parsers in Haskell, since I thought I could gain some speed. However, even a simple "grep and count" is much slower than a sloppy Python implementation.
Can someone explain why, and how to do it correctly?
So the task is: count the lines that start with the string "foo".
My very basic Python implementation:
with open("foo.txt", 'r') as f:
    print len([line for line in f.readlines() if line.startswith('foo')])
And the Haskell version:
import System.IO
import Data.List
countFoos :: String -> Int
countFoos str = length $ filter (isPrefixOf "foo") (lines str)
main = do
    contents <- readFile "foo.txt"
    putStr (show $ countFoos contents)
Running both with time on a ~600 MB file with 17,001,895 lines reveals that the Python implementation is almost 4 times faster than the Haskell one (running on my MacBook Pro Retina 2015 with a PCIe SSD):
> $ time ./FooCounter
1770
./FooCounter 20.92s user 0.62s system 98% cpu 21.858 total
> $ time python foo_counter.py
1770
python foo_counter.py 5.19s user 1.01s system 97% cpu 6.332 total
Compared to unix command line tools:
> $ time grep -c foo foo.txt
1770
grep -c foo foo.txt 4.87s user 0.10s system 99% cpu 4.972 total
> $ time fgrep -c foo foo.txt
1770
fgrep -c foo foo.txt 6.21s user 0.10s system 99% cpu 6.319 total
> $ time egrep -c foo foo.txt
1770
egrep -c foo foo.txt 6.21s user 0.11s system 99% cpu 6.317 total
Any ideas?
UPDATE:
Using András Kovács' implementation (ByteString), I got it under half a second!
> $ time ./FooCounter
1770
./EvtReader 0.47s user 0.48s system 97% cpu 0.964 total
I benchmarked the following solution:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
main =
    print . length . filter (B.isPrefixOf "foo") . B.lines =<< B.readFile "test.txt"
test.txt is a 170 MB file with 8 million lines, and half of the lines start with "foo". I compiled with GHC 7.10 and -O2 -fllvm.
The ByteString version ran in 0.27 seconds, while the original version ran in 5.16 seconds.
However, the strict ByteString version uses 170 MB of memory to load the full file. Changing the import to Data.ByteString.Lazy.Char8, I got 0.39 sec runtime and 1 MB memory usage.
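Spelled out, a minimal sketch of that lazy version: the logic is identical, only the import changes, so the file is streamed in chunks instead of being held in memory whole (here I use the question's foo.txt rather than my test file):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Lazy ByteString variant: reads the file lazily in chunks,
-- so memory stays roughly constant regardless of file size.
import qualified Data.ByteString.Lazy.Char8 as BL

-- Count lines beginning with "foo".
countFoos :: BL.ByteString -> Int
countFoos = length . filter (BL.isPrefixOf "foo") . BL.lines

main :: IO ()
main = print . countFoos =<< BL.readFile "foo.txt"
```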
Your Haskell version uses the type String to represent the text of the file. String is an alias for [Char], which is a linked list of characters. That's not a good representation for large strings.
Try using the text package instead. It represents strings as arrays (in the Data.Text.* modules) or as linked lists of arrays (in the Data.Text.Lazy.* modules). To port your existing code, you probably want the latter, because I guess you don't want to load the full 600 MB file into memory at once. Look in the Data.Text.Lazy and Data.Text.Lazy.IO modules for variants of the readFile, filter, isPrefixOf etc. functions you're using.
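A rough sketch of what that port could look like (untested against your file; Data.Text.Lazy.IO.readFile decodes using your locale encoding, so the exact behavior on non-UTF-8 input is an assumption here):

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Lazy Text port of the original String version: same pipeline,
-- but backed by chunked arrays instead of a linked list of Chars.
import qualified Data.Text.Lazy    as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Count lines beginning with "foo".
countFoos :: TL.Text -> Int
countFoos = length . filter (TL.isPrefixOf "foo") . TL.lines

main :: IO ()
main = print . countFoos =<< TLIO.readFile "foo.txt"
```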
If you're sure you only want to support ASCII, you can also look into using the bytestring package instead of the text package.