Parsing ASCII file efficiently in Haskell

I wanted to reimplement some of my ASCII parsers in Haskell since I thought I could gain some speed. However, even a simple "grep and count" is much slower than a sloppy Python implementation.

Can someone explain to me why, and how to do it correctly?

So the task is: count the lines which start with the string "foo".

My very basic Python implementation:

with open("foo.txt", 'r') as f:
    print len([line for line in f.readlines() if line.startswith('foo')])

And the Haskell version:

import System.IO 
import Data.List

countFoos :: String -> Int
countFoos str = length $ filter (isPrefixOf "foo") (lines str)

main = do
    contents <- readFile "foo.txt"
    putStr (show $ countFoos contents)

Running both with time on a ~600 MB file with 17,001,895 lines reveals that the Python implementation is almost 4 times faster than the Haskell one (running on my MacBook Pro Retina 2015 with a PCIe SSD):

> $ time ./FooCounter
1770
./FooCounter  20.92s user 0.62s system 98% cpu 21.858 total

> $ time python foo_counter.py                                                   
1770
python foo_counter.py  5.19s user 1.01s system 97% cpu 6.332 total

Compared to the Unix command-line tools:

> $ time grep -c foo foo.txt                                                     
1770
grep -c foo foo.txt   4.87s user 0.10s system 99% cpu 4.972 total

> $ time fgrep -c foo foo.txt                                                     
1770
fgrep -c foo foo.txt  6.21s user 0.10s system 99% cpu 6.319 total

> $ time egrep -c foo foo.txt                                                     
1770
egrep -c foo foo.txt  6.21s user 0.11s system 99% cpu 6.317 total

Any ideas?

UPDATE:

Using András Kovács's implementation (ByteString), I got it down to under half a second!

> $ time ./FooCounter
1770
./FooCounter  0.47s user 0.48s system 97% cpu 0.964 total
asked Jul 16 '15 by tamasgal


2 Answers

I benchmarked the following solution:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B

main =
  print . length . filter (B.isPrefixOf "foo") . B.lines =<< B.readFile "test.txt"

test.txt is a 170 MB file with 8 million lines, half of which start with "foo". I compiled with GHC 7.10 and -O2 -fllvm.

The ByteString version ran in 0.27 seconds, while the original version ran in 5.16 seconds.

However, the strict ByteString version uses 170 MB of memory, since it loads the full file at once. Changing the import to Data.ByteString.Lazy.Char8, I got a runtime of 0.39 seconds and 1 MB of memory usage.
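
For reference, here is the lazy variant in full; as noted above, only the import line changes:

{-# LANGUAGE OverloadedStrings #-}

-- Same program, but the lazy ByteString streams the file in chunks
-- instead of loading it all into memory at once.
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main =
  print . length . filter (B.isPrefixOf "foo") . B.lines =<< B.readFile "test.txt"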

answered by András Kovács


Your Haskell version uses the type String to represent the text of the file. String is an alias for [Char], which is a linked list of characters. That's not a good representation for large inputs: every character lives in its own heap-allocated list cell.
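
To make that concrete, the Prelude defines it as a plain type synonym:

type String = [Char]  -- a lazy, singly linked list of boxed characters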

Try using the text package instead. It represents strings as arrays (in the Data.Text.* modules) or as linked lists of arrays (in the Data.Text.Lazy.* modules). To port your existing code, you probably want the latter, because I guess you don't want to load the full 600MB file into memory at once. Look in the Data.Text.Lazy and Data.Text.Lazy.IO modules for variants of the readFile, filter, isPrefixOf etc. functions you're using.
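
A minimal sketch of such a port (using the lazy Text API, whose functions mirror the list functions above) could look like this:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Same counting logic as the original, but on lazy Text,
-- which reads the file in chunks instead of building a [Char].
main :: IO ()
main = do
    contents <- TLIO.readFile "foo.txt"
    print (length (filter (TL.isPrefixOf "foo") (TL.lines contents)))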

If you're sure you only want to support ASCII, you can also look into using the bytestring package instead of the text package.

answered by Toxaris