Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell Regex performance

I've been looking at the existing options for regex in Haskell, and I wanted to understand where the gap in performance came from when comparing the various options with each other and especially with a simple call to grep...

I have a relatively small (~ 110M, compared to a usual several 10s of G in most of my use cases) trace file :

$ du radixtracefile
113120 radixtracefile
$ wc -l radixtracefile
1051565 radixtracefile

  • I first tried to find how many matches of the (arbitrary) pattern .*504.*ll were in there through grep :
$ time grep -nE ".*504.*ll" radixtracefile | wc -l
309

real   0m0.211s
user   0m0.202s
sys    0m0.010s

  • I looked at Text.Regex.TDFA (version 1.2.1) with Data.ByteString :
import Control.Monad.Loops
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Regex.TDFA
import qualified Data.ByteString as B

main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches

Building and running :

$ ghc -O2 test-TDFA.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-TDFA.hs, test-TDFA.o )
Linking test-TDFA ...
$ time ./test-TDFA | wc -l
309

real   0m4.463s
user   0m4.431s
sys    0m0.036s

  • Then, I looked at Data.Text.ICU.Regex (version 0.7.0.1) with Unicode support:
import Control.Monad.Loops
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Regex

main = do
    re <- regex [] $ T.pack ".*504.*ll"
    f <- TIO.readFile "radixtracefile"
    setText re f
    whileM_ (findNext re) $ do
        a <- start re 0
        putStrLn $ "last match at :"++(show a)

Building and running :

$ ghc -O2 test-ICU.hs
[1 of 1] Compiling Main             ( test-ICU.hs, test-ICU.o )
Linking test-ICU ...
$ time ./test-ICU | wc -l
309

real   1m36.407s
user   1m36.090s
sys    0m0.169s

I use ghc version 7.6.3. I haven't had the occasion of testing other Haskell regex options. I knew that I would not get the performance that I had with grep and was more than happy with that, but more or less 20 times slower for the TDFA and ByteString... That is very scary. And I can't really understand why it is what it is, as I naively though that this was a wrapper on a native backend... Am I somehow not using the module correctly ?

(And let's not mention the ICU + Text combo which is going through the roof)

Is there an option that I haven't tested yet that would make me happier ?

EDIT :

  • Text.Regex.PCRE (version 0.94.4) with Data.ByteString :
import Control.Monad.Loops
import Data.Maybe
import Text.Regex.PCRE
import qualified Data.ByteString as B

main = do
    f <- B.readFile "radixtracefile"
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (putStrLn . show . head) matches

Building and running :

$ ghc -O2 test-PCRE.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-PCRE.hs, test-PCRE.o )
Linking test-PCRE ...
$ time ./test-PCRE | wc -l
309

real   0m1.442s
user   0m1.412s
sys    0m0.031s

Better, but still with a factor of ~7-ish ...

like image 330
gameboo Avatar asked Nov 08 '22 21:11

gameboo


1 Answers

So, after looking at other libraries for a bit, I ended up trying PCRE.Ligth (version 0.4.0.4) :

import Control.Monad
import Text.Regex.PCRE.Light
import qualified Data.ByteString.Char8 as B

main = do
    f <- B.readFile "radixtracefile"
    let lines = B.split '\n' f
    let re = compile (B.pack ".*504.*ll") []
    forM_ lines $ \l -> maybe (return ()) print $ match re l []

Here is what I get out of that :

$ ghc -O2 test-PCRELight.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-PCRELight.hs, test-PCRELight.o )
Linking test-PCRELight ...
$ time ./test-PCRELight | wc -l
309

real   0m0.832s
user   0m0.803s
sys    0m0.027s

I think this is decent enough for my purposes. I might try to see what happens with the other libs when I manually do the line splitting like I did here, although I doubt it's going to make a big difference.

like image 189
gameboo Avatar answered Nov 15 '22 04:11

gameboo