I've been looking at the existing options for regex in Haskell, and I wanted to understand where the gap in performance came from when comparing the various options with each other and especially with a simple call to grep...
I have a relatively small (~ 110M, compared to a usual several 10s of G in most of my use cases) trace file :
$ du radixtracefile
113120 radixtracefile
$ wc -l radixtracefile
1051565 radixtracefile
.*504.*ll
were in there through grep : $ time grep -nE ".*504.*ll" radixtracefile | wc -l
309
real 0m0.211s
user 0m0.202s
sys 0m0.010s
import Control.Monad.Loops
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Text.Regex.TDFA
import qualified Data.ByteString as B
main = do
f <- B.readFile "radixtracefile"
matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
mapM_ (putStrLn . show . head) matches
Building and running :
$ ghc -O2 test-TDFA.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-TDFA.hs, test-TDFA.o )
Linking test-TDFA ...
$ time ./test-TDFA | wc -l
309
real 0m4.463s
user 0m4.431s
sys 0m0.036s
import Control.Monad.Loops
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Regex
main = do
re <- regex [] $ T.pack ".*504.*ll"
f <- TIO.readFile "radixtracefile"
setText re f
whileM_ (findNext re) $ do
a <- start re 0
putStrLn $ "last match at :"++(show a)
Building and running :
$ ghc -O2 test-ICU.hs
[1 of 1] Compiling Main ( test-ICU.hs, test-ICU.o )
Linking test-ICU ...
$ time ./test-ICU | wc -l
309
real 1m36.407s
user 1m36.090s
sys 0m0.169s
I use ghc version 7.6.3. I haven't had the occasion of testing other Haskell regex options. I knew that I would not get the performance that I had with grep and was more than happy with that, but more or less 20 times slower for the TDFA and ByteString... That is very scary. And I can't really understand why it is what it is, as I naively though that this was a wrapper on a native backend... Am I somehow not using the module correctly ?
(And let's not mention the ICU + Text combo which is going through the roof)
Is there an option that I haven't tested yet that would make me happier ?
EDIT :
import Control.Monad.Loops
import Data.Maybe
import Text.Regex.PCRE
import qualified Data.ByteString as B
main = do
f <- B.readFile "radixtracefile"
matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
mapM_ (putStrLn . show . head) matches
Building and running :
$ ghc -O2 test-PCRE.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRE.hs, test-PCRE.o )
Linking test-PCRE ...
$ time ./test-PCRE | wc -l
309
real 0m1.442s
user 0m1.412s
sys 0m0.031s
Better, but still with a factor of ~7-ish ...
So, after looking at other libraries for a bit, I ended up trying PCRE.Ligth (version 0.4.0.4) :
import Control.Monad
import Text.Regex.PCRE.Light
import qualified Data.ByteString.Char8 as B
main = do
f <- B.readFile "radixtracefile"
let lines = B.split '\n' f
let re = compile (B.pack ".*504.*ll") []
forM_ lines $ \l -> maybe (return ()) print $ match re l []
Here is what I get out of that :
$ ghc -O2 test-PCRELight.hs -XScopedTypeVariables
[1 of 1] Compiling Main ( test-PCRELight.hs, test-PCRELight.o )
Linking test-PCRELight ...
$ time ./test-PCRELight | wc -l
309
real 0m0.832s
user 0m0.803s
sys 0m0.027s
I think this is decent enough for my purposes. I might try to see what happens with the other libs when I manually do the line splitting like I did here, although I doubt it's going to make a big difference.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With