I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).
This was my code:
{-# LANGUAGE OverloadedStrings #-}
import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid
text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x

--parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|> ATL.char '\t'
    return (x,y)

--line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of
        True  -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
  where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do
    x <- ATL.many1 line
    return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text
It works for small amounts of input, but quickly becomes far too inefficient. I'm not clear on where the problem is, and I soon realized that even trivial tasks, like viewing the last character of the file, were taking too long when I tried them, e.g. with
foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv"
So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?
Thanks,
Reuben
Note that the file you want to load has 6 million lines and the text you are interested in storing comprises approx. 120 MB.
To establish some lower bounds, I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how long it took this Perl program to read that file:
my %H;
while (<>) {
    chomp;
    my ($a,$b) = split("\t", $_, 2);
    $H{$a} = $b;
}
This took approx. 17 secs., so I would expect any Haskell program to take about that amount of time.
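For reference, a rough Haskell counterpart of that Perl loop might look like the following. This is only a sketch: it assumes the same preprocessed two-column file, and the file name is purely illustrative.

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- Read the preprocessed two-column file strictly and build a Map
-- from the first tab-separated field to the rest of the line.
main :: IO ()
main = do
    contents <- B.readFile "etymwn-preprocessed.tsv"   -- illustrative file name
    let toEntry l = let (k, v) = B.break (== '\t') l
                    in (k, B.drop 1 v)
        m = M.fromList . map toEntry . B.lines $ contents
    print (M.size m)   -- force the map so the whole load is actually performed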
If this start-up time is unacceptable, consider the following options:
Option 1 is discussed in this blog post by Chris Done:
Options 2 and 3 will require you to work in the IO monad.
First of all, check the type of your tsv function:

tsv :: Data.Attoparsec.Internal.Types.Parser
         DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]
You are returning a list of maps instead of just one map. This doesn't look right.
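A minimal fix for that, keeping your other parsers as they are, is to concatenate the per-line results and build a single map. This is only a sketch, and it still parses the entire input before returning (which the next point addresses):

tsv :: ATL.Parser (M.Map (DT.Text, DT.Text) (DT.Text, DT.Text))
tsv = M.fromList . concat <$> ATL.many1 line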
Secondly, as @chi suggested, I doubt that the use of attoparsec here is lazy. In particular, it has to verify that the entire parse succeeds, so I can't see how it can avoid creating all of the parsed lines before returning.
To truly parse the input lazily, take the following approach:
toPair :: DTL.Text -> (Key, Value)
toPair input = ...

main = do
    all_lines <- fmap DTL.lines DTLIO.getContents
    let m = M.fromList $ map toPair all_lines
    print $ M.lookup "foobar" m
You can still use attoparsec to implement toPair, but you'll be using it on a line-by-line basis instead of on the entire input.
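For instance, here is a sketch of such a line-by-line toPair using parseOnly. The parser body is deliberately simplified (it just splits at the first tab); a real one would mirror the field logic of the ByteString parser shown further below.

import qualified Data.Attoparsec.Text as AT
import qualified Data.Text as DT
import qualified Data.Text.Lazy as DTL

-- Run an attoparsec parser over one (lazy) line at a time.
toPair :: DTL.Text -> (DT.Text, DT.Text)
toPair l =
    case AT.parseOnly parseLine (DTL.toStrict l) of
      Left err   -> error ("bad line: " ++ err)
      Right pair -> pair
  where
    parseLine = do
        a <- AT.takeTill (== '\t')
        _ <- AT.char '\t'
        b <- AT.takeText     -- simplified: the real format has three fields
        return (a, b)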
In my experience, working with ByteStrings is much faster than working with Text. This version of toPair for ByteStrings is about 4 times faster than the corresponding version for Text:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

toPair :: L.ByteString -> (L.ByteString, L.ByteString)
toPair bs =
    case AL.maybeResult (AL.parse parseLine bs) of
      Nothing    -> error "bad line"
      -- A.takeWhile yields strict ByteStrings, so convert back to lazy
      -- to match the signature
      Just (a,b) -> (L.fromStrict a, L.fromStrict b)
  where
    parseLine = do
        A.skipWhile (/= ' ')
        A.skipWhile (== ' ')
        a <- A.takeWhile (/= '\t')
        A.skipWhile (== '\t')
        rel <- A.takeWhile (/= '\t')
        A.skipWhile (== '\t')
        A.skipWhile (/= ' ')
        A.skipWhile (== ' ')
        c <- A.takeWhile (const True)
        if rel == "rel:etymological_origin_of"
            then return (c,a)
            else return (a,c)
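To wire this up, something along these lines should work; the file path and lookup key here are purely illustrative:

import qualified Data.Map as M

main :: IO ()
main = do
    contents <- L.readFile "etymwn.tsv"            -- illustrative path
    let m = M.fromList . map toPair . L.lines $ contents
    print (M.size m)                               -- forces the map to be built
    print (M.lookup (L.pack "dream") m)            -- illustrative key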
Or, just use plain ByteString functions:
fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace :: L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/= ' ')

toPair'' :: L.ByteString -> (L.ByteString, L.ByteString)
toPair'' bs =
    case fields bs of
      (x:y:z:_) -> let a = snipSpace x
                       c = snipSpace z
                   in if y == "rel:etymological_origin_of"
                        then (c,a)
                        else (a,c)
      _         -> error "bad line"
Most of the time spent loading the map is in parsing the lines. For ByteStrings this is about 14 sec. to load all 6 million lines vs. 50 secs. for Text.
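If you want to reproduce such measurements, make sure the map is forced before you stop the clock. Here is a rough sketch of a timing harness; the name timeLoad is mine, and it reuses the qualified imports from above:

import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- Time an IO action that builds the map, forcing the map's spine
-- (M.size traverses the whole tree) before taking the second timestamp.
timeLoad :: IO (M.Map L.ByteString L.ByteString) -> IO ()
timeLoad load = do
    t0 <- getCPUTime
    m  <- load
    _  <- evaluate (M.size m)
    t1 <- getCPUTime
    putStrLn $ "load time: "
            ++ show (fromIntegral (t1 - t0) / 1e12 :: Double)
            ++ " sec."

It can then be invoked as, e.g., timeLoad (M.fromList . map toPair . L.lines <$> L.readFile path).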