Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

efficiently reading a large file into a Map

I'm trying to write code to perform the following simple task in Haskell: looking up the etymologies of words using this dictionary, stored as a large tsv file (http://www1.icsi.berkeley.edu/~demelo/etymwn/). I thought I'd parse (with attoparsec) the tsv file into a Map, which I could then use to look up etymologies efficiently, as required (and do some other stuff with).

This was my code:

{-# LANGUAGE OverloadedStrings #-}

import Control.Arrow
import qualified Data.Map as M
import Control.Applicative
import qualified Data.Text as DT
import qualified Data.Text.Lazy.IO as DTLIO
import qualified Data.Text.Lazy as DTL
import qualified Data.Attoparsec.Text.Lazy as ATL
import Data.Monoid

text = do
    x <- DTLIO.readFile "../../../../etymwn.tsv"
    return $ DTL.take 10000 x

--parsers
wordpair = do
    x <- ATL.takeTill (== ':')
    ATL.char ':' *> (ATL.many' $ ATL.char ' ')
    y <- ATL.takeTill (\x -> x `elem` ['\t','\n'])
    ATL.char '\n' <|>   ATL.char '\t'
    return (x,y)

--line of file
line = do
    a <- (ATL.count 3 wordpair)
    case (rel (a !! 2)) of 
        True -> return . (\[a,b,c] -> [(a,c)]) $ a
        False -> return . (\[a,b,c] -> [(c,a)]) $ a
    where rel x = if x == ("rel","etymological_origin_of") then False else True

tsv = do 
    x <- ATL.many1 line
    return $ fmap M.fromList x

main = (putStrLn . show . ATL.parse tsv) =<< text

It works for small amounts of input, but quickly grows too inefficient. I'm not quite clear on where the problem is, and soon realized that even trivial tasks like viewing the last character of the file were taking too long when I tried, e.g. with

foo = fmap DTL.last $ DTLIO.readFile "../../../../etymwn.tsv

So my questions are: what are the main things that I'm doing wrong, in terms of approach and execution? Any tips for more Haskelly/better code?

Thanks,

Reuben

like image 725
Reuben Avatar asked Jul 29 '15 10:07

Reuben


People also ask

How can I read large files?

The best way to view extremely large text files is to use… a text editor. Not just any text editor, but the tools meant for writing code. Such apps can usually handle large files without a hitch and are free. Large Text File Viewer is probably the simplest of these applications.

Which of the following is a memory efficient way to read file in Java?

Reading in MemoryreadLines(new File(path), Charsets. UTF_8); FileUtils. readLines(new File(path));

How do I read a large file in Python?

Reading Large Text Files in Python We can use the file object as an iterator. The iterator will return each line one by one, which can be processed. This will not read the whole file into memory and it's suitable to read large files in Python.


1 Answers

Note that the file you want to load has 6 million lines and the text you are interested in storing comprises approx. 120 MB.

Lower Bounds

To establish some lower bounds I first created another .tsv file containing the preprocessed contents of the etymwn.tsv file. I then timed how it took for this perl program to read that file:

my %H;
while (<>) {
  chomp;
  my ($a,$b) = split("\t", $_, 2);
  $H{$a} = $b;
}

This took approx. 17 secs., so I would expect any Haskell program to take about that about of time.

If this start-up time is unacceptable, consider the following options:

  1. Work in ghci and use the "live reloading" technique to save the map using the Foreign.Store package so that it persists through ghci code reloads. That way you only have to load the map data once as you iterate your code.
  2. Use a persistent key-value store (such as sqlite, gdbm, BerkeleyDB)
  3. Access the data through a client-server store
  4. Reduce the number of key-value pairs you store (do you need all 6 million?)

Option 1 is discussed in this blog post by Chris Done:

  • Reload Running Code in GHCI

Options 2 and 3 will require you to work in the IO monad.

Parsing

First of all, check the type of your tsv function:

tsv :: Data.Attoparsec.Internal.Types.Parser
          DT.Text [M.Map (DT.Text, DT.Text) (DT.Text, DT.Text)]

You are returning a list of maps instead of just one map. This doesn't look right.

Secondly, as @chi suggested, I doubt that using attoparsec is lazy. In partcular, it has to verify that the entire parse succeeds, so I can't see how it cannot avoid creating all of the parsed lines before returning.

To truely parse the input lazily, take the following approach:

toPair :: DT.Text -> (Key, Value)
toPair input = ...

main = do
  all_lines <- fmap DTL.lines $ DTLIO.getContent
  let m = M.fromList $ map toPair all_lines
  print $ M.lookup "foobar" m

You can still use attoparsec to implement toPair, but you'll be using it on a line-by-line basis instead of on the entire input.

ByteString vs. Text

In my experience working with ByteStrings is much faster than working with Text.

This version of toPair for ByteStrings is about 4 times faster than the corresponding version for Text:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.Attoparsec.ByteString.Lazy as AL

toPair :: L.ByteString -> (L.ByteString, L.ByteString)
toPair bs =
  case AL.maybeResult (AL.parse parseLine bs) of
    Nothing    -> error "bad line"
    Just (a,b) -> (a,b)
  where parseLine = do
          A.skipWhile (/= ' ')
          A.skipWhile (== ' ')
          a <- A.takeWhile (/= '\t')
          A.skipWhile (== '\t')
          rel <- A.takeWhile (/= '\t')
          A.skipWhile (== '\t')
          A.skipWhile (/= ' ')
          A.skipWhile (== ' ')
          c <- A.takeWhile (const True)
          if rel == "rel:etymological_origin_of"
            then return (c,a)
            else return (a,c)

Or, just use plain ByteString functions:

fields :: L.ByteString -> [L.ByteString]
fields = L.splitWith (== '\t')

snipSpace = L.ByteString -> L.ByteString
snipSpace = L.dropWhile (== ' ') . L.dropWhile (/=' ')

toPair'' bs = 
  let fs = fields bs
  case fields line of
    (x:y:z:_) -> let a = snipSpace x
                     c = snipSpace z
                 in
                 if y == "rel:etymological_origin_of"
                   then (c,a)
                   else (a,c)
    _         -> error "bad line"

Most of the time spent loading the map is in parsing the lines. For ByteStrings this is about 14 sec. to load all 6 million lines vs. 50 secs. for Text.

like image 134
ErikR Avatar answered Sep 22 '22 12:09

ErikR