Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I read n files lazily as a single IO operation in Haskell?

How can I read multiple files as a single ByteString lazily with constant memory?

readFiles :: [FilePath] -> IO ByteString

I currently have the following implementation but from what I have seen from profiling as well as my understanding I will end with n-1 of the files in memory.

readFiles = foldl1 joinIOStrings . map ByteString.readFile
    where joinIOStrings ml mr = do
                                l <- ml
                                r <- mr
                                return $ l `ByteString.append` r

I understand that the flaw here is that I am applying the IO actions then rewrapping them so what I think I need is a way to replace the foldl1 joinIOStrings without applying them.

like image 467
Ben Doerr Avatar asked Oct 01 '12 13:10

Ben Doerr


2 Answers

How can I read multiple files as a single ByteString lazily with constant memory?

If you want constant memory usage, you need Data.ByteString.Lazy. A strict ByteString cannot be read lazily, and would require O(sum of filesizes) memory.

For a not too large number of files, simply reading them all (D.B.L.readFile reads lazily) and concatenating the results is good,

import qualified Data.ByteString.Lazy as L

readFiles :: [FilePath] -> IO L.ByteString
readFiles = fmap L.concat . mapM L.readFile

The mapM L.readFile will open the files, but only read the contents of each file when it is demanded.

If the number of files is large, so that the limit of open file handles allowed by the OS for a single process could be exhausted, you need something more complicated. You can cook up your own lazy version of mapM,

import System.IO.Unsafe (unsafeInterleaveIO)

mapM_lazy :: [IO a] -> IO [a]
mapM_lazy [] = return []
mapM_lazy (x:xs) = do
              r <- x
              rs <- unsafeInterleaveIO (mapM_lazy xs)
              return (r:rs)

so that each file will only be opened when its contents are needed, when previously read files can already be closed. There's a slight possibility that that still runs into resource limits, since the time of closing the handles is not guaranteed.

Or you can use your favourite iteratee, enumerator, conduit or whatever package that solves the problem in a systematic way. Each of them has its own advantages and disadvantages with respect to the others and, if coded correctly, eliminates the possibility of accidentally hitting the resource limit.

like image 58
Daniel Fischer Avatar answered Nov 06 '22 11:11

Daniel Fischer


I assume that you are using lazy byte strings (from Data.ByteString.Lazy). There are probably other ways to do this, but one option is to simply use concat :: [ByteString] -> ByteString:

import Control.Monad
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString

readFiles :: [FilePath] -> IO ByteString
readFiles = fmap ByteString.concat . mapM ByteString.readFile

(Note: I don't have time to test the code, but reading the documentation says that this should work)

like image 35
dflemstr Avatar answered Nov 06 '22 10:11

dflemstr