Setting
I need to traverse a directory tree containing 100+ .txt files, open each one, apply some function to it, and then combine the results. The files are huge, on the order of 10GB each. Some common operations in pseudocode might be:
foldr concatFile mempty $ openFile <$> [filePath1, ..., filePathn]
foldr countStuff 0 $ openFile <$> [filePath1, ..., filePathn]
The trick is to make sure the files never all exist in memory at the same time; my previous naive solution created all kinds of swap files on my Mac. In addition, if one of the file paths is invalid, I'd like to just skip over it and continue with the program.
My Solution
Currently I'm using conduit and would like to find a solution using conduit if possible. But if it's not the right tool, I'm fine with using something else.
You can nest conduit execution like this:
{-# LANGUAGE OverloadedStrings #-}
import Conduit
import qualified Data.ByteString as BS
-- Process a single file: stream its contents in chunks, sum the chunk
-- lengths (i.e. count the bytes), and print the total.
processFile :: FilePath -> IO ()
processFile path = runResourceT (sourceFile path =$= mapC BS.length $$ sumC) >>= print

-- Run processFile on every file in a directory tree.
doit :: FilePath -> IO ()
doit top = runResourceT $ sourceDirectoryDeep False top $$ mapM_C (liftIO . processFile)
Replace processFile with whatever you want to do -- including ignoring the file. My understanding is that the sourceFile Producer will efficiently chunk the contents of a file.
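To handle the "skip invalid file paths" requirement from the question, one option is to catch IO exceptions around the per-file work and move on. A minimal sketch, reusing the processFile above; safeProcessFile is a name I made up for illustration, not part of the conduit API:

import Control.Exception (IOException, try)

-- Try to process one file; if it throws an IOException (missing file,
-- permission error, ...), report it and keep going.
safeProcessFile :: FilePath -> IO ()
safeProcessFile path = do
  result <- try (processFile path) :: IO (Either IOException ())
  case result of
    Left err -> putStrLn ("Skipping " ++ path ++ ": " ++ show err)
    Right () -> return ()

In doit you would then use mapM_C (liftIO . safeProcessFile) in place of mapM_C (liftIO . processFile).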
And, according to this Yesod article, sourceDirectoryDeep should efficiently traverse a directory structure. The thing you apparently can't do with sourceDirectoryDeep is prune directories.
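You can, however, filter the paths the source yields before they reach the sink, e.g. to keep only the .txt files mentioned in the question. A minimal sketch (doitTxt is my own name; note this filters files after traversal rather than pruning directories):

import System.FilePath (takeExtension)

-- Like doit, but only hand .txt files to processFile.
doitTxt :: FilePath -> IO ()
doitTxt top = runResourceT $
  sourceDirectoryDeep False top =$= filterC ((== ".txt") . takeExtension)
    $$ mapM_C (liftIO . processFile)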