I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep
. All is fine and dandy except if the current directory has a very large number of files, in which case I get an error like:
<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)
It seems Haskell's laziness is causing it not to close files, even after its corresponding line of output has been completed.
The relevant code is below. The function of interest is getList
.
import qualified Data.ByteString.Lazy as BS
main :: IO ()
main = putStr . unlines =<< getList "."
getList :: FilePath -> IO [String]
getList p =
let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
in mapM getFileLine =<< getRecursiveContents p
hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))
getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory.
Are there any ideas on how I could solve this problem?
The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb
Edit: I have plenty of files that don't fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.
You don't need to use any special way of doing IO, you just need to change the order in which you do things. So instead of opening all files and then processing the content, you open one file and print one line of output at a time.
import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS
main :: IO ()
main = mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)
=<< getRecursiveContents "."
fileLine :: FilePath -> BS.ByteString -> String
fileLine path c = hash c ++ " " ++ path
hash :: BS.ByteString -> String
hash = show . md5
BTW, I happen to be using a different md5 hash lib, the difference is not significant.
The main thing that is going on here is the line:
mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)
It's opening a single file, it's consuming the whole content of the file and printing one line of output. It closes the file because it's consuming the whole content of the file. Previously you were delaying when the file was consumed which delayed when the file was closed.
If you are not quite sure if you are consuming all the input but want to make sure the file gets closed anyway, then you can use the withFile
function from System.IO
:
mapM_ (\path -> withFile path ReadMode $ \hnd -> do
c <- BS.hGetContents hnd
putStrLn (fileLine path c))
The withFile
function opens the file and passes the file handle to the body function. It guarantees that the file gets closed when the body returns. This "withBlah" pattern is very common when dealing with expensive resources. This resource pattern is directly supported by System.Exception.bracket
.
Lazy IO is very bug-prone.
As dons suggested, you should use strict IO.
You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)
hashFile :: FilePath -> IO BS.ByteString
hashFile =
fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024
strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
takeWhile (not . BS.null) $ do
handle <- liftIO $ openFile filename ReadMode
repeat () -- this makes the lines below loop
chunk <- liftIO $ BS.hGet handle chunkSize
when (BS.null chunk) . liftIO $ hClose handle
return chunk
I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.
Monadic lists/ListT
come from the "List" package on hackage (transformers' and mtl's ListT
are broken and also don't come with useful functions like takeWhile
)
NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.
Here's my quick attempt at an Iteratee
-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory.
For some reason it still seems to run out of memory on very large files—I don't really understand Iteratee
well enough yet to be able to tell why easily.
module Main where
import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8)
import System.Directory
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS
import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW
evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext
iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
where
chunk s@(I.EOF Nothing) =
get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
chunk (I.Chunk c) = do
modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
return $ I.Cont (I.IterateeG chunk) Nothing
fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path
main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path)
=<< getRecursiveContents "."
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
names <- getDirectoryContents topdir
let properNames = filter (`notElem` [".", ".."]) names
paths <- concatForM properNames $ \name -> do
let path = topdir </> name
isDirectory <- doesDirectoryExist path
if isDirectory
then getRecursiveContents path
else do
isFile <- doesFileExist path
if isFile
then return [path]
else return []
return (sort paths)
concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)
Note that you'll need the iteratee
package and TomMD's pureMD5
. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)
Edit: my assumption was that the user was opening thousands of very small files, it turns out they are very large. Laziness will be essential.
Well, you'll need to use a different IO mechanism. Either:
I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.
For example, you can replace your lazy IO with System.IO.Strict, yielding:
import qualified System.IO.Strict as S
getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
where
getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
(S.readFile path)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With