
Haskell lazy I/O and closing files

I've written a small Haskell program to print the MD5 checksums of all files in the current directory (searched recursively). Basically a Haskell version of md5deep. All is fine and dandy except if the current directory has a very large number of files, in which case I get an error like:

<program>: <currentFile>: openBinaryFile: resource exhausted (Too many open files)

It seems Haskell's laziness is causing it not to close files, even after the corresponding line of output has been printed.

The relevant code is below. The function of interest is getList.

import Control.Monad (liftM)
import Data.Word (Word8)
import Text.Printf (printf)
import Data.Digest.MD5 (hash) -- assuming 'hash' comes from the Crypto package
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = putStr . unlines =<< getList "."

getList :: FilePath -> IO [String]
getList p =
    let getFileLine path = liftM (\c -> (hex $ hash $ BS.unpack c) ++ " " ++ path) (BS.readFile path)
    in mapM getFileLine =<< getRecursiveContents p

hex :: [Word8] -> String
hex = concatMap (\x -> printf "%0.2x" (toInteger x))

getRecursiveContents :: FilePath -> IO [FilePath]
-- ^ Just gets the paths to all the files in the given directory.

Are there any ideas on how I could solve this problem?

The entire program is available here: http://haskell.pastebin.com/PAZm0Dcb

Edit: I have plenty of files that don't fit into RAM, so I am not looking for a solution that reads the entire file into memory at once.

asked Jun 05 '10 by Jesse


4 Answers

You don't need to use any special way of doing I/O; you just need to change the order in which you do things. Instead of opening all the files and then processing their contents, open one file and print one line of output at a time.

import Data.Digest.Pure.MD5 (md5)
import qualified Data.ByteString.Lazy as BS

main :: IO ()
main = mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path) 
   =<< getRecursiveContents "."

fileLine :: FilePath -> BS.ByteString -> String
fileLine path c = hash c ++ " " ++ path

hash :: BS.ByteString -> String 
hash = show . md5

BTW, I happen to be using a different MD5 hash library; the difference is not significant.

The main thing that is going on here is the line:

mapM_ (\path -> putStrLn . fileLine path =<< BS.readFile path)

It opens a single file, consumes the whole content of that file, and prints one line of output. Because the whole content is consumed, the file gets closed. Previously you were delaying the point at which the file's content was consumed, which delayed the point at which the file was closed.

If you are not certain that you are consuming all the input but want to make sure the file gets closed anyway, you can use the withFile function from System.IO:

mapM_ (\path -> withFile path ReadMode $ \hnd -> do
                  c <- BS.hGetContents hnd
                  putStrLn (fileLine path c))

The withFile function opens the file and passes the file handle to the body function. It guarantees that the file gets closed when the body returns. This "withBlah" pattern is very common when dealing with expensive resources, and it is directly supported by Control.Exception.bracket.
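
To make the connection concrete, here is a minimal sketch of how a withFile-style wrapper can be built from bracket (withFile' is a hypothetical name for illustration; the real withFile already lives in System.IO):

import Control.Exception (bracket)
import System.IO (Handle, IOMode, openFile, hClose)

-- acquire the handle, run the body, and close the handle even if
-- the body throws an exception
withFile' :: FilePath -> IOMode -> (Handle -> IO r) -> IO r
withFile' path mode = bracket (openFile path mode) hClose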

answered Nov 05 '22 by Duncan Coutts


Lazy IO is very bug-prone.

As dons suggested, you should use strict IO.

You can use a tool such as Iteratee to help you structure strict IO code. My favorite tool for this job is monadic lists.

import Control.Monad (when)
import Control.Monad.ListT (ListT) -- List
import Control.Monad.IO.Class (liftIO) -- transformers
import Data.Binary (encode) -- binary
import Data.Digest.Pure.MD5 -- pureMD5
import Data.List.Class (repeat, takeWhile, foldlL) -- List
import System.IO (IOMode(ReadMode), openFile, hClose)
import qualified Data.ByteString.Lazy as BS
import Prelude hiding (repeat, takeWhile)

hashFile :: FilePath -> IO BS.ByteString
hashFile =
    fmap (encode . md5Finalize) . foldlL md5Update md5InitialContext . strictReadFileChunks 1024

strictReadFileChunks :: Int -> FilePath -> ListT IO BS.ByteString
strictReadFileChunks chunkSize filename =
    takeWhile (not . BS.null) $ do
        handle <- liftIO $ openFile filename ReadMode
        repeat () -- this makes the lines below loop
        chunk <- liftIO $ BS.hGet handle chunkSize
        when (BS.null chunk) . liftIO $ hClose handle
        return chunk

I used the "pureMD5" package here because "Crypto" doesn't seem to offer a "streaming" md5 implementation.

Monadic lists/ListT come from the "List" package on Hackage (the ListT in transformers and mtl is broken, and those libraries also don't come with useful functions like takeWhile).
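
For completeness, here is a sketch of how hashFile could slot into the original program, reusing hex and getRecursiveContents from the question (hashFile returns the raw 16-byte digest, so it is unpacked before formatting):

main :: IO ()
main = mapM_ printHash =<< getRecursiveContents "."
  where
    -- hashFile yields the raw digest bytes; hex is the
    -- formatting function from the question
    printHash path = do
        digest <- hashFile path
        putStrLn $ hex (BS.unpack digest) ++ " " ++ path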

answered Nov 05 '22 by yairchu


NOTE: I've edited my code slightly to reflect the advice in Duncan Coutts's answer. Even after this edit his answer is obviously much better than mine, and doesn't seem to run out of memory in the same way.


Here's my quick attempt at an Iteratee-based version. When I run it on a directory with about 2,000 small (30-80K) files it's about 30 times faster than your version here and seems to use a bit less memory.

For some reason it still seems to run out of memory on very large files—I don't really understand Iteratee well enough yet to be able to tell why easily.

module Main where

import Control.Monad.State
import Data.Digest.Pure.MD5
import Data.List (sort)
import Data.Word (Word8) 
import System.Directory 
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BS

import qualified Data.Iteratee as I
import qualified Data.Iteratee.WrappedByteString as IW

evalIteratee path = evalStateT (I.fileDriver iteratee path) md5InitialContext

iteratee :: I.IterateeG IW.WrappedByteString Word8 (StateT MD5Context IO) MD5Digest
iteratee = I.IterateeG chunk
  where
    -- at end of input, finalize the accumulated MD5 context
    chunk s@(I.EOF Nothing) =
      get >>= \ctx -> return $ I.Done (md5Finalize ctx) s
    -- feed each chunk into the running MD5 context and continue
    chunk (I.Chunk c) = do
      modify $ \ctx -> md5Update ctx $ BS.fromChunks $ (:[]) $ IW.unWrap c
      return $ I.Cont (I.IterateeG chunk) Nothing

fileLine :: FilePath -> MD5Digest -> String
fileLine path c = show c ++ " " ++ path

main = mapM_ (\path -> putStrLn . fileLine path =<< evalIteratee path) 
   =<< getRecursiveContents "."

getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topdir = do
  names <- getDirectoryContents topdir

  let properNames = filter (`notElem` [".", ".."]) names

  paths <- concatForM properNames $ \name -> do
    let path = topdir </> name

    isDirectory <- doesDirectoryExist path
    if isDirectory
      then getRecursiveContents path
      else do
        isFile <- doesFileExist path
        if isFile
          then return [path]
          else return []

  return (sort paths)

concatForM :: (Monad m) => [a1] -> (a1 -> m [a]) -> m [a]
concatForM xs f = liftM concat (forM xs f)

Note that you'll need the iteratee package and TomMD's pureMD5. (And my apologies if I've done something horrifying here—I'm a beginner with this stuff.)

answered Nov 05 '22 by Travis Brown


Edit: my assumption was that the user was opening thousands of very small files; it turns out they are very large, so laziness will be essential.

Well, you'll need to use a different IO mechanism. Either:

  • Strict IO (process the files with Data.ByteString or System.IO.Strict), or
  • Iteratee IO (for experts only at the moment).

I'd also strongly recommend not using 'unpack', as that destroys the benefit of using bytestrings.

For example, you can replace your lazy IO with System.IO.Strict, yielding:

import qualified System.IO.Strict as S

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where
        getFileLine path = liftM (\c -> (hex (hash c)) ++ " " ++ path)
                                 (S.readFile path)
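
For the strict Data.ByteString option from the list above, a minimal sketch reusing hash and hex from the question (note that this reads each file whole, so it only helps when every individual file fits in memory, which the question's edit rules out):

import qualified Data.ByteString as B

getList :: FilePath -> IO [String]
getList p = mapM getFileLine =<< getRecursiveContents p
    where
        -- B.readFile is strict: it reads the entire file and closes
        -- the handle before returning
        getFileLine path = liftM (\c -> hex (hash (B.unpack c)) ++ " " ++ path)
                                 (B.readFile path)
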
answered Nov 05 '22 by Don Stewart