I am trying to do a recursive descent of a directory structure using Haskell. I would like to only retrieve the child directories and files as needed (lazily).
I wrote the following code, but when I run it, the trace shows that all directories are visited before the first file:
module Main where
import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )
-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
names <- getDirectoryContents topPath
let
properNames =
filter (`notElem` [".", ".."]) $
trace ("Processing " ++ topPath) names
paths <- forM properNames $ \name -> do
let path = topPath </> name
isDirectory <- doesDirectoryExist path
if isDirectory
then getRecursiveContents path
else return [path]
return (concat paths)
main :: IO ()
main = do
[path] <- getArgs
files <- getRecursiveContents path
forM_ files $ \file -> putStrLn $ "Found file " ++ file
How can I interleave the file processing with the descent? Is the problem that the files <- getRecursiveContents path
action gets performed before the following forM_
in main
?
This is exactly the kind of problem that iteratees/coroutines were designed to solve.
You can easily do this with pipes
. The only change I made to your getRecursiveContents
was to make it a Producer
of FilePath
s and to respond
with the file name instead of returning it. This lets downstream handle the file name immediately instead of waiting for getRecursiveContents
complete.
module Main where
import Control.Monad ( forM_, liftM )
import Control.Proxy
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )
getRecursiveContents :: (Proxy p) => FilePath -> () -> Producer p FilePath IO ()
getRecursiveContents topPath () = runIdentityP $ do
names <- lift $ getDirectoryContents topPath
let properNames = filter (`notElem` [".", ".."]) names
forM_ properNames $ \name -> do
let path = topPath </> name
isDirectory <- lift $ doesDirectoryExist path
if isDirectory
then getRecursiveContents path ()
else respond path
main :: IO ()
main = do
[path] <- getArgs
runProxy $
getRecursiveContents path
>-> useD (\file -> putStrLn $ "Found file " ++ file)
This prints out each file immediately as it traverses the tree, and it does not require lazy IO
. It's also very easy to change what you do with the file names, since all you have to do is switch out the useD
stage with your actual file handling logic.
To learn more about pipes
, I highly recommend you read Control.Proxy.Tutorial.
Using lazy IO / unsafe...
is not a good way to go. Lazy IO causes many problems, including unclosed resources and executing impure actions within pure code. (See also The problem with lazy I/O on Haskell Wiki.)
A safe way is to use some iteratee/enumerator library. (Replacing problematic lazy IO was the motivation for developing these concepts.) Your getRecursiveContents
would become a source of data (AKA enumerator). And the data will be consumed by some iterator. (See also Enumerator and iteratee on Haskell wiki.)
There is a tutorial on the enumerator library that just gives an example of traversing and filtering directory tree, implementing a simple find utility. It implements method
enumDir :: FilePath -> Enumerator FilePath IO b
which is basically just what you need. I believe you will find it interesting.
Also there is a nice article explaining iteratees in The Monad Reader, Issue 16: Iteratee: Teaching an Old Fold New Tricks by John W. Lato, the author of the iteratee library.
Today many people prefer newer libraries such as pipes. You may be interested in a comparison: What are the pros and cons of Enumerators vs. Conduits vs. Pipes?.
Thanks to the comment by Niklas B., here is the solution that I have:
module Main where
import Control.Monad ( forM, forM_, liftM )
import Debug.Trace ( trace )
import System.Directory ( doesDirectoryExist, getDirectoryContents )
import System.Environment ( getArgs )
import System.FilePath ( (</>) )
import System.IO.Unsafe ( unsafeInterleaveIO )
-- From Real World Haskell, p. 214
getRecursiveContents :: FilePath -> IO [FilePath]
getRecursiveContents topPath = do
names <- unsafeInterleaveIO $ getDirectoryContents topPath
let
properNames =
filter (`notElem` [".", ".."]) $
trace ("Processing " ++ topPath) names
paths <- forM properNames $ \name -> do
let path = topPath </> name
isDirectory <- doesDirectoryExist path
if isDirectory
then unsafeInterleaveIO $ getRecursiveContents path
else return [path]
return (concat paths)
main :: IO ()
main = do
[path] <- getArgs
files <- unsafeInterleaveIO $ getRecursiveContents path
forM_ files $ \file -> putStrLn $ "Found file " ++ file
Is there a better way?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With