Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

breadth-first traversal of directory tree is not lazy

Tags:

haskell

I try to traverse the directory tree. A naive depth-first traversal seems not to produce the data in a lazy fashion and runs out of memory. I next tried a breadth first approach, which shows the same problem - it uses all the memory available and then crashes.

the code I have is:

getFilePathBreadtFirst :: FilePath -> IO [FilePath]
getFilePathBreadtFirst fp = do
  fileinfo <- getInfo fp 
  res :: [FilePath]  <- if isReadableDirectory fileinfo
          then do
                children  <- getChildren fp 
                lower    <-  mapM getFilePathBreadtFirst children  
                return (children ++  concat lower)
           else return [fp]        -- should only return the files? 
  return res 

getChildren :: FilePath -> IO [FilePath]
getChildren path = do 
          names <- getUsefulContents path
          let namesfull = map (path </>) names
          return namesfull

testBF fn = do  -- crashes for /home/frank, does not go to swap 
  fps <- getFilePathBreadtFirst fn
  putStrLn $ unlines fps

I think all the code is either linear or tail recursive, and I would expect that the listing of filenames starts immediately, but in fact it does not. Where is the error in my code and my thinking? Where have I lost lazy evaluation?

like image 202
user855443 Avatar asked Sep 26 '12 21:09

user855443


1 Answers

I will use three separate tricks to solve your question.

  • Trick 1: Use the pipes library to stream file names concurrent with traversing the tree.
  • Trick 2: Use the StateT (Seq FilePath) transformer to achieve a breadth-first traversal.
  • Trick 3: Use the MaybeT transformer to avoid manual recursion when writing the loop and exit.

The following code combines these three tricks in one monad transformer stack.

import Control.Monad
import Control.Monad.Trans
import Control.Monad.Trans.Maybe
import Control.Monad.State.Lazy
import Control.Pipe
import Data.Sequence
import System.FilePath.Posix
import System.Directory

loop :: (Monad m) => MaybeT m a -> m ()
loop = liftM (maybe () id) . runMaybeT . forever

quit :: (Monad m) => MaybeT m a
quit = mzero

getUsefulContents :: FilePath -> IO [FilePath]
getUsefulContents path
  = fmap (filter (`notElem` [".", ".."])) $ getDirectoryContents path

permissible :: FilePath -> IO Bool
permissible file
  = fmap (\p -> readable p && searchable p) $ getPermissions file

traverseTree :: FilePath -> Producer FilePath IO ()
traverseTree path = (`evalStateT` empty) $ loop $ do
    -- All code past this point uses the following monad transformer stack:
    -- MaybeT (StateT (Seq FilePath) (Producer FilePath IO)) ()
    let liftState = lift
        liftPipe  = lift . lift
        liftIO    = lift . lift . lift
    liftState $ modify (|> path)
    forever $ do
        x <- liftState $ gets viewl
        case x of
            EmptyL    -> quit
            file :< s -> do
                liftState $ put s
                liftPipe $ yield file
                p <- liftIO $ doesDirectoryExist file
                when p $ do
                    names <- liftIO $ getUsefulContents file
                    -- allowedNames <- filterM permissible names
                    let namesfull = map (path </>) names
                    liftState $ forM_ namesfull $ \name -> modify (|> name)

This creates a generator of breadth-first file names that can be consumed concurrent with the tree traversal. You consume the values using:

printer :: (Show a) => Consumer a IO r
printer = forever $ do
    a <- await
    lift $ print a

>>> runPipe $ printer <+< traverseTree path
<Prints file names as it traverses the tree>

You can even choose to not demand all the values:

-- Demand only 'n' elements
take' :: (Monad m) => Int -> Pipe a a m ()
take' n = replicateM_ n $ do
    a <- await
    yield a

>> runPipe $ printer <+< take' 3 <+< traverseTree path
<Prints only three files>

More importantly, that last example will only traverse the tree as much as necessary to generate the three files and then it will stop. This prevents wastefully traversing the entire tree when all you wanted was 3 results!

To learn more about the pipes library trick, consult the pipes tutorial at Control.Pipes.Tutorial.

To learn more about the loop trick, read this blog post.

I couldn't find a good link for the queue trick for breadth first traversal, but I know it's out there somewhere. If somebody else knows a good link for this, just edit my answer to add it.

like image 139
Gabriella Gonzalez Avatar answered Oct 20 '22 18:10

Gabriella Gonzalez