Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where
import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = execute . mapL putStrLn . listFiles =<< head <$> getArgs
listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.
What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
Profiling shows that join (together with doesDirectoryExists) accounts for most of the time in your code. Lets see how its definition unfolds:
join x
=> (definition of join in Control.Monad)
x >>= id
=> (definition of >>= in Control.Monad.ListT)
foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
foldrL' mappend mempty x
If in the root directory of the search there are k subdirectories and their contents are already computed in the lists: d1, d2, ... dk, then after applying join you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk). Since x ++ y takes time O(length x) the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1)). If we assume that the number of files is n and they are evenly distributed between d1 ... dk then the time to compute join would be O(n*k) and that is only for the first level of listFiles.
This, I think, is the main performance problem with your solution.
I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style so that it shouldn't have the concat-related type of problems you seem to be running into.
import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <$> getArgs
cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs
fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero
filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
x <- xs
guard $ f x
return x
listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With