Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List
package to avoid building the entire list in memory before the printing can start. I implemented this as follows:
module Main where
import Prelude hiding (filter)
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = execute . mapL putStrLn . listFiles =<< head <$> getArgs
listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath]
version.
What am I doing wrong? I've never used the List
package's ListT
outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.
Profiling shows that join
(together with doesDirectoryExists
) accounts for most of the time in your code. Lets see how its definition unfolds:
join x
=> (definition of join in Control.Monad)
x >>= id
=> (definition of >>= in Control.Monad.ListT)
foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
foldrL' mappend mempty x
If in the root directory of the search there are k
subdirectories and their contents are already computed in the lists: d1, d2, ... dk
, then after applying join
you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk)
. Since x ++ y
takes time O(length x)
the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1))
. If we assume that the number of files is n
and they are evenly distributed between d1 ... dk
then the time to compute join
would be O(n*k)
and that is only for the first level of listFiles
.
This, I think, is the main performance problem with your solution.
I'm curious, how well does the same program written to use logict work for you? LogicT
is semantically the same as ListT
, but implemented in continuation-passing style so that it shouldn't have the concat
-related type of problems you seem to be running into.
import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))
main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <$> getArgs
cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs
fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero
filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
x <- xs
guard $ f x
return x
listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
where
valid "." = False
valid ".." = False
valid _ = True
listIfDir False = return path
listIfDir True
= cons path
$ join
$ listFiles
<$> (path </>)
<$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With