Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is my code using monadic lists from the List package so slow?

Last week user Masse asked a question about recursively listing files in a directory in Haskell. My first thought was to try using monadic lists from the List package to avoid building the entire list in memory before the printing can start. I implemented this as follows:

module Main where

import Prelude hiding (filter) 
import Control.Applicative ((<$>))
import Control.Monad (join)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.ListT (ListT)
import Data.List.Class (cons, execute, filter, fromList, mapL)
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = execute . mapL putStrLn . listFiles =<< head <$> getArgs

listFiles :: FilePath -> ListT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <$> (path </>)
     <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))

This works beautifully in that it starts printing immediately and uses very little memory. Unfortunately it's also dozens of times slower than a comparable FilePath -> IO [FilePath] version.

What am I doing wrong? I've never used the List package's ListT outside of toy examples like this, so I don't know what kind of performance to expect, but 30 seconds (vs. a fraction of a second) to process a directory with ~40,000 files seems much too slow.

like image 781
Travis Brown Avatar asked Oct 12 '10 15:10

Travis Brown


2 Answers

Profiling shows that join (together with doesDirectoryExists) accounts for most of the time in your code. Lets see how its definition unfolds:

  join x
=> (definition of join in Control.Monad)
  x >>= id
=> (definition of >>= in Control.Monad.ListT)
  foldrL' mappend mempty (fmap id x)
=> (fmap id = id)
  foldrL' mappend mempty x

If in the root directory of the search there are k subdirectories and their contents are already computed in the lists: d1, d2, ... dk, then after applying join you'll get (roughly): (...(([] ++ d1) ++ d2) ... ++ dk). Since x ++ y takes time O(length x) the whole thing will take time O(d1 + (d1 + d2) + ... + (d1 + ... dk-1)). If we assume that the number of files is n and they are evenly distributed between d1 ... dk then the time to compute join would be O(n*k) and that is only for the first level of listFiles.

This, I think, is the main performance problem with your solution.

like image 130
Daniel Avatar answered Oct 19 '22 17:10

Daniel


I'm curious, how well does the same program written to use logict work for you? LogicT is semantically the same as ListT, but implemented in continuation-passing style so that it shouldn't have the concat-related type of problems you seem to be running into.

import Prelude hiding (filter)
import Control.Applicative
import Control.Monad
import Control.Monad.Logic
import System (getArgs)
import System.Directory (getDirectoryContents, doesDirectoryExist)
import System.FilePath ((</>))

main = sequence_ =<< observeAllT . fmap putStrLn . listFiles =<< head <$> getArgs

cons :: MonadPlus m => a -> m a -> m a
cons x xs = return x `mplus` xs

fromList :: MonadPlus m => [a] -> m a
fromList = foldr cons mzero

filter :: MonadPlus m => (a -> Bool) -> m a -> m a
filter f xs = do
  x <- xs
  guard $ f x
  return x

listFiles :: FilePath -> LogicT IO FilePath
listFiles path = liftIO (doesDirectoryExist path) >>= listIfDir
  where
    valid "."  = False
    valid ".." = False
    valid _ = True
    listIfDir False = return path
    listIfDir True
      =  cons path
      $  join
      $  listFiles
     <$> (path </>)
     <$> (filter valid =<< fromList <$> liftIO (getDirectoryContents path))
like image 2
Reid Barton Avatar answered Oct 19 '22 17:10

Reid Barton