I have a few situations where I need to list files recursively, but my implementations have been slow. I have a directory structure with 92784 files. find lists the files in less than 0.5 seconds, but my Haskell implementation is a lot slower.

My first implementation took a bit over 9 seconds to complete, the next version a bit over 5 seconds, and I'm currently down to a bit less than two seconds.
import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

listFilesR :: FilePath -> IO [FilePath]
listFilesR path = let
    -- skip the "." and ".." entries
    isDODD "." = False
    isDODD ".." = False
    isDODD _ = True
  in do
    allfiles <- getDirectoryContents path
    dirs <- forM allfiles $ \d ->
      if isDODD d
        then do
          let p = path </> d
          isDir <- doesDirectoryExist p
          if isDir then listFilesR p else return [d]
        else return []
    return $ concat dirs
The test takes about 100 megabytes of memory (+RTS -s), and the program spends around 40% of its time in GC.
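For reference, a small driver along these lines (hypothetical, not part of my original code, and assuming listFilesR is in the same module) is how such numbers can be collected: compile with ghc -O2 -rtsopts and run the binary with +RTS -s.

import System.Environment (getArgs)

main :: IO ()
main = do
  [root] <- getArgs        -- the directory tree to list
  files  <- listFilesR root
  print (length files)     -- force the whole result list before the RTS stats are printed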
I was thinking of doing the listing in a WriterT monad with Sequence as the monoid to avoid the concats and intermediate list creation. Is it likely that this would help? What else should I do?
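Roughly, the WriterT/Sequence variant I have in mind would look something like this (untested sketch; listFilesW is just a placeholder name):

import Control.Monad (forM_, unless)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Writer (WriterT, execWriterT, tell)
import Data.Foldable (toList)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- Untested sketch: accumulate paths in a Seq via WriterT instead of
-- building and concatenating many small lists.
listFilesW :: FilePath -> IO [FilePath]
listFilesW root = toList <$> execWriterT (go root)
  where
    go :: FilePath -> WriterT (Seq FilePath) IO ()
    go path = do
      entries <- lift (getDirectoryContents path)
      forM_ entries $ \d ->
        unless (d == "." || d == "..") $ do
          let p = path </> d
          isDir <- lift (doesDirectoryExist p)
          if isDir then go p else tell (Seq.singleton p)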
Edit: I have changed the function to use readDirStream, and it helps keep memory usage down. There's still some allocation happening, but the productivity rate is above 95% now and it runs in less than a second.
This is the current version:
import System.Directory (doesDirectoryExist)
import System.FilePath ((</>))
import System.Posix.Directory (closeDirStream, openDirStream, readDirStream)

list :: FilePath -> IO ()
list path = do
    de <- openDirStream path
    readDirStream de >>= go de
    closeDirStream de
  where
    -- readDirStream returns "" once the stream is exhausted
    go _ ""   = return ()
    go d "."  = readDirStream d >>= go d
    go d ".." = readDirStream d >>= go d
    go d x    = do
      let newpath = path </> x
      e <- doesDirectoryExist newpath
      if e
        then list newpath     >> readDirStream d >>= go d
        else putStrLn newpath >> readDirStream d >>= go d
I think that System.Directory.getDirectoryContents constructs the whole list at once and therefore uses a lot of memory. How about using System.Posix.Directory instead? System.Posix.Directory.readDirStream returns entries one by one.
Also, the FileManip library might be useful, although I have never used it.
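If I remember its interface correctly, listing just the regular files with FileManip's System.FilePath.Find would look roughly like this (untested; listWithFileManip is just a made-up name):

import System.FilePath.Find (FileType (RegularFile), always, fileType, find, (==?))

-- Untested sketch using filemanip's System.FilePath.Find:
-- recurse into every directory, keep only regular files.
listWithFileManip :: FilePath -> IO [FilePath]
listWithFileManip = find always (fileType ==? RegularFile)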