 

What is an optimal way to find duplicate files in C++?

I want to find duplicate files on the file system in C++. Is there an algorithm to do that as fast as possible? And do I need to create a multi-threaded application, or can I just use one thread to do it?

asked Nov 29 '22 by unresolved_external

1 Answer

I concur with Kerrek SB that there are better tools for this than C++; however, assuming you really need to do this in C++, here are some suggestions and things to consider in your implementation:

  1. Use boost::filesystem (or std::filesystem, its C++17 standardization) for portable filesystem traversal; a traversal sketch follows this list.

  2. The hash-every-file suggestion is very reasonable, but it can be more efficient to first build a multimap with file size as the key, then hash only the files whose sizes collide; see the size-then-hash sketch after this list.

  3. Decide how you want to treat empty files and symbolic links/shortcuts.

  4. Decide how you want to treat special files, e.g. on Unix you have directories, FIFOs, sockets, etc.

  5. Account for the fact that files or the directory structure may change, disappear, or move while your algorithm is running.

  6. Account for the fact that some files or directories may be inaccessible or broken (e.g. recursive directory links).

  7. Make the number of threads configurable, since how much parallelization makes sense depends on the underlying disk hardware and configuration; it will be different on a simple hard drive than on an expensive SAN. Don't make assumptions, though; test it out. For instance, Linux is very good about caching files, so many of your reads will come from memory and thus not block on I/O. A threaded sketch follows below.
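
Here is a minimal traversal sketch for point 1 (it also touches points 3 through 6), assuming C++17's std::filesystem, whose API tracks boost::filesystem closely enough that swapping the namespace usually covers the port. The skip_permission_denied option and the error_code overloads are how it tolerates unreadable or vanishing entries:

```cpp
#include <filesystem>
#include <iostream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Collect regular files under `root`, skipping symlinks and special
// files, and tolerating entries that are unreadable or vanish mid-scan.
std::vector<fs::path> collect_regular_files(const fs::path& root)
{
    std::vector<fs::path> files;
    // skip_permission_denied keeps unreadable directories from
    // throwing and aborting the whole walk (point 6).
    for (const fs::directory_entry& entry :
         fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied))
    {
        // symlink_status() does not follow links, so symlinks (point 3)
        // and special files such as FIFOs or sockets (point 4) are
        // filtered out; the error_code overload absorbs races with
        // files deleted while we scan (point 5).
        std::error_code ec;
        if (fs::is_regular_file(entry.symlink_status(ec)) && !ec)
            files.push_back(entry.path());
    }
    return files;
}

int main(int argc, char** argv)
{
    const fs::path root = argc > 1 ? argv[1] : ".";
    for (const fs::path& p : collect_regular_files(root))
        std::cout << p << '\n';
}
```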
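For point 2, a sketch of the size-first grouping. The hash_file helper here is illustrative only (std::hash over the whole contents); a real implementation would stream the file through a strong digest such as SHA-256 and confirm matches with a byte-by-byte compare:

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Illustrative hash only: reads the whole file and runs std::hash over
// it. Real code should stream the data through a strong digest.
static std::size_t hash_file(const fs::path& p)
{
    std::ifstream in(p, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return std::hash<std::string>{}(data);
}

// Group files by size first; only files whose size collides are ever
// opened and hashed (point 2).
std::vector<std::vector<fs::path>>
find_duplicates(const std::vector<fs::path>& files)
{
    std::multimap<std::uintmax_t, fs::path> by_size;
    for (const fs::path& p : files) {
        std::error_code ec;
        const std::uintmax_t size = fs::file_size(p, ec);
        if (ec || size == 0) continue;  // one policy for point 3: skip empty files
        by_size.emplace(size, p);
    }

    std::vector<std::vector<fs::path>> groups;
    for (auto it = by_size.begin(); it != by_size.end(); ) {
        const auto size_end = by_size.upper_bound(it->first);
        if (std::next(it) == size_end) { it = size_end; continue; }  // unique size: no hash needed

        // Sizes collide, so now (and only now) pay for the hashes.
        std::multimap<std::size_t, fs::path> by_hash;
        for (; it != size_end; ++it)
            by_hash.emplace(hash_file(it->second), it->second);

        for (auto h = by_hash.begin(); h != by_hash.end(); ) {
            const auto hash_end = by_hash.upper_bound(h->first);
            if (std::distance(h, hash_end) > 1) {
                std::vector<fs::path> group;
                for (auto g = h; g != hash_end; ++g)
                    group.push_back(g->second);
                groups.push_back(std::move(group));
            }
            h = hash_end;
        }
    }
    return groups;
}

int main(int argc, char** argv)
{
    std::vector<fs::path> files(argv + 1, argv + argc);
    for (const auto& group : find_duplicates(files)) {
        std::cout << "duplicates:\n";
        for (const auto& p : group)
            std::cout << "  " << p << '\n';
    }
}
```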
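And for point 7, one way to make the thread count configurable: a hypothetical hash_all helper that fans the hashing work out over N worker threads via an atomic index (hash_file is the same illustrative stand-in as above):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

// Same illustrative hash as in the previous sketch.
static std::size_t hash_file(const fs::path& p)
{
    std::ifstream in(p, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return std::hash<std::string>{}(data);
}

// Hash `files` with a configurable number of worker threads (point 7).
// An atomic counter hands out the next index, so no lock is needed on
// the results vector: slot i is written by exactly one thread.
std::vector<std::size_t>
hash_all(const std::vector<fs::path>& files, unsigned num_threads)
{
    if (num_threads == 0)  // 0 means "pick something sensible"
        num_threads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::size_t> hashes(files.size());
    std::atomic<std::size_t> next{0};

    auto worker = [&] {
        for (std::size_t i; (i = next.fetch_add(1)) < files.size(); )
            hashes[i] = hash_file(files[i]);
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back(worker);
    for (std::thread& t : pool)
        t.join();
    return hashes;
}

int main(int argc, char** argv)
{
    std::vector<fs::path> files(argv + 1, argv + argc);
    const auto hashes = hash_all(files, 4);  // 4 is arbitrary; benchmark it
    for (std::size_t i = 0; i < files.size(); ++i)
        std::cout << hashes[i] << "  " << files[i] << '\n';
}
```

Run it with 1, 2, 4, ... threads against your real data set and keep the count that actually wins; as noted above, the answer differs between a laptop hard drive and a SAN, and a warm page cache can hide the difference entirely.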

answered Dec 05 '22 by frankc