Is there a more effective way to compare data bytewise than using the comparison operator of the C++ list container?
I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.
Therefore I read files bytewise and store the values in a list of unsigned chars. The recources of this list are handled by a shared_ptr, so that I can pass it around in the program without the need to worry about the size of the list
typedef boost::shared_ptr< list< unsigned char > > = contentPtr;
namespace boost::filesystem = fs;
contentPtr GetContent( fs::path filePath ){
contentPtr actualContent (new list< unsigned char > );
// Read the file with a stream, put read values into actual content
return actualContent;
This is done twice, because there're always two copies of the file. The content of these two files has to be compared, and throw an exception if a mismatch is found
void CompareContent() throw( NotMatchingException() ){
// this part is very fast, below 50ms
contentPtr contentA = GetContent("/fileA");
contentPtr contentB = GetContent("/fileB");
// the next part takes about 2secs with a file size of ~64kByte
if( *contentA != *contentB )
throw( NotMatchingException() );
}
My problem is this:
With increasing file size, the comparison of the lists gets very slow. Working with file sizes of about 100 kByte, it will take up to two seconds to compare the content. Increasing and decreasing with the file size....
Is there a more effective way of doing this comparison? Is it a problem of the used container?
Don't use a std::list
use a std::vector
.
std::list
is a linked-list, elements are not guaranteed to be stored contiguously.
std::vector
on the other hand seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
My first piece of advice would be to profile your code.
The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.
It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
If there's nothing else to be done with the contents of those files (looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, not creating any containers at all?
Here's a piece of my code that compares two files bytewise:
// compare files
if (equal(std::istreambuf_iterator<char>(local_f),
std::istreambuf_iterator<char>(),
std::istreambuf_iterator<char>(host_f)))
{
// we're good: move table to OutPath, remove other files
EDIT: if you do need to keep contents around, I think std::deque
might be slightly more efficient than std::vector
for the reasons explained in GOTW #54.. or not -- profiling will tell. And still, there would be the need for only one of the two identical files to be loaded in memory -- I'd read one into a deque and compare with the other file's istreambuf_iterator.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With