 

Comparing data bytewise in an effective way (with C++)

Is there a more efficient way to compare data bytewise than using the comparison operator of the C++ std::list container?

I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.

Therefore I read the files bytewise and store the values in a list of unsigned chars. The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without having to worry about the size of the list:

typedef boost::shared_ptr< std::list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( const fs::path & filePath ){
    contentPtr actualContent( new std::list< unsigned char > );
    // Read the file with a stream, put the read values into actualContent
    return actualContent;
}

This is done twice, because there are always two copies of the file. The contents of the two files have to be compared, and an exception is thrown if a mismatch is found:

void CompareContent() throw( NotMatchingException ){
 // this part is very fast, below 50ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2secs with a file size of ~64kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:
With increasing file size, the comparison of the lists gets very slow. With file sizes of about 100 kByte, comparing the contents takes up to two seconds, and the time grows and shrinks with the file size.

Is there a more efficient way of doing this comparison? Is it a problem of the container used?

zitroneneis asked Aug 31 '10


3 Answers

Don't use a std::list; use a std::vector.

std::list is a linked list; its elements are not guaranteed to be stored contiguously.

std::vector, on the other hand, seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
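A minimal sketch of the vector-based approach. The function names, file paths, and error handling below are my own, not from the question's code:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Read a whole file into a contiguous buffer.
std::vector<unsigned char> ReadFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

bool FilesMatch(const std::string& a, const std::string& b)
{
    // operator== on vectors compares sizes first, then the contiguous
    // element storage -- typically as fast as a memcmp.
    return ReadFile(a) == ReadFile(b);
}
```

The size check in `operator==` means two files of different lengths are rejected without touching a single byte of content.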

If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
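To illustrate the hashing idea, here is a sketch using FNV-1a, a simple non-cryptographic hash chosen only for self-containedness; the answer doesn't name a specific hash, and for real integrity checks a stronger digest (e.g. SHA-256 via a crypto library) would be the usual choice:

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// FNV-1a over the bytes of a file (illustrative only, not collision-proof).
std::uint64_t HashFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::uint64_t h = 1469598103934665603ULL;       // FNV offset basis
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        h ^= static_cast<unsigned char>(*it);
        h *= 1099511628211ULL;                      // FNV prime
    }
    return h;
}
```

Cache one hash per file and compare hashes on subsequent checks. Note the asymmetry: differing hashes prove the files differ, but equal hashes do not strictly prove equality (collisions are possible).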

ereOn answered Nov 12 '22


My first piece of advice would be to profile your code.

The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.

It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
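Even without a full profiler, a coarse manual timing of each phase would settle the question. A sketch of such a helper (the name and output format are my own):

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Time a single callable and print the elapsed milliseconds. A crude
// stand-in for a real profiler (gprof, perf, VTune), but enough to see
// whether the file I/O or the comparison dominates.
template <typename F>
auto Timed(const char* label, F&& f)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto result = std::forward<F>(f)();
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        clock::now() - start).count();
    std::cout << label << ": " << ms << " ms\n";
    return result;
}
```

Wrapping the read and the comparison separately, e.g. `Timed("read A", [&]{ return GetContent("/fileA"); })`, shows where the two seconds actually go; if reading dominates, optimizing the comparison won't help.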

T.E.D. answered Nov 12 '22


If there's nothing else to be done with the contents of those files (it looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, without creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// open both files in binary mode (stream setup not shown in the original answer)
std::ifstream local_f( "/fileA", std::ios::binary );
std::ifstream host_f( "/fileB", std::ios::binary );

// compare files
if (std::equal(std::istreambuf_iterator<char>(local_f),
               std::istreambuf_iterator<char>(),
               std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}
EDIT: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GotW #54. Or not -- profiling will tell. And even then, only one of the two identical files would need to be loaded into memory -- I'd read one into a deque and compare it with the other file's istreambuf_iterator.

Cubbi answered Nov 12 '22