 

Comparing data bytewise in an effective way (with C++)

Is there a more efficient way to compare data bytewise than using the comparison operator of the C++ std::list container?

I have to compare [large? 10 kByte < size < 500 kByte] amounts of data bytewise, to verify the integrity of external storage devices.

Therefore I read the files bytewise and store the values in a list of unsigned chars. The resources of this list are handled by a shared_ptr, so that I can pass it around in the program without having to worry about the size of the list:

typedef boost::shared_ptr< std::list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( const fs::path & filePath ){
    contentPtr actualContent( new std::list< unsigned char > );
    // Read the file with a stream, put the read values into actualContent
    return actualContent;
}

This is done twice, because there are always two copies of the file. The contents of the two files have to be compared, and an exception is thrown if a mismatch is found:

void CompareContent() throw( NotMatchingException ){
 // this part is very fast, below 50ms
 contentPtr contentA = GetContent("/fileA");
 contentPtr contentB = GetContent("/fileB");
 // the next part takes about 2secs with a file size of ~64kByte
 if( *contentA != *contentB )
      throw( NotMatchingException() );
}

My problem is this:
With increasing file size, the comparison of the lists gets very slow. With file sizes of about 100 kByte, comparing the contents takes up to two seconds, and the time grows and shrinks with the file size.

Is there a more efficient way of doing this comparison? Is it a problem of the container used?

zitroneneis asked Aug 31 '10


3 Answers

Don't use a std::list; use a std::vector.

std::list is a linked list; its elements are not guaranteed to be stored contiguously.

std::vector, on the other hand, seems far better suited for the specified task (storing contiguous bytes and comparing large chunks of data).
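A minimal sketch of the vector-based approach. The function names, file paths, and error handling below are my own, not from the question's code:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Read a whole file into a contiguous buffer.
std::vector<unsigned char> ReadFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

bool FilesMatch(const std::string& a, const std::string& b)
{
    // operator== on vectors compares sizes first, then the contiguous
    // element storage -- typically as fast as a memcmp.
    return ReadFile(a) == ReadFile(b);
}
```

The size check in `operator==` means two files of different lengths are rejected without touching a single byte of content.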

If you have to compare several files multiple times and don't care about where the differences are, you may also compute a hash of each file and compare the hashes. This would be even faster.
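To illustrate the hashing idea, here is a sketch using FNV-1a, a simple non-cryptographic hash chosen only for self-containedness; the answer doesn't name a specific hash, and for real integrity checks a stronger digest (e.g. SHA-256 via a crypto library) would be the usual choice:

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// FNV-1a over the bytes of a file (illustrative only, not collision-proof).
std::uint64_t HashFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::uint64_t h = 1469598103934665603ULL;       // FNV offset basis
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        h ^= static_cast<unsigned char>(*it);
        h *= 1099511628211ULL;                      // FNV prime
    }
    return h;
}
```

Cache one hash per file and compare hashes on subsequent checks. Note the asymmetry: differing hashes prove the files differ, but equal hashes do not strictly prove equality (collisions are possible).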

ereOn answered Nov 12 '22


My first piece of advice would be to profile your code.

The reason I say that is that, no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste days trying to optimize a part of your code that only takes 1% of your runtime as-is.

It could even be that there is something else you didn't notice before that is actually causing the slowness. You won't know until you profile.
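Even without a full profiler, a coarse manual timing of each phase would settle the question. A sketch of such a helper (the name and output format are my own):

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Time a single callable and print the elapsed milliseconds. A crude
// stand-in for a real profiler (gprof, perf, VTune), but enough to see
// whether the file I/O or the comparison dominates.
template <typename F>
auto Timed(const char* label, F&& f)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto result = std::forward<F>(f)();
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        clock::now() - start).count();
    std::cout << label << ": " << ms << " ms\n";
    return result;
}
```

Wrapping the read and the comparison separately, e.g. `Timed("read A", [&]{ return GetContent("/fileA"); })`, shows where the two seconds actually go; if reading dominates, optimizing the comparison won't help.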

T.E.D. answered Nov 12 '22


If there's nothing else to be done with the contents of those files (it looks like you're going to let them get deleted by shared_ptr at the end of CompareContent()'s scope), why not compare the files using iterators, without creating any containers at all?

Here's a piece of my code that compares two files bytewise:

// open both files in binary mode (stream setup not shown in the original answer)
std::ifstream local_f( "/fileA", std::ios::binary );
std::ifstream host_f( "/fileB", std::ios::binary );

// compare files
if (std::equal(std::istreambuf_iterator<char>(local_f),
               std::istreambuf_iterator<char>(),
               std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}
EDIT: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GotW #54. Or not -- profiling will tell. And even then, only one of the two identical files would need to be loaded into memory -- I'd read one into a deque and compare it with the other file's istreambuf_iterator.

Cubbi answered Nov 12 '22