
How to efficiently identify a binary file

What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare it with others.

The brute-force approach would be to use the whole file as a signature, which would take too long and too much memory. I'm looking for a smarter approach to this problem, and I'm willing to sacrifice a little accuracy (but not too much, ey) for performance.

(While Java code examples are preferred, language-agnostic answers are encouraged.)

Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering if there is a more efficient approach (i.e., a hash computed from an evenly distributed sampling of bytes).
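To make the sampling idea concrete, here is a minimal sketch of what such a signature could look like. The class name `SampledSignature`, the sample count, and the sample size are all hypothetical choices, not something taken from a library. It hashes the file length plus a fixed number of evenly spaced chunks, so the cost stays roughly constant regardless of file size, at the price of missing changes that fall between samples.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SampledSignature {
    // Hypothetical tuning values: 16 samples of 4 KiB each.
    static final int SAMPLES = 16;
    static final int SAMPLE_SIZE = 4096;

    /** SHA-1 over the file length and evenly spaced chunks of the file. */
    static byte[] of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            // Mix in the length so files of different sizes rarely collide.
            md.update(ByteBuffer.allocate(Long.BYTES).putLong(length).array());

            byte[] buf = new byte[SAMPLE_SIZE];
            if (length <= (long) SAMPLES * SAMPLE_SIZE) {
                // Small file: sampling saves nothing, so hash every byte.
                int n;
                while ((n = raf.read(buf)) > 0) {
                    md.update(buf, 0, n);
                }
            } else {
                // Large file: read SAMPLES chunks spread from start to end.
                long lastStart = length - SAMPLE_SIZE;
                for (int i = 0; i < SAMPLES; i++) {
                    long offset = lastStart * i / (SAMPLES - 1);
                    raf.seek(offset);
                    raf.readFully(buf);
                    md.update(buf);
                }
            }
        }
        return md.digest();
    }
}
```

Two signatures can then be compared with `java.util.Arrays.equals`. A match only means the sampled regions and the length agree, so it is best treated as a cheap pre-filter rather than proof of equality.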

hpique asked Dec 04 '22 11:12


2 Answers

An approach I found effective for this sort of thing was to calculate two SHA-1 hashes: one for the first block of the file (I arbitrarily picked 512 bytes as the block size) and one for the whole file. I then stored the two hashes along with the file size. When I needed to identify a file, I would first compare the file length. If the lengths matched, I would compare the hash of the first block, and if that matched, I compared the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
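A rough Java sketch of this three-stage check might look like the following. The class `FileSignature` and its method names are made up for illustration; only the 512-byte block size and the use of SHA-1 come from the description above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileSignature {
    static final int BLOCK_SIZE = 512; // block size mentioned in the answer

    final long size;
    final byte[] firstBlockHash;
    final byte[] fullHash;

    FileSignature(long size, byte[] firstBlockHash, byte[] fullHash) {
        this.size = size;
        this.firstBlockHash = firstBlockHash;
        this.fullHash = fullHash;
    }

    /** Computes the stored signature: size, first-block hash, full hash. */
    static FileSignature of(Path file) throws IOException, NoSuchAlgorithmException {
        return new FileSignature(Files.size(file),
                sha1(file, BLOCK_SIZE),
                sha1(file, Long.MAX_VALUE));
    }

    /** Cheapest test first; the full hash is computed only as a last step. */
    boolean matches(Path candidate) throws IOException, NoSuchAlgorithmException {
        if (Files.size(candidate) != size) {
            return false;
        }
        if (!MessageDigest.isEqual(sha1(candidate, BLOCK_SIZE), firstBlockHash)) {
            return false;
        }
        return MessageDigest.isEqual(sha1(candidate, Long.MAX_VALUE), fullHash);
    }

    /** SHA-1 of at most the first `limit` bytes of the file. */
    static byte[] sha1(Path file, long limit) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        long remaining = limit;
        try (InputStream in = Files.newInputStream(file)) {
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of file
                }
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        return md.digest();
    }
}
```

With a stored signature `sig = FileSignature.of(known)`, a call like `sig.matches(candidate)` reads the whole candidate file only when both the size and the first block already agree.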

Ferruccio answered Dec 29 '22 13:12


That's what hashing is for. See MessageDigest.

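Note that if your file is too big to be read into memory, that's OK because you can feed the file to the hash function in chunks. MD5 and SHA-1, for example, process their input internally in blocks of 512 bits, and MessageDigest.update can be called repeatedly with buffers of any size.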
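As an illustration, a chunked hash with `MessageDigest` could be sketched as below. The class name `ChunkedHash` and the 64 KiB buffer size are arbitrary assumptions; any buffer size gives the same digest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChunkedHash {
    /** Returns the SHA-1 of a file as lowercase hex, reading it in chunks. */
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[64 * 1024]; // arbitrary buffer size
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                // Only the bytes actually read are fed to the digest.
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Memory use stays bounded by the buffer, no matter how large the file is.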

Also, two files with the same hash aren't necessarily identical (although such a collision is very rare), but two identical files necessarily have the same hash.

NullUserException answered Dec 29 '22 15:12