
Verifying that two files are identical using pure PHP?

TL;DR: I have a CMS that stores attachments (opaque files) using the SHA-1 of the file contents as the filename. How can I verify that an uploaded file really matches the one in storage, given that I already know the SHA-1 hashes of both files match? I'd like high performance.

Long version:

When a user uploads a new file to the system, I compute the SHA-1 hash of the uploaded file contents and then check whether a file with an identical hash already exists in the storage backend. PHP puts the uploaded file in /tmp before my code gets to run, and I then run sha1sum against the uploaded file to get the SHA-1 hash of its contents. I then compute the fanout from the SHA-1 hash and decide the storage directory under an NFS-mounted directory hierarchy. (For example, if the SHA-1 hash of a file's contents is 37aefc1e145992f2cc16fabadcfe23eede5fb094, the permanent file name is /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094.) In addition to saving the actual file contents, I INSERT a new row into an SQL database for the user-submitted metadata (e.g. Content-Type, original filename, datestamp, etc.).
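
The fanout described above could be computed roughly as follows. This is only a sketch: the function name storagePathForHash() and the default root are illustrative assumptions, and only the two-level split (2 + 2 + 36 hex characters) is taken from the example path in the question.

```php
<?php
// Hypothetical helper: name and default $root are assumptions for illustration.
function storagePathForHash(string $sha1Hex, string $root = '/nfs/data/files'): string
{
    // Accept only a full 40-character hexadecimal SHA-1 digest.
    if (!preg_match('/^[0-9a-f]{40}$/i', $sha1Hex)) {
        throw new InvalidArgumentException('Expected a 40-character hex SHA-1 digest');
    }
    $hex = strtolower($sha1Hex);

    // Two fanout levels: first two hex characters, next two, then the remaining 36.
    return sprintf(
        '%s/%s/%s/%s',
        rtrim($root, '/'),
        substr($hex, 0, 2),
        substr($hex, 2, 2),
        substr($hex, 4)
    );
}

echo storagePathForHash('37aefc1e145992f2cc16fabadcfe23eede5fb094'), "\n";
// prints /nfs/data/files/37/ae/fc1e145992f2cc16fabadcfe23eede5fb094
```

Validating the digest before building the path also keeps anything that is not plain hex out of the directory hierarchy.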

The corner case I'm currently working out is a newly uploaded file whose SHA-1 hash matches an existing hash in the storage backend. I know that the chances of this happening by accident are astronomically low, but I'd like to be sure. (For the deliberate case, see https://shattered.io/)

Given two filenames $file_a and $file_b, how can I quickly check whether both files have identical contents? Assume that the files are too big to be loaded into memory. With Python, I'd use filecmp.cmp(), but PHP does not seem to have anything similar. I know that this can be done with fread(), aborting as soon as a non-matching byte is found, but I'd rather not write that code.

Mikko Rantalainen asked Sep 17 '13 12:09


4 Answers

If you already have one SHA1 sum, you can simply do:

if ($known_sha1 === sha1_file($new_file))

otherwise

if (filesize($file_a) === filesize($file_b)
    && md5_file($file_a) === md5_file($file_b)
)

The file size is checked as well, so that a hash collision (already very unlikely) is even less likely to go unnoticed. MD5 is used because it is typically faster than the SHA algorithms, although its 128-bit digest is shorter than SHA-1's 160 bits and therefore somewhat more collision-prone.
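
The size-plus-hash check could be wrapped in a small helper along these lines. This is a sketch only: the name filesMatchByHash() is hypothetical, and hash_file() is used so the algorithm can be swapped. Strict comparison (===) is used on purpose, because PHP's loose == treats two numeric-looking strings such as '0e123' and '0e456' as numbers, and a hex digest can occasionally look like that.

```php
<?php
// Hypothetical helper name; a sketch of the size-plus-hash check.
function filesMatchByHash(string $fileA, string $fileB, string $algo = 'md5'): bool
{
    $sizeA = filesize($fileA);
    $sizeB = filesize($fileB);

    // Unreadable files or differing sizes: no need to hash anything.
    if ($sizeA === false || $sizeB === false || $sizeA !== $sizeB) {
        return false;
    }

    // hash_file() streams the file, so memory use stays small for big files.
    $hashA = hash_file($algo, $fileA);
    $hashB = hash_file($algo, $fileB);

    return $hashA !== false && $hashA === $hashB;
}

// Why strict comparison matters for hex digests:
var_dump('0e123' == '0e456');  // bool(true): both are read as the number 0
var_dump('0e123' === '0e456'); // bool(false)
```

Note that this still only shows that the digests agree; it does not prove the contents are byte-for-byte identical.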


Update:

This is how to exactly compare two files against each other.

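function compareFiles($file_a, $file_b)
{
    // Files of different sizes cannot be identical.
    if (filesize($file_a) !== filesize($file_b))
        return false;

    $chunksize = 4096;
    $fp_a = fopen($file_a, 'rb');
    $fp_b = fopen($file_b, 'rb');

    // Treat files that cannot be opened as "not identical".
    if ($fp_a === false || $fp_b === false)
    {
        if ($fp_a !== false) fclose($fp_a);
        if ($fp_b !== false) fclose($fp_b);
        return false;
    }

    // Read both files chunk by chunk and stop at the first mismatch.
    while (!feof($fp_a) && !feof($fp_b))
    {
        $d_a = fread($fp_a, $chunksize);
        $d_b = fread($fp_b, $chunksize);
        if ($d_a === false || $d_b === false || $d_a !== $d_b)
        {
            fclose($fp_a);
            fclose($fp_b);
            return false;
        }
    }

    fclose($fp_a);
    fclose($fp_b);

    return true;
}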

Cobra_Fast answered Oct 08 '22 14:10


Update

If you want to make sure that the files are equal, first check the file sizes and, if they match, compare the file contents directly. This is typically faster than hashing both files, because the comparison can stop at the first differing byte, and it gives a definitive result.
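
A minimal sketch of that size-then-content comparison, assuming plain local or NFS files. The function name filesAreIdentical() and the 64 KiB chunk size are illustrative assumptions, not tuned values.

```php
<?php
// Hypothetical helper; the default chunk size is an arbitrary assumption.
function filesAreIdentical(string $fileA, string $fileB, int $chunkSize = 65536): bool
{
    // Different sizes mean different contents, so nothing has to be read.
    if (filesize($fileA) !== filesize($fileB)) {
        return false;
    }

    $fpA = fopen($fileA, 'rb');
    $fpB = fopen($fileB, 'rb');
    if ($fpA === false || $fpB === false) {
        // Close whichever handle did open before reporting the failure.
        if ($fpA !== false) { fclose($fpA); }
        if ($fpB !== false) { fclose($fpB); }
        throw new RuntimeException('Cannot open both files for comparison');
    }

    try {
        // Only one chunk per file is held in memory at any time.
        while (!feof($fpA) || !feof($fpB)) {
            $a = fread($fpA, $chunkSize);
            $b = fread($fpB, $chunkSize);
            if ($a === false || $b === false || $a !== $b) {
                return false; // first mismatch (or read error) ends the comparison
            }
        }
        return true;
    } finally {
        // Runs on every exit path, so the handles are never leaked.
        fclose($fpA);
        fclose($fpB);
    }
}
```

In the upload scenario from the question, this would be called with the temporary upload and the existing file in storage once their SHA-1 hashes have been found to match.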


You do not need to load the whole file into memory if you hash the contents using md5_file(), sha1_file() or another hash function. Here is an example using MD5:

$hash = md5_file('big.file'); // big.file is 1 GB in my test
var_dump(memory_get_peak_usage());

Output:

int(330540)

In your example it would be:

if(md5_file('FILEA') === md5_file('FILEB')) {
    echo 'files are equal';
}

Note also that with any hash function you have to trade off complexity on the one hand against the probability of collisions (two different messages producing the same hash) on the other.
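
To put rough numbers on the accidental-collision side of that trade-off, the usual birthday-bound approximation p ≈ n² / 2^(bits + 1) can be evaluated directly. This is a sketch under stated assumptions: the function name and the figure of one billion stored files are illustrative, the formula only holds while p is small, and it says nothing about deliberately constructed collisions.

```php
<?php
// Birthday-bound approximation for accidental collisions among $fileCount
// random digests of $hashBits bits. Hypothetical helper name.
function accidentalCollisionProbability(float $fileCount, int $hashBits): float
{
    // n(n-1)/2 pairs, each colliding with probability 2^-bits, is roughly n^2 / 2^(bits+1).
    return ($fileCount ** 2) / (2.0 ** ($hashBits + 1));
}

// Illustrative assumption: one billion stored files.
printf("MD5   (128 bits): %.2e\n", accidentalCollisionProbability(1e9, 128));
printf("SHA-1 (160 bits): %.2e\n", accidentalCollisionProbability(1e9, 160));
```

For a billion files this comes out on the order of 10^-21 for MD5 and 10^-31 for SHA-1.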

hek2mgl answered Oct 08 '22 15:10


When your files are big and binary, you can just test a few bytes at a few offsets. It should be much faster than any hashing function, especially since the function returns as soon as it finds the first differing byte.

However, this method won't work for files that differ in only a few bytes, because the sampled offsets may miss them. It works best for big archives, videos and so on.

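function areFilesEqual($filename1, $filename2, $accuracy)
{
    // $accuracy is the step in bytes between sampled offsets; a step
    // below 1 would make the loop below run forever.
    if ($accuracy < 1) {
        throw new InvalidArgumentException('$accuracy must be at least 1');
    }

    $filesize1 = filesize($filename1);
    $filesize2 = filesize($filename2);

    if ($filesize1 === false || $filesize1 !== $filesize2) {
        return false;
    }

    // Open in binary mode so bytes are compared as stored.
    $file1 = fopen($filename1, 'rb');
    $file2 = fopen($filename2, 'rb');

    $equal = true;
    for ($i = 0; $i < $filesize1; $i += $accuracy) {
        fseek($file1, $i);
        fseek($file2, $i);
        if (fgetc($file1) !== fgetc($file2)) {
            $equal = false; // stop at the first differing sample
            break;
        }
    }

    // Close the handles on every path, including an early mismatch.
    fclose($file1);
    fclose($file2);

    return $equal;
}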

sliwhas answered Oct 08 '22 15:10


Use the SHA-1 hash, just like you do. If the hashes are equal, also compare the MD5 hashes and the file sizes. If you THEN encounter a file that matches in all 3 checks, but is NOT equal - you just found the holy grail :D
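
The three checks can share one read of the file, which may matter on NFS. The following is a sketch only: fileFingerprint() is a hypothetical name and the chunk size is an arbitrary assumption. It feeds each chunk to both digests with PHP's incremental hashing functions (hash_init(), hash_update(), hash_final()).

```php
<?php
// Hypothetical helper: size, SHA-1 and MD5 computed in a single pass.
function fileFingerprint(string $path, int $chunkSize = 65536): array
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $sha1 = hash_init('sha1');
    $md5  = hash_init('md5');
    $size = 0;

    try {
        while (!feof($fp)) {
            $chunk = fread($fp, $chunkSize);
            if ($chunk === false) {
                throw new RuntimeException("Read error on $path");
            }
            // Each chunk updates both digests, so the file is read only once.
            $size += strlen($chunk);
            hash_update($sha1, $chunk);
            hash_update($md5, $chunk);
        }
    } finally {
        fclose($fp);
    }

    return [
        'size' => $size,
        'sha1' => hash_final($sha1),
        'md5'  => hash_final($md5),
    ];
}
```

Two files then pass all three checks when fileFingerprint($file_a) === fileFingerprint($file_b), since strict array comparison requires the same keys, values and types.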

dognose answered Oct 08 '22 15:10