
Compare whether two files are the same over the internet

Here is my scenario - I have a Windows Store app. I have a local file, and a link to a file on the internet. Is there a way I can check if these two files are the same, WITHOUT downloading the file from the link?

The code used to get the file is this:

private static async void SetImage(PlaylistItem song, string source, string imageName)
{

    HttpClient client = new HttpClient();

    HttpResponseMessage message = await client.GetAsync(source);

    StorageFolder myfolder = Windows.Storage.ApplicationData.Current.LocalFolder;
    StorageFile sampleFile = await myfolder.CreateFileAsync(imageName, CreationCollisionOption.ReplaceExisting);
    byte[] byteArrayFile = await message.Content.ReadAsByteArrayAsync();

    await FileIO.WriteBytesAsync(sampleFile, byteArrayFile);

    song.Image = new BitmapImage(new Uri(sampleFile.Path));

}
Mario Stoilov asked Aug 08 '13 11:08


1 Answer

The usual solution is to keep a hash of the remote file somewhere, typically in the file's metadata, and compare it with the hash of your local file. Short checksums such as CRC32 are a poor fit for this because their small output size makes collisions (different files producing the same value) far more likely than with a cryptographic hash.
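As a rough illustration of the comparison described above, the following sketch hashes a local file with SHA-256 and compares it with a hash published for the remote file. The class name, method names and the `publishedHash` parameter are hypothetical, and the code assumes the full .NET `System.Security.Cryptography` types are available; in a Windows Store app the WinRT `HashAlgorithmProvider` class plays the same role.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHashSketch
{
    // Returns the SHA-256 digest of the file at 'path' as lowercase hex.
    public static string ComputeSha256Hex(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // 'publishedHash' is an assumption: a hex SHA-256 value that the server
    // exposes somewhere (for example in the file's metadata).
    public static bool MatchesPublishedHash(string path, string publishedHash)
    {
        return string.Equals(ComputeSha256Hex(path), publishedHash,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

This only helps if the server publishes a hash computed with the same algorithm; otherwise there is nothing to compare the local value against.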

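Many storage services expose a content hash alongside the file. On Amazon S3 and CloudFiles the ETag of a simply uploaded object is typically its MD5 hash, while Azure Blob storage reports the MD5 in a separate Content-MD5 header. The ETag is the value used to detect changes to a file for caching and concurrency purposes, and in general it is opaque, so don't assume it is a hash unless the service documents it. Typically, a HEAD request on the file returns its headers, including the ETag, without the body.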
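A HEAD request like the one mentioned above might look roughly like this with `HttpClient`. The class and method names are hypothetical; the sketch returns `null` when the server sends no ETag.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class ETagProbe
{
    // Reads the ETag (including its surrounding quotes) from a response,
    // or returns null if the server did not send one.
    public static string ReadETag(HttpResponseMessage response)
    {
        var etag = response.Headers.ETag;
        return etag == null ? null : etag.Tag;
    }

    // Issues a HEAD request, so only the headers are transferred.
    public static async Task<string> GetRemoteETagAsync(HttpClient client, string url)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Head, url))
        using (var response = await client.SendAsync(request))
        {
            return ReadETag(response);
        }
    }
}
```

The returned value can be stored next to the downloaded file and compared on the next run.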

If you have the option of picking your own algorithm, choose SHA-256 or stronger. MD5 is often somewhat faster to compute, but practical collision attacks against it exist, and for files like these the hashing time is negligible next to the network transfer.

What storage service are you using?

EDIT

If you only want to check files to avoid downloading them again, you can use the ETag directly. ETag was created for exactly this purpose. You just have to store it together with your file when you download it the first time. That's how proxies and caches know to send you a cached version of a picture instead of hitting the destination server.

In fact, you can probably just do a GET on the file with an If-None-Match header carrying the stored ETag. An intermediate proxy or the origin web server will return a 304 (Not Modified) status code, with no body, if the destination file hasn't changed. Compared with a HEAD followed by a GET, this halves the number of requests you need to download all images in your list.
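A conditional GET of this kind could be sketched as follows. The class and method names are hypothetical, and `storedETag` is assumed to be the quoted ETag string saved from an earlier download.

```csharp
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ConditionalDownload
{
    // Builds a GET request that carries If-None-Match when an ETag is known.
    // 'storedETag' must be in header form, e.g. "\"abc123\"" (quotes included).
    public static HttpRequestMessage BuildRequest(string url, string storedETag)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (!string.IsNullOrEmpty(storedETag))
        {
            request.Headers.IfNoneMatch.Add(EntityTagHeaderValue.Parse(storedETag));
        }
        return request;
    }

    // Returns null when the server answers 304 Not Modified,
    // otherwise the freshly downloaded bytes.
    public static async Task<byte[]> DownloadIfChangedAsync(
        HttpClient client, string url, string storedETag)
    {
        using (var request = BuildRequest(url, storedETag))
        using (var response = await client.SendAsync(request))
        {
            if (response.StatusCode == HttpStatusCode.NotModified)
            {
                return null; // local copy is still current
            }
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsByteArrayAsync();
        }
    }
}
```

A real caller would also save the new ETag from the response whenever fresh bytes come back, so the next request can be conditional too.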

An alternative is to store the Last-Modified header value for the file and send it in the If-Modified-Since header of the GET request.
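The Last-Modified variant might look like this minimal sketch. The names are hypothetical; note that `HttpClient` exposes Last-Modified on the content headers rather than on the response headers.

```csharp
using System;
using System.Net.Http;

public static class LastModifiedSketch
{
    // Builds a GET request with If-Modified-Since set to the stored timestamp.
    // Passing null leaves the header off, which gives an ordinary GET.
    public static HttpRequestMessage BuildRequest(string url, DateTimeOffset? storedLastModified)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.IfModifiedSince = storedLastModified;
        return request;
    }

    // Reads Last-Modified from a response; null if the server did not send it.
    public static DateTimeOffset? ReadLastModified(HttpResponseMessage response)
    {
        return response.Content == null ? null : response.Content.Headers.LastModified;
    }
}
```

As with the ETag approach, a 304 status code on the response means the stored copy can be reused.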

EDIT 2

You mention that the ETag header is null, although your code doesn't show how you retrieve it.

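HttpResponseMessage has multiple Headers properties: one on the message itself (response.Headers) and one on its Content (response.Content.Headers). You need to use the proper property to retrieve the value. The ETag is exposed as response.Headers.ETag, whereas content headers such as Last-Modified live on response.Content.Headers.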

You can also check using Fiddler to ensure the server does actually return an ETag.

EDIT 3

Finally found a way to get an ETag from YouTube! The answer comes from "How to get thumbnail of YouTube video link using YouTube API?"

Doing a HEAD or GET on a YouTube thumbnail from ytimg.com does NOT return the ETag or Last-Modified headers.

Doing a GET on gdata.youtube.com through YouTube's Data API, on the other hand, returns a wealth of information about the video. An ETag value is included, although I suspect it changes whenever the video changes. This may be OK though, if you only want to download an image when the video changes, or you just want to avoid downloading the same image a second time.

The code I used was:

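// Video entry from YouTube's Data API (JSON); the response carries an ETag header.
var url = "http://gdata.youtube.com/feeds/api/videos/npvJ9FTgZbM?v=2&prettyprint=true&alt=json";

using (var client = new HttpClient())
{
    var response = await client.GetAsync(url);
    // The ETag is a response header, so read it from response.Headers,
    // not from response.Content.Headers.
    var etag1 = response.Headers.ETag;
    var content = await response.Content.ReadAsStringAsync();
    ...
}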
Panagiotis Kanavos answered Sep 18 '22 01:09