
How do I check for a duplicate file from a URL before downloading

I have thousands of images in a folder on my computer, and I am trying to find out how to check whether the file at a given URL has already been downloaded. Is it possible somehow?

This only gives me the size of the file.

URL url = new URL("http://test.com/test.jpg");
url.openConnection().getContentLength();

To detect duplicate files I use

FileUtils.contentEquals(file1, file2)

Thank you for your answers!

Filip Bouška asked Nov 21 '22 16:11



1 Answer

If you have a base URL and store files under the same filenames, you can ask the server whether the image is worth downloading again, using the local file's modification time and the If-Modified-Since HTTP header.

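    import java.io.File;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // The local copy of the file; the path is a placeholder
    File f = new File("images/test.jpg");
    HttpURLConnection con = (HttpURLConnection) new URL("http://www.test.com/" + f.getName()).openConnection();
    // Send If-Modified-Since with the local file's modification time
    // (lastModified() returns 0 for a missing file, so no header is sent)
    con.setIfModifiedSince(f.lastModified());
    con.setRequestMethod("GET");
    con.connect();
    if (con.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) { // 304
        System.out.println(f + " : already downloaded");
    } else {
        // Download the content and store the image again
    }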

This works if the modification time of the local file has been left intact since the first download and if the server supports the If-Modified-Since header.
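One way to keep the local modification time meaningful is to stamp each file with the server's Last-Modified value right after downloading it. The following is only a minimal sketch, assuming the server sends a Last-Modified header; `StampedDownload`, `download`, and `stamp` are hypothetical names and not part of any library:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;

class StampedDownload {

    /** Sets the file's modification time; a value of 0 means the header was absent, so nothing is changed. */
    static void stamp(Path target, long lastModifiedMillis) throws IOException {
        if (lastModifiedMillis > 0) {
            Files.setLastModifiedTime(target, FileTime.fromMillis(lastModifiedMillis));
        }
    }

    /** Downloads url to target, then copies the server's Last-Modified onto the local file. */
    static void download(URL url, Path target) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        try (InputStream in = con.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            // getLastModified() returns 0 when the server sent no Last-Modified header
            stamp(target, con.getLastModified());
        } finally {
            con.disconnect();
        }
    }
}
```

A later request that passes `f.lastModified()` to `setIfModifiedSince` then sends back exactly the timestamp the server reported, which avoids depending on the local clock.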

If you don't know how to match the filename to the URL, there is no obvious way to do it.

You could experiment with a fast HEAD request and extract some relevant information, such as:

  • Content-Length
  • Last-Modified
  • ETag

Content-Length + Last-Modified could be a good match.
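A rough sketch of that HEAD-based check is shown here. It is a heuristic, not a guarantee: it assumes the server reports both headers, and it treats a local file as a match when the sizes are equal and the server's Last-Modified is not newer than the local modification time. `HeadCheck`, `looksSame`, and `probablyDownloaded` are hypothetical names:

```java
import java.io.File;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeadCheck {

    /** Pure comparison of local and remote metadata; times are in milliseconds since the epoch. */
    static boolean looksSame(long localSize, long localMtime, long remoteSize, long remoteMtime) {
        // A negative size or a zero time means the server did not send that header
        if (remoteSize < 0 || remoteMtime <= 0) {
            return false;
        }
        return localSize == remoteSize && remoteMtime <= localMtime;
    }

    /** Sends a HEAD request (no body is transferred) and compares the headers with the local file. */
    static boolean probablyDownloaded(File local, URL url) throws IOException {
        if (!local.isFile()) {
            return false;
        }
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("HEAD");
        try {
            if (con.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return false;
            }
            return looksSame(local.length(), local.lastModified(),
                    con.getContentLengthLong(), con.getLastModified());
        } finally {
            con.disconnect();
        }
    }
}
```

Two different images can share a size, so a positive result could still be confirmed with `FileUtils.contentEquals` after downloading when certainty matters.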

For ETags, if you know how the HTTP server builds the ETag, you could try to build it on your side (for all your local files) and use it as a value to compare. Some info on ETags:

  • http://bitworking.org/news/150/REST-Tip-Deep-etags-give-you-more-benefits

  • https://serverfault.com/questions/120538/etag-configuration-with-multiple-apache-servers-or-cdn-how-does-google-do-etag

Unfortunately, an ETag can be constructed from information visible only to the server (such as the inode number), in which case it will be impossible for you to rebuild it.

It will certainly be easier/faster to download your files again.

nomoa answered Jan 10 '23 06:01
