Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to replace duplicate files with hard links using python?

I'm a photographer and doing many backups. Over the years I found myself with a lot of hard drives. Now I bought a NAS and copied all my pictures on one 3TB raid 1 using rsync. According to my script about 1TB of those files are duplicates. That comes from doing multiple backups before deleting files on my laptop and being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messes things up. Can you please have a look at my duplicate finder script and tell me if you think I can run it or not? I tried it on a test folder and it seems ok, but I don't want to mess things up on the NAS.

The script has three steps in three files. In this First part I find all image and metadata files and put them into a shelve database (datenbank) with their size as key.

import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"
file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]
walker = os.walk(path_to_search)

counter = 0

for dirpath, dirnames, filenames in walker:
  if filenames:
    for filename in filenames:
      counter += 1
      print str(counter)
      for file_ext in file_exts:
        if file_ext in filename:
          filepath = os.path.join(dirpath, filename)
          filesize = str(os.path.getsize(filepath))
          if not filesize in datenbank:
            datenbank[filesize] = []
          tmp = datenbank[filesize]
          if filepath not in tmp:
            tmp.append(filepath)
            datenbank[filesize] = tmp

datenbank.sync()
print "done"
datenbank.close()

The second part. Now I drop all file sizes which only have one file in their list and create another shelve database with the md5 hash as key and a list of files as value.

import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()


for filesize in datenbank:
  filepaths = datenbank[filesize]
  filepath_count = len(filepaths)
  if filepath_count > 1:
    counter += filepath_count -1
    space += (filepath_count -1) * int(filesize)
    for filepath in filepaths:
      print counter
      checksum = md5Checksum(filepath)
      if checksum not in datenbank_step2:
        datenbank_step2[checksum] = []
      temp = datenbank_step2[checksum]
      if filepath not in temp:
        temp.append(filepath)
        datenbank_step2[checksum] = temp

print counter
print str(space)

datenbank_step2.sync()
datenbank_step2.close()
print "done"

And finally the most dangerous part. For evrey md5 key i retrieve the file list and do an additional sha1. If it matches I delete every file in that list execept the first one and create a hard link to replace the deleted files.

import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
  switch = True
  for path in datenbank[hashvalue]:
    if switch:
      original = path
      original_checksum = sha1Checksum(path)
      switch = False
    else:
      if sha1Checksum(path) == original_checksum:
        os.unlink(path)
        os.link(original, path)
        print "delete: ", path
print "done"

What do you think? Thank you very much.

*if that's somehow important: It's a synology 713+ and has an ext3 or ext4 filesystem.

like image 349
JasonTS Avatar asked Oct 22 '22 07:10

JasonTS


2 Answers

This looked good, and after sanitizing a bit (to make it work with python 3.4), I ran this on my NAS. While I had hardlinks for files that had not been modified between backups, files that had moved were being duplicated. This recovered that lost disk space for me.

A minor nitpick is that files that are already hardlinks are deleted and relinked. This does not affect the end result anyway.

I did slightly alter the third file ("3.py"):

if sha1Checksum(path) == original_checksum:
     tmp_filename = path + ".deleteme"
     os.rename(path, tmp_filename)
     os.link(original, path)
     os.unlink(tmp_filename)
     print("Deleted {} ".format(path))

This makes sure that in case of a power-failure or some other similar error, no files are lost, though a trailing "deleteme" is left behind. A recovery script should be quite trivial.

like image 77
WhyNotHugo Avatar answered Oct 27 '22 00:10

WhyNotHugo


Why not compare the files byte for byte instead of the second checksum? One in a billion two checksums might accidentally match, but direct comparison shouldn't fail. It shouldn't be slower, and might even be faster. Maybe it could be slower when there are more than two files and you have to read the original file for each other. If you really wanted you could get around that by comparing blocks of all the files at once.

EDIT:

I don't think it would require more code, just different. Something like this for the loop body:

data1 = fh1.read(8192)
data2 = fh2.read(8192)
if data1 != data2: return False
like image 38
morningstar Avatar answered Oct 27 '22 00:10

morningstar