I'm a photographer and I make a lot of backups. Over the years I ended up with a lot of hard drives. I recently bought a NAS and copied all my pictures onto one 3 TB RAID 1 volume using rsync. According to my script, about 1 TB of those files are duplicates. That comes from doing multiple backups before deleting files on my laptop and from being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messed things up. Can you please have a look at my duplicate finder script and tell me whether you think I can run it? I tried it on a test folder and it seems OK, but I don't want to mess things up on the NAS.
The script has three steps in three files. In this first part I find all image and metadata files and put them into a shelve database (datenbank) keyed by their file size.
import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"

file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]

walker = os.walk(path_to_search)
counter = 0

for dirpath, dirnames, filenames in walker:
    if filenames:
        for filename in filenames:
            counter += 1
            print str(counter)
            for file_ext in file_exts:
                if file_ext in filename:
                    filepath = os.path.join(dirpath, filename)
                    filesize = str(os.path.getsize(filepath))
                    if not filesize in datenbank:
                        datenbank[filesize] = []
                    tmp = datenbank[filesize]
                    if filepath not in tmp:
                        tmp.append(filepath)
                        datenbank[filesize] = tmp

datenbank.sync()
print "done"
datenbank.close()
The second part. Now I drop all file sizes that have only one file in their list and create another shelve database with the MD5 hash as key and a list of files as value.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)
datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for filesize in datenbank:
    filepaths = datenbank[filesize]
    filepath_count = len(filepaths)
    if filepath_count > 1:
        counter += filepath_count - 1
        space += (filepath_count - 1) * int(filesize)
        for filepath in filepaths:
            print counter
            checksum = md5Checksum(filepath)
            if checksum not in datenbank_step2:
                datenbank_step2[checksum] = []
            temp = datenbank_step2[checksum]
            if filepath not in temp:
                temp.append(filepath)
                datenbank_step2[checksum] = temp

print counter
print str(space)
datenbank_step2.sync()
datenbank_step2.close()
print "done"
And finally the most dangerous part. For every MD5 key I retrieve the file list and do an additional SHA-1 check. If it matches, I delete every file in that list except the first one and create a hard link to replace each deleted file.
import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
    switch = True
    for path in datenbank[hashvalue]:
        if switch:
            original = path
            original_checksum = sha1Checksum(path)
            switch = False
        else:
            if sha1Checksum(path) == original_checksum:
                os.unlink(path)
                os.link(original, path)
                print "delete: ", path

print "done"
What do you think? Thank you very much.
*In case it's somehow important: it's a Synology 713+ with an ext3 or ext4 filesystem.
This looked good, and after sanitizing it a bit (to make it work with Python 3.4), I ran it on my NAS. Files that had not been modified between backups were already hard links, but files that had moved were being duplicated, and this script recovered that lost disk space for me.
A minor nitpick is that files which are already hard links to each other are deleted and relinked anyway. This does not affect the end result, though.
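If you wanted to avoid that, one option would be to check whether two paths already point at the same data before relinking. A minimal sketch (the helper name already_linked is my own, not part of the original script):

import os

def already_linked(path_a, path_b):
    # On the same filesystem, two hard links to the same file share
    # the same device and inode numbers.
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)

In step 3, the relink could then be skipped with something like "if already_linked(original, path): continue" before the SHA-1 check.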
I did slightly alter the third file ("3.py"):
if sha1Checksum(path) == original_checksum:
    tmp_filename = path + ".deleteme"
    os.rename(path, tmp_filename)
    os.link(original, path)
    os.unlink(tmp_filename)
    print("Deleted {}".format(path))
This makes sure that in case of a power failure or some other similar error, no files are lost, though a file with a trailing ".deleteme" suffix may be left behind. A recovery script should be quite trivial.
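For what it's worth, such a recovery pass might look roughly like this (a sketch only; it assumes the leftovers are files renamed to *.deleteme and reuses the path_to_search from step 1):

import os

path_to_search = "/volume1/backup_2tb_wd/"

for dirpath, dirnames, filenames in os.walk(path_to_search):
    for filename in filenames:
        if not filename.endswith(".deleteme"):
            continue
        leftover = os.path.join(dirpath, filename)
        restored = leftover[:-len(".deleteme")]
        if os.path.exists(restored):
            # The replacement hard link was already created,
            # so the leftover copy is redundant.
            os.unlink(leftover)
        else:
            # Interrupted between rename and link:
            # put the original name back.
            os.rename(leftover, restored)
        print("recovered: {}".format(restored))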
Why not compare the files byte for byte instead of computing a second checksum? There's maybe a one-in-a-billion chance that two checksums accidentally match, but a direct comparison can't give a false positive. It shouldn't be slower, and might even be faster. It could be slower when there are more than two files in a group and you have to re-read the original file once for each of the others; if you really wanted, you could get around that by comparing blocks of all the files at once.
EDIT:
I don't think it would require more code, just different code. Something like this for the loop body:
data1 = fh1.read(8192)
data2 = fh2.read(8192)
if data1 != data2: return False
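Wrapped into a complete helper, a pairwise comparison could look like this (just a sketch of the idea; files_equal is a name I'm making up, and in step 3 it would take the place of the sha1Checksum(path) == original_checksum test as files_equal(original, path)):

import os

def files_equal(path_a, path_b, blocksize=8192):
    # Cheap size check first, then compare block by block and stop
    # at the first difference.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fh1, open(path_b, 'rb') as fh2:
        while True:
            data1 = fh1.read(blocksize)
            data2 = fh2.read(blocksize)
            if data1 != data2:
                return False
            if not data1:
                # Both files exhausted at the same time: identical.
                return True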