I am writing a Python program to find and remove duplicate files from a folder.
I have multiple copies of mp3 files, and some other files. I am using the SHA-1 algorithm.
How can I find these duplicate files and remove them?
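For context, here is a minimal brute-force sketch of what the question describes: walk the folder, hash every file with SHA-1, and group paths by digest. The folder argument, the helper names and the commented-out deletion step are illustrative assumptions, not a fixed recipe; the answer below shows a much faster way to do the same job.

import hashlib
import os

def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest()

def find_duplicates(folder):
    """Return {digest: [paths]} for every digest shared by more than one file."""
    by_digest = {}
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            by_digest.setdefault(sha1_of_file(full_path), []).append(full_path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates(".").items():  # "." is a placeholder folder
        print("Same content ({}): {}".format(digest, paths))
        # to remove duplicates, keep paths[0] and os.remove() the rest - only after reviewing the list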
The approaches in the other solutions are very cool, but they overlook an important property of duplicate files: they have the same file size. Calculating the expensive hash only on files that share a size saves a tremendous amount of CPU; performance comparisons are at the end, and here's the explanation.
Iterating on the solid answers given by @nosklo, and borrowing the idea of @Raffi to take a fast hash of just the beginning of each file and calculate the full one only on collisions in the fast hash, here are the steps:

1. Walk the given paths and group files by size - only sizes shared by more than one file can hold duplicates.
2. For each such group, hash only the first 1024 bytes of every file; entries whose (small hash, size) key is unique are dropped.
3. For files that collide on the small hash, compute the full-file hash - equal full hashes are reported as duplicates.
The code:
#!/usr/bin/env python
# if running in py3, change the shebang, drop the next import for readability (it does no harm in py3)
from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes"""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    file_object = open(filename, 'rb')

    if first_chunk_only:
        hashobj.update(file_object.read(1024))
    else:
        for chunk in chunk_reader(file_object):
            hashobj.update(chunk)
    hashed = hashobj.digest()

    file_object.close()
    return hashed


def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)    # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}                    # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except OSError:
                    # not accessible (permissions, etc) - pass on
                    continue

    # For all files with the same file size, get their hash on the first 1024 bytes only
    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue    # this file size is unique, no need to spend CPU cycles on it

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
                # the key is the hash on the first 1024 bytes plus the size - to
                # avoid collisions on equal hashes in the first part of the file
                # credits to @Futal for the optimization
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except OSError:
                # the file access might've changed till the exec point got here
                continue

    # For all files with the same hash on the first 1024 bytes, get their hash on the full file - collisions will be duplicates
    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue    # this hash of the first 1k file bytes is unique, no need to spend CPU cycles on it

        for filename in files_list:
            try:
                full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except OSError:
                # the file access might've changed till the exec point got here
                continue


if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
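The script above only reports duplicates; since the question also asks about removing them, here is one hedged way to extend it. The remove_duplicates() helper and the keep-the-original-copy policy are illustrative assumptions, not part of the original answer: collect the (duplicate, original) pairs instead of printing them, review the list, then delete.

import os

def remove_duplicates(duplicate_pairs, dry_run=True):
    """Delete the first file of each (duplicate, kept_original) pair.

    duplicate_pairs is assumed to be a list of (dup_path, kept_path) tuples,
    e.g. gathered by appending (filename, duplicate) inside check_for_duplicates()
    instead of printing.
    """
    for dup_path, kept_path in duplicate_pairs:
        print("Removing {} (same content as {})".format(dup_path, kept_path))
        if not dry_run:
            os.remove(dup_path)

# usage sketch: run once with dry_run=True to inspect what would be deleted,
# then call remove_duplicates(pairs, dry_run=False) to actually free the space.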
And, here's the fun part - performance comparisons.
Baseline -
Processor       : Feroceon 88FR131 rev 1 (v5l)
BogoMIPS        : 1599.07
(i.e. my low-end NAS :), running Python 2.7.11.
So, the output of @nosklo's very handy solution:
root@NAS:InstantUpload# time ~/scripts/checkDuplicates.py
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg

real    5m44.198s
user    4m44.550s
sys     0m33.530s
And here's the version that filters on file size first, then hashes only the first 1024 bytes, and finally computes the full hash when those small hashes collide:
root@NAS:InstantUpload# time ~/scripts/checkDuplicatesSmallHash.py . "/i-data/51608399/photo/Todor phone"
Duplicate found: ./IMG_20160216_074620 (2).jpg and ./IMG_20160216_074620.jpg
Duplicate found: ./IMG_20160204_150311.jpg and ./IMG_20160204_150311 (2).jpg
Duplicate found: ./IMG_20151231_143053 (2).jpg and ./IMG_20151231_143053.jpg
Duplicate found: ./IMG_20151125_233019 (2).jpg and ./IMG_20151125_233019.jpg

real    0m1.398s
user    0m1.200s
sys     0m0.080s
Both versions were run 3 times each, to get the average of the time needed.
So v1 is (user+sys) 284s, and the other is 2s; quite a difference, huh :) With this speedup, one could go to SHA-512, or even fancier - the performance penalty is mitigated by the fewer calculations needed.
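As a hedged illustration of that point: check_for_duplicates() above already accepts the hash constructor as a parameter and forwards it to get_hash(), so switching algorithms is a one-line change at the call site. The snippet below is a sketch using the same command-line arguments as the __main__ block:

import hashlib
import sys

# same two-pass scan, but with SHA-512 for both the 1k and the full-file hashes
check_for_duplicates(sys.argv[1:], hash=hashlib.sha512)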
Negatives:
- More disk reads than a single-pass hash: every file is stat'ed for its size, and each duplicate candidate is opened twice - once for the first-1k hash, and again for the full-content hash.
- The size and hash dictionaries are kept in memory for the whole run, so memory use grows with the number of files scanned.