 

Finding duplicate files by content across multiple directories

I have downloaded some files from the internet related to a particular topic, and I want to check whether any of them are duplicates. The issue is that the file names may differ even when the content matches.

Is there a way to write some code that will iterate through multiple folders and report which files are duplicates?

asked Dec 13 '22 by gagneet

1 Answer

If you are working on Linux/*nix systems, you can use SHA tools such as sha512sum, since MD5 is now considered broken.

find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1]} !($1 in seen){seen[$1]=$2}'
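
Note that awk splits fields on whitespace, so this one-liner will truncate file names that contain spaces; the Python version below does not have that limitation.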

If you want to do it in Python, here is a simple implementation:

import hashlib
import os

def sha(filename):
    '''Return the SHA-512 hex digest of a file, or None if it cannot be read.'''
    d = hashlib.sha512()
    try:
        with open(filename, 'rb') as f:
            # hash in chunks so large files do not have to fit in memory
            for chunk in iter(lambda: f.read(65536), b''):
                d.update(chunk)
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

seen = {}
path = os.path.join("/home", "path1")
for root, dirs, files in os.walk(path):
    for name in files:
        filename = os.path.join(root, name)
        digest = sha(filename)
        if digest is None:
            continue
        if digest not in seen:
            seen[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, seen[digest]))

If you think a matching sha512sum alone is not enough, you can confirm the candidates byte by byte with Unix tools like diff, or with filecmp (Python).
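
For example, a minimal sketch that double-checks two hash-matched candidates with filecmp (the file names here are placeholders):

import filecmp

# shallow=False forces a byte-by-byte comparison instead of only
# comparing the os.stat() signatures (type, size, mtime)
if filecmp.cmp("/home/path1/a.pdf", "/home/path1/b.pdf", shallow=False):
    print("files are identical")
else:
    print("contents differ")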

answered May 16 '23 by ghostdog74