I have 20,000+ files, all in the same directory, with names like these:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames, while ignoring the file extension?
Clarification: by a duplicate I mean a file with the same filename, ignoring the file extension. I do not care whether the contents are 100% identical (e.g. no hashing or anything like that).
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files and check whether each filename already exists. If so, should I then use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory, and wondering whether there is a more Pythonic way of doing things...
Python 2.6 Windows 7
You can do it with O(n) complexity. Solutions based on sorting have O(n*log(n)) complexity.
import os
from collections import namedtuple

directory = '.'  # set to your file directory
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cached_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
newest_files is a dictionary whose keys are file names without extensions and whose values are named tuples holding the full file name and the modification date. When a newly encountered file name is already in the dictionary, its date is compared with the stored one, and the entry is replaced if necessary.
In the end you have a dictionary with the most recent files.
Then you may use this dictionary to perform a second pass. Note that lookup in a dictionary is O(1), so the overall complexity of looking up all n files is O(n).
For example, if you want to keep only the newest file for each name and delete the others, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file_name = newest_files.get(name).file_name
    if file_name != cached_file_name:  # it's not the newest file with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, by adding just one line of code:
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
            os.remove(cached_file.file_name)  # this line added
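Putting it together, here is a minimal self-contained sketch of the single-pass version (the helper name keep_newest is my own, not from the answer above). It uses os.path.join instead of os.chdir so it does not change the working directory, and it adds one extra branch the snippet above does not have: when the file just encountered is itself the older duplicate, it is removed immediately.

```python
import os
from collections import namedtuple

def keep_newest(directory):
    """Single pass over a directory: keep only the newest file per
    base name (filename without extension), deleting older duplicates."""
    Entry = namedtuple('Entry', ['date', 'file_name'])
    newest_files = {}
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        name, ext = os.path.splitext(file_name)
        cached_file = newest_files.get(name)
        this_file_date = os.path.getmtime(path)
        if cached_file is None:
            newest_files[name] = Entry(this_file_date, path)
        elif this_file_date > cached_file.date:
            os.remove(cached_file.file_name)  # stored file is the older one
            newest_files[name] = Entry(this_file_date, path)
        else:
            os.remove(path)  # the file just encountered is the older one
    return newest_files
```

Because both orderings are handled, the result is the same no matter which duplicate os.listdir yields first; on a modification-time tie, the file seen first is kept.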