I have 20,000+ files, all in the same directory, with names like these:
8003825.pdf
8003825.tif
8006826.tif
How does one find all duplicate filenames, while ignoring the file extension?
Clarification: by a duplicate I mean a file with the same filename, ignoring the file extension. I do not care whether the contents are 100% identical (e.g. no hashing or anything like that).
For example:
"8003825" appears twice
Then look at the metadata of each duplicate file and only keep the newest one.
Similar to this post:
Keep latest file and delete all other
I think I have to create a list of all files and check whether each filename already exists. If so, should I then use os.stat to determine the modification date?
I'm a little concerned about loading all those filenames into memory, and wondering whether there is a more Pythonic way of doing things...
Python 2.6 Windows 7
You can do it with O(n) complexity. Solutions based on sorting have O(n*log(n)) complexity.
import os
from collections import namedtuple

directory = '.'  # set to your file directory
os.chdir(directory)

newest_files = {}
Entry = namedtuple('Entry', ['date', 'file_name'])

for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file = newest_files.get(name)
    this_file_date = os.path.getmtime(file_name)
    if cached_file is None:
        newest_files[name] = Entry(this_file_date, file_name)
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
newest_files is a dictionary whose keys are file names without extensions and whose values are named tuples holding the full file name and the modification date. When a newly encountered file name is already in the dictionary, its date is compared with the stored one, and the entry is replaced if necessary.
In the end you have a dictionary with the most recent files.
Then you may use this dictionary to perform a second pass. Note that lookup in a dictionary is O(1), so the overall complexity of looking up all n files is O(n).
For example, if you want to keep only the newest file for each name and delete the others, this can be achieved in the following way:
for file_name in os.listdir(directory):
    name, ext = os.path.splitext(file_name)
    cached_file_name = newest_files.get(name).file_name
    if file_name != cached_file_name:  # it's not the newest file with this name
        os.remove(file_name)
As suggested by Blckknght in the comments, you can even avoid the second pass and delete the older file as soon as you encounter the newer one, by adding just one line of code:
    else:
        if this_file_date > cached_file.date:  # replace with the newer one
            newest_files[name] = Entry(this_file_date, file_name)
            os.remove(cached_file.file_name)  # this line added
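Putting it together, here is a minimal self-contained sketch of the single-pass version (the helper name keep_newest is my own, not from the answer above). It uses os.path.join instead of os.chdir so it does not change the working directory, and it adds one extra branch the snippet above does not have: when the file just encountered is itself the older duplicate, it is removed immediately.

```python
import os
from collections import namedtuple

def keep_newest(directory):
    """Single pass over a directory: keep only the newest file per
    base name (filename without extension), deleting older duplicates."""
    Entry = namedtuple('Entry', ['date', 'file_name'])
    newest_files = {}
    for file_name in os.listdir(directory):
        path = os.path.join(directory, file_name)
        name, ext = os.path.splitext(file_name)
        cached_file = newest_files.get(name)
        this_file_date = os.path.getmtime(path)
        if cached_file is None:
            newest_files[name] = Entry(this_file_date, path)
        elif this_file_date > cached_file.date:
            os.remove(cached_file.file_name)  # stored file is the older one
            newest_files[name] = Entry(this_file_date, path)
        else:
            os.remove(path)  # the file just encountered is the older one
    return newest_files
```

Because both orderings are handled, the result is the same no matter which duplicate os.listdir yields first; on a modification-time tie, the file seen first is kept.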