I have a list of tuples, each containing a filename and a filepath. I want to find duplicate filenames, i.e. tuples whose filename is the same even though the filepath may differ.
Example of a list of tuples:
file_info = [('foo1.txt','/home/fold1'), ('foo2.txt','/home/fold2'), ('foo1.txt','/home/fold3')]
I want to find the duplicate filename, i.e. file_info[2] in the above case, print it, and delete it. I could check iteratively like this:
count = 0
for (filename, filepath) in file_info:
    count = count + 1
    for (filename1, filepath1) in file_info[count:]:
        if filename == filename1:
            print filename1, filepath1
            file_info.remove((filename1, filepath1))
But is there a more efficient, shorter, more correct, or more Pythonic way of accomplishing the same task? Thank you.
Using a set lets you avoid creating a double loop; add items you haven't seen yet to a new list to avoid altering the list you are looping over (which will lead to skipped items):
seen = set()
keep = []
for filename, filepath in file_info:
    if filename in seen:
        print filename, filepath
    else:
        seen.add(filename)
        keep.append((filename, filepath))
file_info = keep
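To see why altering a list while looping over it skips items, here is a minimal sketch. The sample data and the names entries and leftover are made up for illustration, and the snippet is written to run on Python 3, unlike the Python 2 print statements used elsewhere in this thread.

```python
# Hypothetical sample data: two entries share the filename 'foo1.txt'.
entries = [
    ('foo1.txt', '/home/fold1'),
    ('foo1.txt', '/home/fold3'),
    ('foo2.txt', '/home/fold2'),
]

# Anti-pattern: removing from the same list that the for loop is walking.
for entry in entries:
    if entry[0] == 'foo1.txt':
        entries.remove(entry)

# Once index 0 is removed, the second 'foo1.txt' entry slides into index 0
# while the loop moves on to index 1, so that entry is never examined.
leftover = [path for name, path in entries if name == 'foo1.txt']
```

One 'foo1.txt' entry is still in the list afterwards, which is the reason for collecting the kept items in a separate list instead.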
If order doesn't matter and you don't have to print the items you removed, then another approach is to use a dictionary:
file_info = dict(reversed(file_info)).items()
Reversing the input list ensures that the first entry is kept rather than the last, because when dict() sees the same key more than once, the later pair overwrites the earlier one.
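As a rough illustration of the effect of reversing, the sketch below compares both orders on the sample list from the question. The names last_wins, first_wins and deduplicated are only for illustration. The list() call is an addition for Python 3, where .items() returns a view rather than a list.

```python
file_info = [
    ('foo1.txt', '/home/fold1'),
    ('foo2.txt', '/home/fold2'),
    ('foo1.txt', '/home/fold3'),
]

# Later pairs overwrite earlier ones, so the last path per filename survives.
last_wins = dict(file_info)

# Feeding the pairs in reverse makes the original first occurrence
# the final write for each filename.
first_wins = dict(reversed(file_info))

# Wrap in list() to get a real list of (filename, filepath) tuples on Python 3.
deduplicated = list(first_wins.items())
```

Here last_wins maps 'foo1.txt' to '/home/fold3', while first_wins maps it to '/home/fold1'. The order of deduplicated should not be relied on to match the input order.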
If you needed all the full paths for files with duplicates, I'd build a dictionary with lists as values, then remove anything that has only one element:
filename_to_paths = {}
for filename, filepath in file_info:
    filename_to_paths.setdefault(filename, []).append(filepath)
duplicates = {filename: paths
              for filename, paths in filename_to_paths.iteritems()
              if len(paths) > 1}
The duplicates dictionary now only contains filenames where you have more than 1 path in the file_info list.
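For reference, the same grouping could also be sketched for Python 3, where .iteritems() no longer exists and .items() is used instead. This variant swaps setdefault for collections.defaultdict, which is a stylistic alternative rather than a requirement, and it reuses the sample list from the question.

```python
from collections import defaultdict

file_info = [
    ('foo1.txt', '/home/fold1'),
    ('foo2.txt', '/home/fold2'),
    ('foo1.txt', '/home/fold3'),
]

# Group every path under its filename; missing keys start as an empty list.
filename_to_paths = defaultdict(list)
for filename, filepath in file_info:
    filename_to_paths[filename].append(filepath)

# Keep only the filenames that occur with more than one path.
duplicates = {
    filename: paths
    for filename, paths in filename_to_paths.items()
    if len(paths) > 1
}
```

With the sample data, duplicates holds a single key, 'foo1.txt', whose value lists both '/home/fold1' and '/home/fold3'.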