Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient Method of finding common files between two given paths in Python

I have written code to find out common files between two given folders (paths) accounting for all levels of sub folders if present.

Please suggest if there is a more efficient method. It's taking too long if folders with many levels of subfolders are given.

def findCommonDeep(self,path1,path2):
    commonfiles = []

    for (dirpath1, dirname1, filenames1) in os.walk(path1):
        for file in filenames1:
            for (dirpath2, dirname2, filenames2) in os.walk(path2):
                if (file in filenames2 and isfile(join(dirpath2, file))):
                        commonfiles.append(file)

    print(commonfiles)

and calling this function with paths, like shown below:

findCommonDeep("/home/naseer/Python", "/home/naseer/C")

I understand that if I store a list of all files for any given path the speed of execution can be reduced. but I guess that will run out of memory. Please guide me in approaching to this more efficiently.

like image 411
Naseer Mohammad Avatar asked Aug 01 '18 04:08

Naseer Mohammad


Video Answer


1 Answers

You can use generator expression to transform the output of os.walk into two sets and use set intersection to efficiently identify the common paths.

import os
def findCommonDeep(path1, path2):
    files1 = set(os.path.relpath(os.path.join(root, file), path1) for root, _, files in os.walk(path1) for file in files)
    files2 = set(os.path.relpath(os.path.join(root, file), path2) for root, _, files in os.walk(path2) for file in files)
    return files1 & files2

To reduce code duplication in the code above, you can use another list comprehension:

import os
def findCommonDeep(path1, path2):
    return set.intersection(*(set(os.path.relpath(os.path.join(root, file), path) for root, _, files in os.walk(path) for file in files) for path in (path1, path2)))

And if you're looking for just the common file names rather than common path names, you can make the generator expression output just the file names instead:

def findCommonDeep(path1, path2):
    return set.intersection(*(set(file for _, _, files in os.walk(path) for file in files) for path in (path1, path2)))

This is more efficient because it takes advantage of Python's set intersection operation, which has an average time complexity of O(min(len(n), len(m)), whereas your code with 2 nested loops always takes O(n^2).

like image 129
blhsing Avatar answered Oct 14 '22 07:10

blhsing