Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Construct a tree from list os file paths (Python) - Performance dependent

Hey I am working on a very high performance file-managing/analyzing toolkit written in python. I want to create a function that gives me a list or something like that in a tree format. Something like in this question (java-related)

From:

dir/file
dir/dir2/file2
dir/file3
dir3/file4
dir3/file5

Note: the list of paths is unsorted

To:

dir/
    file
    dir2/
        file2
    file3
dir3/
    file4
    file5

[[dir, [file, [dir2, [file2]], file3]], [dir3, [file4, file5]]]

something along those lines. I've been playing around with some ideas but none of them provided the speed that I would like to have.

Note: I do already have the list of paths, so no worrying about that. The function takes paths list and gives tree list.

Thanks in Advance

like image 764
cwoebker Avatar asked Dec 13 '11 05:12

cwoebker


1 Answers

Now that you clarified the question a bit more, I guess the following is what you want:

from collections import defaultdict

input_ = '''dir/file
dir/dir2/file2
dir/file3
dir2/alpha/beta/gamma/delta
dir2/alpha/beta/gamma/delta/
dir3/file4
dir3/file5'''

FILE_MARKER = '<files>'

def attach(branch, trunk):
    '''
    Insert a branch of directories on its trunk.
    '''
    parts = branch.split('/', 1)
    if len(parts) == 1:  # branch is a file
        trunk[FILE_MARKER].append(parts[0])
    else:
        node, others = parts
        if node not in trunk:
            trunk[node] = defaultdict(dict, ((FILE_MARKER, []),))
        attach(others, trunk[node])

def prettify(d, indent=0):
    '''
    Print the file tree structure with proper indentation.
    '''
    for key, value in d.iteritems():
        if key == FILE_MARKER:
            if value:
                print '  ' * indent + str(value)
        else:
            print '  ' * indent + str(key)
            if isinstance(value, dict):
                prettify(value, indent+1)
            else:
                print '  ' * (indent+1) + str(value)



main_dict = defaultdict(dict, ((FILE_MARKER, []),))
for line in input_.split('\n'):
    attach(line, main_dict)

prettify(main_dict)

It outputs:

dir3
  ['file4', 'file5']
dir2
  alpha
    beta
      gamma
        ['delta']
        delta
          ['']
dir
  dir2
    ['file2']
  ['file', 'file3']

A few thing to note:

  • The script make heavy use of defaultdicts, basically this allows to skip checking for the existence of a key and its initialisation if it is not there
  • Directory names are mapped to dictionary keys, I thought this might be a good feature for you, as key are hashed and you will able to retrieve information much faster this way than with lists. You can access the hierarchy in the form main_dict['dir2']['alpha']['beta']...
  • Note the difference between .../delta and .../delta/. I thought this was helpful for you to be able to quickly differenciate between your leaf being a directory or a file.

I hope this answers your question. If anything is unclear, post a comment.

like image 74
mac Avatar answered Sep 19 '22 01:09

mac