Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Representing a set of URLs in a list as a tree structure

Tags:

python

I have a list of dicts that stores URLs. It has simply two fields, title and url. Example:

[
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

However, I'd to get a tree structure from this list of dicts. I'm looking for something like this:

{ 'www.example.com': 
  [ 
    { 'something': 
      [ 
        { 'thisthing':
          [
            { 'title': 'Detail Page', 'url': 'detail.htm'}
          ]
        },
        [
          { 'title': 'Index Page', 'url': 'index.htm'},
          { 'title': 'Other Page', 'url': 'other.htm'}
        ]
      ]
    },
    { 'thatthing': 
      [ 
        { 'title': 'About Page', 'url': 'about.htm'}
      ]
    }
  ]
}

My first attempt at this would be a urlparse soup in a bunch of for loops and I'm confident that there's a better and faster way to do this.

I've seen people on SO work magic with list comprehensions, lambda functions, etc. I'm still in the process of figuring it out.

(For Django developers: I'll be using this my Django application. I'm storing the URLs in a model called Page which has two fields name and title)

like image 836
Mridang Agarwalla Avatar asked Oct 17 '11 13:10

Mridang Agarwalla


Video Answer


1 Answers

Third time is the charm... that is some nice structure you have there :). In your comment you mention that you "haven't been able to think of a better tree format to represent data like this"... this made me again take the liberty to (just slightly) alter the formatting of the output. In order to dynamically add subelements, a dictionary has to be created to house them. But for "leaf nodes", this dictionary is never filled in. If desired these can of course be removed by another loop, but it cannot happen during the iteration because the empty dict should be present for possible new nodes. The some goes for nodes that do not have a file in the: these will contain an empty list.

ll = [
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

# First build a list of all url segments: final item is the title/url dict
paths = []
for item in ll:
    split = item['url'].split('/')
    paths.append(split[2:-1])
    paths[-1].append({'title': item['title'], 'url': split[-1]})

# Loop over these paths, building the format as we go along
root = {}
for path in paths:
    branch = root.setdefault(path[0], [{}, []])
    for step in path[1:-1]:
        branch = branch[0].setdefault(step, [{}, []])
    branch[1].append(path[-1])

# As for the cleanup: because of the alternating lists and
# dicts it is a bit more complex, but the following works:
def walker(coll):
    if isinstance(coll, list):
        for item in coll:
            yield item
    if isinstance(coll, dict):
        for item in coll.itervalues():
            yield item

def deleter(coll):
    for data in walker(coll):
        if data == [] or data == {}:
            coll.remove(data)
        deleter(data)

deleter(root)

import pprint
pprint.pprint(root)

Output:

{'www.example.com':
    [
        {'something':
            [
                {'thisthing':
                    [
                        [
                            {'title': 'Detail Page', 'url': 'detail.htm'}
                        ]
                    ]
                },
                [
                    {'title': 'Index Page', 'url': 'index.htm'},
                    {'title': 'Other Page', 'url': 'other.htm'}
                ]
            ],
         'thatthing':
            [
                [
                    {'title': 'About Page', 'url': 'about.htm'}
                ]
            ]
        },
    ]
}
like image 62
jro Avatar answered Oct 28 '22 02:10

jro