
Algorithm for sorting unordered list to tree structure

I am looking for an algorithm to sort an unordered list of items into a tree structure, using as few "is child" comparison operations as possible.

A little background on my specific case follows, but I guess I am just looking for a generic algorithm of a kind I have been unable to find (it is a hard search term to refine).

I have an unordered list of contours, which are simply lists of coordinates that describe closed polygons. I want to create a tree that represents the relationship between these contours, such that the outermost is the root, with each contour at the next level as children, and so forth. So a tree structure with zero-to-many children per node.

A key requirement of the algorithm is that tests to determine whether or not a contour is the child of another are kept to a minimum, as this operation is very expensive. Contours can (and often do) share many vertices, but should not intersect. These shared vertices usually arise where map limits are reached - picture a set of concentric semicircles against the straight edge of a map. The point-in-poly test is slow if I need to run through many point-on-line checks before I get to a definitive answer.
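For reference, the point-in-polygon test in question is the standard ray-casting one; a minimal sketch (illustrative only, not my actual code) looks like this:

```csharp
using System;
using System.Collections.Generic;

// Minimal ray-casting point-in-polygon sketch (illustrative, not my actual
// code). It counts crossings of a horizontal ray from (px, py) to +infinity;
// an odd crossing count means the point is inside.
static bool PointInPolygon(double px, double py, IList<(double x, double y)> poly)
{
    bool inside = false;
    for (int i = 0, j = poly.Count - 1; i < poly.Count; j = i++)
    {
        // Does edge (j, i) straddle the ray's y, and does the ray cross it?
        if ((poly[i].y > py) != (poly[j].y > py) &&
            px < (poly[j].x - poly[i].x) * (py - poly[i].y) /
                 (poly[j].y - poly[i].y) + poly[i].x)
        {
            inside = !inside;
        }
    }
    return inside;
}
```

A cheap bounding-box rejection before entering this loop is what keeps the test fast in the common (non-containing) case; the loop itself is where the slow point-on-line handling bites.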

Here's the algorithm I have come up with so far. It's pretty naive, no doubt, but it works. There are probably some heuristics that may help - a contour is only likely to be a child of another contour with a depth within a certain range, for example - but I want to nail the basic algorithm first. The first red flag is that it is quadratic in the number of contours, with an expensive test at each step.

for each candidate_contour in all_contours
    for each contour in all_contours
        // note: already_contains means "is direct or indirect child of"
        if contour == candidate_contour or contour.already_contains(candidate_contour)
            continue
        else
            list contours_to_check
            contours_to_check.add(candidate_contour)

            contour parent_contour = candidate_contour.parent
            while (parent_contour != null)
                contours_to_check.add(parent_contour)
                parent_contour = parent_contour.parent

            for each possible_move_candidate in contours_to_check (REVERSE ITERATION)
                if possible_move_candidate is within contour
                    // moving a contour moves the contour and all of its children
                    move possible_move_candidate to direct child of contour
                    break

So that works - or at least it seems to - but it gets very slow with a non-trivial number of contours (think several hundred to possibly several thousand).

Is there a fundamentally better way to do this, or indeed - are there known algorithms that deal with exactly this? As mentioned before - the key in my case is to keep the "is contour within" comparisons to a minimum.

Edit to add solution based on Jim's answer below - thanks Jim!

This is the first iteration, which produced a good (10x) improvement - see below for iteration 2. This code is more than 10x faster than my original algorithm once the contour set becomes non-trivially big. The image below is now sorted in a couple of seconds (versus 30-odd seconds before), and rendered in order. I think there is room to improve further with some added heuristics - for example, now that the original list is sorted by area, each new candidate has to end up as a leaf node somewhere in the tree. The difficulty is determining which branches to traverse to test the existing leaves - if there are many branches/leaves, it is probably still quicker to cut the search space down by examining the branches at the top. Something more to think about!

    public static iso_area sort_iso_areas(List<iso_area> areas, iso_area root)
    {
        if (areas.Count == 0)
            return null;

        areas.Sort(new iso_comparer_descending());

        foreach (iso_area area in areas)
        {
            if (root.children.Count == 0)
                root.children.Add(area);
            else
            {
                bool found_child = false;
                foreach (iso_area child in root.children)
                {
                    // check if this iso_area is within the child
                    // if it is, follow the children down to find the insertion position
                    if (child.try_add_child(area))
                    {
                        found_child = true;
                        break;
                    }   
                }
                if (!found_child)
                    root.children.Add(area);
            }
        }
        return root;
    }

    // try and add the provided child to this area
    // if it fits, try adding to a subsequent child
    // keep trying until failure - then add to the previous child in which it fitted
    bool try_add_child(iso_area area)
    {
        if (within(area))
        {
            // do a recursive search through all children
            foreach (iso_area child in children)
            {
                if (child.try_add_child(area))
                    return true;
            }
            area.move_to(this);
            return true;
        }
        else
            return false;
    }

[Image: ordered contours, rendered in sort order]

Iteration two - comparing against existing leaves only

Following on from my earlier thought that new contours could only fit into existing leaves, it struck me that in fact this would be much quicker as the poly in poly test would fail at the first bounds check for all leaves other than the target leaf. The first solution involved traversing a branch to find the target where, by definition, each and every poly along the way would pass the bounds check, and involve a full poly-in-poly test until no further leaves were found.

Following Jim's comment and a re-examination of the code, the second solution did not work, unfortunately. I'm wondering if there may still be some merit in looking at lower elements in the tree before the branches, as the poly-in-poly test should fail quickly, and once you find a leaf that accepts the candidate, you know that no more polys need to be examined.

Iteration two revisited

Although it is not the case that contours can only fit into leaves, it is the case that they almost always do - and also that they will usually fit into a recent predecessor in the ordered list of contours. This final updated code is the fastest yet and ditches the tree traversal completely. It simply walks backwards through the recent larger polygons and tries each one - polys from other branches will likely fail the poly-in-poly test at the bounds check, and the first poly found that surrounds the candidate poly has to be the immediate parent, due to the prior sorting of the list. This code brings the sorting down into the millisecond range and is about 5x faster than the tree traversal (significant speed improvements were also made to the poly-in-poly test, which accounts for the rest of the speed-up). The root is now taken from the sorted list of areas; note that I now supply a root in the list that I know encompasses all the contours (the bounding box of them all).

Thanks for the help - and Jim in particular - for helping me think about this problem. The key really was the original sorting of the contours into a list in which it was guaranteed that no contour could be a child of a later contour in the list.

    public static iso_area sort_iso_areas(List<iso_area> areas)
    {
        if (areas.Count == 0)
            return null;

        areas.Sort(new iso_comparer_descending());

        for (int i = 0; i < areas.Count; ++i)
        {
            for (int j = i - 1; j >= 0; --j)
            {
                if (areas[j].try_add_child(areas[i]))
                    break;    
            }              
        }
        return areas[0];
    }

My original attempt: 133 s
Iteration 1 (traverse branch to find leaf): 9 s
Iteration 2 (walk backwards through contours in ascending size order): 25 ms (with other point-in-poly improvements also)

Asked by Matt, Apr 11 '26 13:04


1 Answer

I did something similar a while back by first sorting by area.

If polygon B is contained within polygon A, then the bounding box for polygon A has to be larger than the bounding box for polygon B. More to the point, if you specify the bounding box as ((x1, y1), (x2, y2)), then:

A.x1 < B.x1
A.y1 < B.y1
A.x2 > B.x2
A.y2 > B.y2

(Those relationships might be <= and >= if polygons can share edges or vertices.)
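In code, that bounding-box pre-test might look like the following (a sketch with illustrative names, not something the answer prescribes), using the non-strict comparisons since these contours can share vertices:

```csharp
using System;

// Bounding box as ((x1, y1), (x2, y2)): lower-left and upper-right corners.
// Illustrative sketch only; the type and field names are not from the answer.
struct Bounds
{
    public double x1, y1, x2, y2;

    // A necessary (but not sufficient) condition for polygon containment,
    // so this works as a cheap pre-filter before the full poly-in-poly test.
    // Non-strict comparisons because contours may share edges/vertices.
    public bool Contains(Bounds other) =>
        x1 <= other.x1 && y1 <= other.y1 &&
        x2 >= other.x2 && y2 >= other.y2;
}
```

Only when this cheap test passes is the expensive polygon-in-polygon test worth running.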

So the first thing you should do is compute the bounding boxes and sort the polygons by bounding box area, descending (so the largest is first).

Create a structure that is essentially a polygon and a list of its children:

PolygonNode
{
    Polygon poly
    PolygonNode[] Children
}

So you start out with your polygons sorted by bounding box area, descending, and an initially empty list of PolygonNode structures:

Polygon[] sortedPolygons
PolygonNode[] theTree

Now, starting from the first member of sortedPolygons, which is the polygon with the largest area, check to see if it's a child of any of the top-level members of theTree. If it's not, add the polygon to theTree. The bounding boxes help here because you don't have to do the full polygon-in-polygon test if the bounding box test fails.

If it is a child of a node, then check to see if it's a child of one of that node's children, and follow the child chain down until you find the insertion spot.

Repeat that for every polygon in sortedPolygons.
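The insertion walk just described can be sketched roughly like this (illustrative names throughout; Contains here is a bounds-only stand-in for "bounds pre-check, then full polygon-in-polygon test"):

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the insertion walk described above (names are not
// from the answer). Polygon.Contains is a bounds-only stand-in for the real
// "bounds pre-check, then full polygon-in-polygon" containment test.
static void Insert(List<PolygonNode> siblings, Polygon poly)
{
    // Follow the child chain down: if some node contains the polygon,
    // descend into its children; otherwise it belongs at this level.
    foreach (var node in siblings)
    {
        if (node.Poly.Contains(poly))
        {
            Insert(node.Children, poly);
            return;
        }
    }
    // Because polygons arrive in descending area order, a new polygon can
    // never contain an existing node, so adding it as a leaf here is safe.
    siblings.Add(new PolygonNode { Poly = poly });
}

class Polygon
{
    public double x1, y1, x2, y2;   // bounding box only, as a stand-in
    public bool Contains(Polygon o) =>
        x1 <= o.x1 && y1 <= o.y1 && x2 >= o.x2 && y2 >= o.y2;
}

class PolygonNode
{
    public Polygon Poly;
    public List<PolygonNode> Children = new List<PolygonNode>();
}
```

The descending-area sort is what makes the simple "append as leaf" step correct: no later polygon can ever be a parent of an earlier one.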

Worst case, that algorithm is O(n^2), which will happen if there are no parent/child relationships. But assuming that there are many nested parent/child relationships, the search space gets cut down very quickly.

You can probably speed it up somewhat by ordering the theTree list and the child node lists by position. You could then use a binary search to more quickly locate the potential parent for a polygon. Doing so complicates things a little bit, but it might be worthwhile if you have a lot of top-level polygons. I wouldn't add that optimization on the first cut, though. It's quite possible that the version I outlined using sequential search will be plenty fast enough.

Edit

Understanding the nature of the data helps. I didn't realize when I wrote my original response that, given the sorted list of polygons, the normal case is that p[i] is a child of p[i-1], which is a child of p[i-2], etc. Your comments indicate that it's not always the case, but it is very often.

Given that, perhaps you should make a simple modification to your implementation so that you save the last polygon and check it first rather than starting in with the tree. So your loop looks something like this:

    iso_area last_area = null;    // <============
    foreach (iso_area area in areas)
    {
        if (root.children.Count == 0)
            root.children.Add(area);
        else if (!last_area.try_add_child(area))  // <=======
        {
            bool found_child = false;
            foreach (iso_area child in root.children)
            {
                // check if this iso_area is within the child
                // if it is, follow the children down to find the insertion position
                if (child.try_add_child(area))
                {
                    found_child = true;
                    break;
                }   
            }
            if (!found_child)
                root.children.Add(area);
        }
        last_area = area;      // <============
    }
    return root;

If the data is as you said, then this optimization should help quite a bit because it eliminates a bunch of searching through the tree.

Answered by Jim Mischel, Apr 13 '26 11:04


