Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient data structure keeping objects sorted on multiple keys

I have a python program where I am using a priority queue to keep track of objects to be processed. At the moment the queue is implemented using a SortedList, which is working very well.

However, I need to extend this code, so that the list is kept sorted on multiple keys. Somewhat like an SQL database with indexes on multiple columns, so that I can efficiently access, add and delete objects on all keys. My workload is add/delete heavy. To give an idea of what I want to do, here is some pseudo code:

ml = MultiSortedList()
ml.append((1, "z", 1.5), a)
ml.append((2, "a", 40.0), b)
ml.append((3, "f", 0.5), c)

print(ml.sorted(0))
   [((1, "z", 1.5), a),
   ((2, "a", 40.0), b),
   ((3, "f", 0.5), c),]

print(ml.sorted(2))
   [((3, "f", 0.5), c),
   ((1, "z", 1.5), a),
   ((2, "a", 40.0), b)]

print(ml.sorted(2).pop(1)
   (1, "z", 1.5), a)

print(ml.sorted(0))
   [((2, "a", 40.0), b),
   ((3, "f", 0.5), c)]

I can't quite figure out how to do this efficiently. Of course I could sort the list again for each access to a different column, but that is simply too expensive. Also, the O(n) delete operations on python lists become painful, as the list may contain several thousands of objects.

Is there an existing data structure (preferably already implemented in python) which solves this problem? If not, can you help me with an outline for how to implement this efficiently?

like image 417
pehrs Avatar asked Mar 18 '15 09:03

pehrs


2 Answers

Instead of using a sorted list as the implementation of your priority queue, you should be using a heap. A binary heap, for example, has log(n) insertion and deletion. A Fibonacci heap will have O(1) insertion.

You have a implementation inside the standard library: the heapq module.

For your use case, you'll need to keep multiple heaps: one for each sorting order. For clarity of implementation, you may want to hold your original data on a dictionary (possibly using random or incrementing keys) and only keep its keys on the heaps.

Using this technique, you can easily get O(1) insertion and O(log(n)) deletion. You don't really specify what kind of accesses you'll be having, but if you need random access, you can use binary search on the appropriate heap to get O(log(n))

like image 60
loopbackbee Avatar answered Oct 09 '22 23:10

loopbackbee


One approach is to maintain n list internally, one for each sorting order, and to add/remove item for each of them. This way, you "only" multiply the manipulation time of your data structure by a constant value (i.e. if using 3 keys instead of one, then got 3 log(n) instead of log(n) to delete/insert an element).

The concept I imagine behind such implementation is the java's Comparator one.

Create for each key a comparator method to use to sort them, and then use it at insert and deleting time.

It would work as follow :

class SortedList(list):
    def __init__(self, comparator = None):
        list.__init__(self)

        self.comparator = comparator

    def add(self, element):
        """ Adds the element into the sorted list.
        If no comparator where provided at initialisation, then try to compare
        the element to item already stored in the list using standard __gt__ 
        and __eq__.

        Argument :
            element : the element to insert in the list

        /!\ NB : A more inteligent implementation should be used, such as binary search... /!\
        """
        index = 0
        while (index < len(self)):
            if self.isGreatterThan(element, self[index]):
                index += 1
            else:
                break

        self.insert(index, element)

    def remove(self, element):
        """ Same as previous, a better approach is possible that use binary search """
        list.remove(self, element)


    def isGreatterThan(self, element, otherElement):
        """ Compare if element is greater than other element. """
        if self.comparator != None:
            return self.comparator(element, otherElement)
        else:
            return element.__gt__(otherElement)


class MultipleKeysObjectContainer():
    def __init__(self, comparators):
        #register the comparators
        self.comparators = comparators
        #create a List for each comparator
        self.data = {}
        for key in self.comparators.keys():
            self.data[key] = SortedList(comparator = self.comparators[key])

    def add(self, element):
        for key in self.data.keys():
            self.data[key].add(element)

    def __repr__(self):
        return "<MultipleKeysObjectContainer :"+self.data.__repr__()+">"

    def __str__(self):
        result = "MultipleKeysObjectContainer{\n"
        for key in self.data.keys():
            result += "\tOrder by : "+key+"{\n"
            for element in self.data[key]:
                result += "\t\t" + str(element) + "\n"
            result += "\t}\n"
        result += "}"
        return result

    def popOrderedBy(self, key, position):
        """ pop the item a the position in the list of item ordered by the key.
        Remove also from other data containers. """
        item = self.data[key].pop(position)
        for dataKey in self.data.keys():
            if dataKey != key:
                self.data[dataKey].remove(item)

        return item



if __name__ == "__main__":
    a = SortedList(lambda x,y : x[0][0] > y[0][0])

    item1 = ((1, "z", 1.5),"foo")
    item2 = ((2, "a", 40.0), "bar")
    item3 = ((3, "f", 0.5), "barfoo")

    a.add(item1)
    a.add(item3)
    a.add(item2)
    print("Example of sorted list")
    print(a)
    a.remove(item3)
    print("The same list without the barfoo")
    print(a)


    b = MultipleKeysObjectContainer({"id": (lambda x,y : x[0][0] > y[0][0]), "letter": (lambda x,y : x[0][1] > y[0][1] ), "value":(lambda x,y : x[0][2] > y[0][2])})

    b.add(item1)
    b.add(item3)
    b.add(item2)
    print("A multiple key container, object are ordered according three criterion.")
    print(b)
    print("Remove the first item if items are ordered by letter", b.popOrderedBy("letter", 0))
    print("After this removing the container contains :")
    print(b)

This results in :

Example of sorted list
[((1, 'z', 1.5), 'foo'), ((2, 'a', 40.0), 'bar'), ((3, 'f', 0.5), 'barfoo')]
The same list without the barfoo
[((1, 'z', 1.5), 'foo'), ((2, 'a', 40.0), 'bar')]
A multiple key container, object are ordered according three criterion.
MultipleKeysObjectContainer{
    Order by : id{
        ((1, 'z', 1.5), 'foo')
        ((2, 'a', 40.0), 'bar')
        ((3, 'f', 0.5), 'barfoo')
    }
    Order by : value{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
        ((2, 'a', 40.0), 'bar')
    }
    Order by : letter{
        ((2, 'a', 40.0), 'bar')
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
}
Remove the first item if items are ordered by letter ((2, 'a', 40.0), 'bar')
After this removing the container contains :
MultipleKeysObjectContainer{
    Order by : id{
        ((1, 'z', 1.5), 'foo')
        ((3, 'f', 0.5), 'barfoo')
    }
    Order by : value{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
    Order by : letter{
        ((3, 'f', 0.5), 'barfoo')
        ((1, 'z', 1.5), 'foo')
    }
}

That look like you are looking for ( almost, just have to add binary search :p ) Good luck!

like image 31
Arthur Vaïsse Avatar answered Oct 09 '22 22:10

Arthur Vaïsse