Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ideal data structure with fast lookup, fast update and easy comparison/sorting

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:

  • Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
  • If it is, update the timestamp to "now"
  • If not, add it to the set with timestamp "now"

Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime object, or some other easy-to-compare hash/string).

I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.

Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).

Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.

like image 786
Christian P. Avatar asked Oct 09 '13 15:10

Christian P.


People also ask

What are the different types of searching in data structure?

Searching in Data Structure. 1 1. Sequential Search. This is the traditional technique for searching an element in a collection of elements. In this type of search, all the elements ... 2 2. Binary Search. 3 1. Set LOC = -1,i=1. 4 2. Repeat while DATA [i] != ITEM: 5 3. If i=N+1 ,then Set LOC =0. More items

Do I need a sorted data structure for a binary search?

If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O (1) lookups. Show activity on this post. Binary search tree - it contains sorted data so adding new elements is costly (O (log n) I think).

How to optimize your data structure?

There are algorithms used with specific data structure, where some other can’t be used. The more efficient & suitable the algorithm, the more you will have an optimized data structure. Chances are, you’re going to rely on the built-in algorithms used with the data structures in your language.

Why are there different algorithms for different data structures?

Each data structure has it’s own different way, or different algorithm for sorting, inserting, finding, …etc. This is due to the nature of the data structure. There are algorithms used with specific data structure, where some other can’t be used. The more efficient & suitable the algorithm, the more you will have an optimized data structure.


Video Answer


4 Answers

The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).

By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.

Figuring out the details (and implementing them) is left as an exercise for the reader ...


Caveats:

  • worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
  • any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp

A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.

Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.

like image 44
Useless Avatar answered Oct 17 '22 04:10

Useless


This is a pretty well trod space. The thing you need is two structures, You need something to tell you wether your key (hash in your case) is known to the collection. For this, dict is a very good fit; we'll just map the hash to the timestamp so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to Heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap, as a tuple of (timestamp, hash).

Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by just ignoring entries in the heap that have timestamps that are dissimilar from the value in the dict.

So here's a place to start, you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:

import heapq


class ExpiringCache(object):
    def __init__(self):
        self._dict = {}
        self._heap = []

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()

create a cache and add some items

>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)

re-add an item with an even later expiry

>>> xc.add('apples', 4)

collect everything "older" than two time units

>>> xc.collect(2)    
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
like image 72
SingleNegationElimination Avatar answered Oct 17 '22 06:10

SingleNegationElimination


Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.

This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.

Edit:

For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.

like image 2
John Y Avatar answered Oct 17 '22 05:10

John Y


If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)

http://en.wikipedia.org/wiki/Bloom_filter

and a python implementation: https://github.com/axiak/pybloomfiltermmap

EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.

like image 1
Cameron Sparr Avatar answered Oct 17 '22 04:10

Cameron Sparr