I am looking for a good data structure to contain a list of tuples with <code>(hash, timestamp)</code> values. Basically, I want to use it in the following way: <ul> <li>Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).</li> <li>If it is, update the timestamp to "now"</li> <li>If not, add it to the set with timestamp "now"</li> </ul> Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python <code>datetime</code> object, or some other easy-to-compare hash/string). I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes. Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values). Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.

The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key). By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree. Figuring out the details (and implementing them) is left as an exercise for the reader ... <hr> Caveats: <ul> <li>worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash</li> <li>any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp</li> </ul> <hr> A simpler approach is to store an object containing <code>(hash, timestamp, prev, next)</code> in a regular dict, using <code>prev</code> and <code>next</code> to keep an up-to-date doubly-linked list. Then all you need alongside the dict are <code>head</code> and <code>tail</code> references. Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.

Unless I'm misreading your question, a plain old <code>dict</code> should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold <code>(timestamp, hash)</code> pairs. This supplemental data structure could either be a plain old <code>list</code> or a <code>deque</code> (from the <code>collections</code> module). Possibly the <code>bisect</code> module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing. Edit: For Python 2.7 or 3.1+, you could also consider using <code>OrderedDict</code> (from the <code>collections</code> module). This is basically a <code>dict</code> with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with <code>del</code>) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of <code>(timestamp, hash)</code> pairs yourself; when it comes time to purge, you can just iterate straight through the <code>OrderedDict</code>, deleting entries until you reach one with a timestamp that is later than your cut-off.

If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast) http://en.wikipedia.org/wiki/Bloom_filter and a python implementation: https://github.com/axiak/pybloomfiltermmap EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.

Ideal data structure with fast lookup, fast update and easy comparison/sorting

Tags:

python

data-structures

I am looking for a good data structure to contain a list of tuples with (hash, timestamp) values. Basically, I want to use it in the following way:

Data comes in, check to see if it's already present in the data structure (hash equality, not timestamp).
If it is, update the timestamp to "now"
If not, add it to the set with timestamp "now"

Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime object, or some other easy-to-compare hash/string).

I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.

Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).

Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.

786

asked Oct 09 '13 15:10

Christian P.

Video Answer

4 Answers

The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).

By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.

Figuring out the details (and implementing them) is left as an exercise for the reader ...

Caveats:

worst case height - and therefore complexity - is linear. This shouldn't occur with a decent hash
any read-only operations (ie, lookups that don't update the timestamp) will disrupt the relationship between splay tree layout and timestamp

A simpler approach is to store an object containing (hash, timestamp, prev, next) in a regular dict, using prev and next to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head and tail references.

Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.

answered Oct 17 '22 04:10

Useless

This is a pretty well trod space. The thing you need is two structures, You need something to tell you wether your key (hash in your case) is known to the collection. For this, dict is a very good fit; we'll just map the hash to the timestamp so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to Heaps, which are provided by the heapq module. Each time we see a key, we'll just add it to our heap, as a tuple of (timestamp, hash).

Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by just ignoring entries in the heap that have timestamps that are dissimilar from the value in the dict.

So here's a place to start, you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:

import heapq


class ExpiringCache(object):
    def __init__(self):
        self._dict = {}
        self._heap = []

    def add(self, key, expiry):
        self._dict[key] = expiry
        heapq.heappush(self._heap, (expiry, key))

    def contains(self, key):
        return key in self._dict

    def collect(self, maxage):
        while self._heap and self._heap[0][0] <= maxage:
            expiry, key = heapq.heappop(self._heap)
            if self._dict.get(key) == expiry:
                del self._dict[key]

    def items(self):
        return self._dict.items()

create a cache and add some items

>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)

re-add an item with an even later expiry

>>> xc.add('apples', 4)

collect everything "older" than two time units

>>> xc.collect(2)    
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False

answered Oct 17 '22 06:10

SingleNegationElimination

Unless I'm misreading your question, a plain old dict should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash) pairs.

This supplemental data structure could either be a plain old list or a deque (from the collections module). Possibly the bisect module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.

Edit:

For Python 2.7 or 3.1+, you could also consider using OrderedDict (from the collections module). This is basically a dict with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash) pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict, deleting entries until you reach one with a timestamp that is later than your cut-off.

answered Oct 17 '22 05:10

John Y

If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)

http://en.wikipedia.org/wiki/Bloom_filter

and a python implementation: https://github.com/axiak/pybloomfiltermmap

EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.

answered Oct 17 '22 04:10

Cameron Sparr

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Ideal data structure with fast lookup, fast update and easy comparison/sorting

Tags:

python

data-structures

Christian P.

People also ask

Video Answer

4 Answers

Useless

SingleNegationElimination

John Y

Cameron Sparr

Recent Activity

Donate For Us

Ideal data structure with fast lookup, fast update and easy comparison/sorting

Tags:

python

data-structures

Christian P.

People also ask

Video Answer

4 Answers

Useless

SingleNegationElimination

John Y

Cameron Sparr

Related questions

Recent Activity

Donate For Us