I am looking for a good data structure to contain a list of tuples with (hash, timestamp)
values. Basically, I want to use it in the following way:
Periodically, I wish to remove and return a list of tuples that older than a specific timestamp (I need to update various other elements when they 'expire'). Timestamp does not have to be anything specific (it can be a unix timestamp, a python datetime
object, or some other easy-to-compare hash/string).
I am using this to receive incoming data, update it if it's already present and purge data older than X seconds/minutes.
Multiple data structures can be a valid suggestion as well (I originally went with a priority queue + set, but a priority queue is less-than-optimal for constantly updating values).
Other approaches to achieve the same thing are welcome as well. The end goal is to track when elements are a) new to the system, b) exist in the system already and c) when they expire.
Searching in Data Structure. 1 1. Sequential Search. This is the traditional technique for searching an element in a collection of elements. In this type of search, all the elements ... 2 2. Binary Search. 3 1. Set LOC = -1,i=1. 4 2. Repeat while DATA [i] != ITEM: 5 3. If i=N+1 ,then Set LOC =0. More items
If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O (1) lookups. Show activity on this post. Binary search tree - it contains sorted data so adding new elements is costly (O (log n) I think).
There are algorithms used with specific data structure, where some other can’t be used. The more efficient & suitable the algorithm, the more you will have an optimized data structure. Chances are, you’re going to rely on the built-in algorithms used with the data structures in your language.
Each data structure has it’s own different way, or different algorithm for sorting, inserting, finding, …etc. This is due to the nature of the data structure. There are algorithms used with specific data structure, where some other can’t be used. The more efficient & suitable the algorithm, the more you will have an optimized data structure.
The closest I can think of to a single structure with the properties you want is a splay tree (with your hash as the key).
By rotating recently-accessed (and hence updated) nodes to the root, you should end up with the least recently-accessed (and hence updated) data at the leaves or grouped in a right subtree.
Figuring out the details (and implementing them) is left as an exercise for the reader ...
Caveats:
A simpler approach is to store an object containing (hash, timestamp, prev, next)
in a regular dict, using prev
and next
to keep an up-to-date doubly-linked list. Then all you need alongside the dict are head
and tail
references.
Insert & update are still constant time (hash lookup + linked-list splice), and walking backwards from the tail of the list collecting the oldest hashes is linear.
This is a pretty well trod space. The thing you need is two structures, You need something to tell you wether your key (hash
in your case) is known to the collection. For this, dict
is a very good fit; we'll just map the hash
to the timestamp
so you can look up each item easily. Iterating over the items in order of timestamp is a task particularly suited to Heaps, which are provided by the heapq
module. Each time we see a key, we'll just add it to our heap, as a tuple of (timestamp, hash)
.
Unfortunately there's no way to look into a heapified list and remove certain items (because, say, they have been updated to expire later). We'll get around that by just ignoring entries in the heap that have timestamps that are dissimilar from the value in the dict.
So here's a place to start, you can probably add methods to the wrapper class to support additional operations, or change the way data is stored:
import heapq
class ExpiringCache(object):
def __init__(self):
self._dict = {}
self._heap = []
def add(self, key, expiry):
self._dict[key] = expiry
heapq.heappush(self._heap, (expiry, key))
def contains(self, key):
return key in self._dict
def collect(self, maxage):
while self._heap and self._heap[0][0] <= maxage:
expiry, key = heapq.heappop(self._heap)
if self._dict.get(key) == expiry:
del self._dict[key]
def items(self):
return self._dict.items()
create a cache and add some items
>>> xc = ExpiringCache()
>>> xc.add('apples', 1)
>>> xc.add('bananas', 2)
>>> xc.add('mangoes', 3)
re-add an item with an even later expiry
>>> xc.add('apples', 4)
collect everything "older" than two time units
>>> xc.collect(2)
>>> xc.contains('apples')
True
>>> xc.contains('bananas')
False
Unless I'm misreading your question, a plain old dict
should be ideal for everything except the purging. Assuming you are trying to avoid having to inspect the entire dictionary during purging, I would suggest keeping around a second data structure to hold (timestamp, hash)
pairs.
This supplemental data structure could either be a plain old list
or a deque
(from the collections
module). Possibly the bisect
module could be handy to keep the number of timestamp comparisons down to a minimum (as opposed to comparing all the timestamps until you reach the cut-off value), but since you'd still have to iterate sequentially over the items that need to be purged, ironing out the exact details of what would be quickest requires some testing.
Edit:
For Python 2.7 or 3.1+, you could also consider using OrderedDict
(from the collections
module). This is basically a dict
with a supplementary order-preserving data structure built into the class, so you don't have to implement it yourself. The only hitch is that the only order it preserves is insertion order, so that for your purpose, instead of just reassigning an existing entry to a new timestamp, you'll need to remove it (with del
) and then assign a fresh entry with the new timestamp. Still, it retains the O(1) lookup and saves you from having to maintain the list of (timestamp, hash)
pairs yourself; when it comes time to purge, you can just iterate straight through the OrderedDict
, deleting entries until you reach one with a timestamp that is later than your cut-off.
If you're okay with working around occasional false positives, I think that a bloom filter may suit your needs well (it's very very fast)
http://en.wikipedia.org/wiki/Bloom_filter
and a python implementation: https://github.com/axiak/pybloomfiltermmap
EDIT: reading your post again, I think this will work, but instead of storing the hashes, just let the bloomfilter create the hashes for you. ie, I think you just want to use the bloomfilter as a set of timestamps. I'm assuming that your timestamps could basically just be a set since you are hashing them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With