Suppose you're interested in a bunch of independent time-varying values, each of which represents the current state of something. The values don't change on any fixed schedule and new values can't be predicted from old ones. For the sake of a concrete example, let's say you have a bunch of stocks and you're interested in keeping track of their values, and you get an update about an individual stock whenever a trade is made on that stock. (My actual problem isn't about stocks, but hopefully they make what I'm getting at more understandable.)
In addition to just knowing the current price of each stock, you'd also like to be able to pick an arbitrary point in the past and get a "snapshot" that told you what the most recent trading price for each stock was at that time. So for instance, you should be able to say "What was the most recent value of every stock I'm tracking as of 4:53PM last Tuesday?" and get a precise answer efficiently.
I can think of three ways to do this, but I'm not very happy with any of them.
1. Keep a journal. Maintain a single list of all trades in time-sequence order. Update just appends to the list; Snapshot scans backwards in time, starting from the latest entry whose timestamp is at or before the specified time, until it has seen a value for every stock. Update is O(1), but the scan might have to cover the entire journal, so Snapshot is O(u), where u is the total number of updates. Memory required is O(u) for obvious reasons.
2. Write checkpoints. Maintain a single journal as before, but instead of recording just the new stock price, each entry contains the current price (as of that update) of every stock. This is cheap to calculate, since the previous entry holds all the same information: copy it over, changing only the one stock whose price actually changed. Now Snapshot is O(log u) (binary search on the journal for the last entry at or before the specified timestamp). However, Update becomes O(s), where s is the number of stocks in the system, and total memory grows from O(u) in the first strategy to O(s*u) --- both of which are problems if s and u are large.
3. Separate journals. Maintain a separate journal for each stock and write each stock's updates into its own journal, again in chronological order. To snapshot, binary-search every journal for the latest update at or before the requested time (see the sketch after this list). This requires O(u) memory, Update is O(1), and Snapshot is O(s * log u). This is my favorite of the three, but I suspect it can be improved upon, since it ignores any relationship between the timing of updates across different stocks.
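To make the comparison concrete, here is a minimal Python sketch of strategy #3. The class and method names are mine, and it assumes updates always arrive in chronological order:

```python
import bisect

class SeparatedJournals:
    """One chronologically ordered journal per stock (hypothetical sketch)."""

    def __init__(self):
        # symbol -> parallel lists of timestamps and prices, in time order
        self._times = {}
        self._prices = {}

    def update(self, symbol, timestamp, price):
        # O(1): trades arrive in time order, so just append
        self._times.setdefault(symbol, []).append(timestamp)
        self._prices.setdefault(symbol, []).append(price)

    def snapshot(self, timestamp):
        # O(s * log u): one binary search per stock for the last
        # entry at or before the requested timestamp
        result = {}
        for symbol, times in self._times.items():
            i = bisect.bisect_right(times, timestamp)
            if i > 0:
                result[symbol] = self._prices[symbol][i - 1]
        return result
```

For example, after update("ACME", 100, 12.50), update("GLOBEX", 105, 7.00), and update("ACME", 110, 12.75), snapshot(107) returns {'ACME': 12.5, 'GLOBEX': 7.0}.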
Is there a better way I'm missing? Is this a problem that has been studied and has a generally-accepted solution?
Have a look at the literature on Persistent Data Structures. In particular, this early paper describes the construction of a persistent binary search tree that maintains logarithmic-time operations but can be accessed at any version (i.e., any point in time). Accesses to parts of the structure that weren't updated in some particular version naturally fall through to the last preceding version. So you would have your natural operations in O(log s) time, and the structure could occupy O(u) space if you know all your keys up front and never have to rebalance, or O(u * log s) space if every update modifies O(log s) pointers.
These class notes seem to describe what you would need to implement in fairly simple terms.
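As a rough illustration of the idea (and only the simple path-copying flavor, not the more space-efficient node-copying technique from the paper), here is a sketch of a persistent, unbalanced BST keyed by stock symbol; all names are mine, and rebalancing is omitted for brevity:

```python
import bisect

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    # Path copying: rebuild only the O(log s) nodes on the search path;
    # every untouched subtree is shared with the previous version.
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # same stock: new price

class PersistentPriceTree:
    def __init__(self):
        self._stamps = []     # update timestamps, chronological
        self._roots = [None]  # _roots[i] is the tree after i updates

    def update(self, timestamp, symbol, price):
        self._stamps.append(timestamp)
        self._roots.append(insert(self._roots[-1], symbol, price))

    def lookup(self, timestamp, symbol):
        # Pick the last version at or before `timestamp`, then search it.
        root = self._roots[bisect.bisect_right(self._stamps, timestamp)]
        while root is not None:
            if symbol < root.key:
                root = root.left
            elif symbol > root.key:
                root = root.right
            else:
                return root.value
        return None
```

Each update allocates one search path, which is where the O(u * log s) space figure above comes from; a full snapshot at time t is one binary search to pick the root plus an in-order traversal of that version, i.e. O(log u + s).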
I doubt you will find a solution that is excellent in all measures. What you choose depends largely on what tradeoffs you're willing to make. If snapshots are infrequent, #3 is great; if they're frequent, probably not: O(s log u) could be a killer for a source-control repository, for instance.
Here are some other ideas off the top of my head:
4. Periodic checkpoints. At a specified interval (every x hours, every y updates, whatever), write a checkpoint containing the current price of every stock. Reconstructing the data at a past point in time means finding the most recent checkpoint at or before that time and then applying the individual updates that follow it. This has the same asymptotic performance as #2, but the constant factors for updates and memory usage are much lower, since you take far fewer full checkpoints.
5. Delta-only checkpoints. Same as #4, but don't record the entire system at each checkpoint. Instead, store only the items that have changed since the previous checkpoint; unchanged entries are looked up in earlier checkpoints (see the sketch after this list). This saves a lot of time when writing a checkpoint and greatly reduces memory usage. If Δu is the average number of updates between checkpoints, then both the cost of writing a checkpoint and its size are O(Δu). That is effectively a fixed amount: the database grows over time, but the average number of updates per checkpoint does not. You might then consider update time amortized O(1) and memory usage O(u).
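Here is a rough sketch of how #5 might be laid out. All the names and the checkpoint interval are assumptions, and the early-exit bookkeeping assumes the set of stocks is known up front. Note the inherent tradeoff: a stock that trades rarely can force the backward walk through many checkpoints.

```python
import bisect

CHECKPOINT_EVERY = 100  # assumed tuning knob

class DeltaCheckpointStore:
    def __init__(self, n_stocks):
        self._n = n_stocks      # early-exit bound; assumes the stock set is known
        self._log = []          # (timestamp, symbol, price), chronological
        self._cp_times = []     # checkpoint timestamps
        self._cp_deltas = []    # per checkpoint: symbol -> last price in window
        self._cp_log_index = [] # position in _log where each checkpoint was cut
        self._pending = {}      # changes since the last checkpoint

    def update(self, timestamp, symbol, price):
        # O(1) amortized: append to the log and remember the delta
        self._log.append((timestamp, symbol, price))
        self._pending[symbol] = price
        if len(self._log) % CHECKPOINT_EVERY == 0:
            self._cp_times.append(timestamp)
            self._cp_deltas.append(self._pending)
            self._cp_log_index.append(len(self._log))
            self._pending = {}

    def snapshot(self, timestamp):
        result = {}
        # Replay the (at most ~CHECKPOINT_EVERY) updates that fall after
        # the newest checkpoint at or before the requested time.
        c = bisect.bisect_right(self._cp_times, timestamp)
        start = self._cp_log_index[c - 1] if c > 0 else 0
        for ts, symbol, price in self._log[start:]:
            if ts > timestamp:
                break
            result[symbol] = price  # later entries overwrite earlier ones
        # Walk checkpoints backwards, filling in stocks not yet resolved.
        for i in range(c - 1, -1, -1):
            if len(result) == self._n:
                break
            for symbol, price in self._cp_deltas[i].items():
                result.setdefault(symbol, price)
        return result
```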
For what it's worth, some years ago I wrote a wiki clone. One of the problems I encountered was how to store the page deltas. Do I store just the diffs, or do I store the full page text each update? How do I balance speed versus memory usage? Applying dozens or hundreds of diffs in a row to reconstruct a page could be overly slow, but storing the entire page if someone only changes one sentence would be quite wasteful.
I wanted something that would scale well even for large, frequently updated pages.
I ended up with a hybrid approach similar to #5: I store diffs with periodic full-page snapshots. To decide when to take a snapshot, I compare the new page text to the most recent snapshot's text. If the diff is more than half as large as the full page text, I store the full text instead of the diff (sketched below). That way, if people are making small updates I can store diffs, but once the page has changed enough I take a new snapshot.
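In Python terms, the decision rule might look something like this. The function name is mine, and difflib is just one way to produce the delta; applying the stored diffs back to reconstruct a page would need a separate patch routine, omitted here:

```python
import difflib

def make_revision(snapshot_text, new_text):
    # Diff the new page against the most recent full snapshot.
    diff = "".join(difflib.unified_diff(
        snapshot_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True)))
    if len(diff) > len(new_text) // 2:
        # The page changed too much: store the full text as the new baseline.
        return ("snapshot", new_text)
    # Small edit: store only the delta against the current baseline.
    return ("diff", diff)
```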