Looking for an efficient way to store history data

Tags: python, storage

The data is a Python dict representing the state of something that changes slowly over time. Values change most often, usually one or two items at a time. The keys can change too, but that is a rare event. After each change, the new data set is stored for future examination.

The result is a long sequence with increasing timestamps. A very simple example of "b" turned on and off and on again:

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

This sequence is very convenient to work with, but obviously quite inefficient: almost the same data is copied over and over. The real dict has about 100 items. That's why I'm looking for a different way to store the data history, both in memory and on disk.

I'm pretty sure this has been addressed many times in the past. Is there a standard or recommended approach to this problem? The solution doesn't have to be perfect. Good enough is good enough.

This is what I would do unless some kind soul shows a better approach. Storing just incremental changes is space efficient:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, INCREMENTAL, {'b':False}),
(timestamp4, INCREMENTAL, {'b':True}),

However, the data is not easy to access, because a state must be rebuilt in several steps starting from the last FULL record. To limit this drawback, every N-th record will be stored as FULL and all others as INCREMENTAL.
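
A minimal sketch of the FULL/INCREMENTAL scheme described above, to show what writing and restoring could look like. The class name `DeltaHistory`, the `full_every` parameter (the "N") and the `_MISSING` sentinel for deleted keys are assumptions made for illustration; they are not part of the question.

```python
FULL, INCREMENTAL = "FULL", "INCREMENTAL"
_MISSING = object()  # hypothetical marker for a key that was removed


class DeltaHistory:
    """Store every N-th record as FULL and the rest as INCREMENTAL diffs."""

    def __init__(self, full_every=4):
        self.full_every = full_every  # the "N" from the text (assumed name)
        self.records = []             # list of (timestamp, kind, payload)
        self._last = None             # most recent complete state

    def append(self, timestamp, state):
        if self._last is None or len(self.records) % self.full_every == 0:
            self.records.append((timestamp, FULL, dict(state)))
        else:
            # Keys that are new or whose value changed.
            diff = {k: v for k, v in state.items()
                    if k not in self._last or self._last[k] != v}
            # Keys that disappeared (rare, according to the question).
            diff.update({k: _MISSING for k in self._last if k not in state})
            self.records.append((timestamp, INCREMENTAL, diff))
        self._last = dict(state)

    def state_at(self, index):
        """Rebuild the complete dict for record number `index`."""
        start = index
        while self.records[start][1] != FULL:  # walk back to the last FULL
            start -= 1
        state = dict(self.records[start][2])
        for _, _, diff in self.records[start + 1:index + 1]:
            for key, value in diff.items():
                if value is _MISSING:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state


history = DeltaHistory(full_every=3)
history.append(1, {'a': False, 'b': False, 'c': False})
history.append(2, {'a': False, 'b': True, 'c': False})
history.append(3, {'a': False, 'b': False, 'c': False})
history.append(4, {'a': False, 'b': True, 'c': False})
print(history.records[1])   # (2, 'INCREMENTAL', {'b': True})
print(history.state_at(2))  # {'a': False, 'b': False, 'c': False}
```

Restoring a state replays at most `full_every - 1` diffs. For storage on disk, the in-memory sentinel would have to be replaced by something serializable.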

I would probably add one small improvement: a reference to an identical state that was already recorded, in order to prevent duplication:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, SAME_AS, timestamp1),
(timestamp4, SAME_AS, timestamp2),
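
One possible way to detect an already recorded state is a lookup table keyed by a hashable fingerprint of the dict. The sketch below assumes all values are hashable and, to stay short, stores every new state as FULL rather than INCREMENTAL; the names `freeze` and `dedupe` are made up for this example.

```python
FULL, SAME_AS = "FULL", "SAME_AS"


def freeze(state):
    # Hashable fingerprint of a dict; assumes every value is hashable.
    return frozenset(state.items())


def dedupe(snapshots):
    """Replace repeated states with a SAME_AS reference to their first timestamp."""
    first_seen = {}  # fingerprint -> timestamp of the first occurrence
    records = []
    for timestamp, state in snapshots:
        key = freeze(state)
        if key in first_seen:
            records.append((timestamp, SAME_AS, first_seen[key]))
        else:
            first_seen[key] = timestamp
            records.append((timestamp, FULL, dict(state)))
    return records


snapshots = [
    (1, {'a': False, 'b': False, 'c': False}),
    (2, {'a': False, 'b': True, 'c': False}),
    (3, {'a': False, 'b': False, 'c': False}),
    (4, {'a': False, 'b': True, 'c': False}),
]
records = dedupe(snapshots)
print(records[2])  # (3, 'SAME_AS', 1)
print(records[3])  # (4, 'SAME_AS', 2)
```

The lookup table grows with the number of distinct states, so this only pays off when states actually repeat.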

asked Jul 19 '16 18:07

VPfB

1 Answer

A more space-efficient approach is to keep a set for each "column" of data. That is, we keep a set for columns a, b, and c. The set keeps track of the timestamps for which the column's value is True. For instance, for the data:

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

the set for column a will be empty, the set for column b will contain timestamps 2 and 4, and the set for column c will again be empty.

Note that this is more-or-less the approach one might take to store a sparse binary vector. Rather than store the entire vector, we just keep track of where the vector is 1. In fact, you might want to consider using a sparse matrix data type from SciPy.

Sets offer efficient membership lookup (constant time on average), so this approach is time-efficient as well.

To make the data easy to access you can write a small class which wraps the sets. For example:

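class SparseStates(object):
    """Per-column sets of the rows (timestamps) at which the value is True."""

    def __init__(self, columns):
        # One initially empty set per column.
        self.data = {col: set() for col in columns}

    def __getitem__(self, key):
        # Indexed as states[row, column]; True if the row is in the column's set.
        row, column = key
        return row in self.data[column]

    def turn_on(self, row, column):
        # Record that `column` is True at `row`.
        self.data[column].add(row)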

Usage:

>>> states = SparseStates(['a', 'b', 'c'])
>>> states.turn_on(2, 'b')
>>> states.turn_on(4, 'b')
>>> states[2, 'a']
False
>>> states[2, 'b']
True
>>> states.data['a']
set()
>>> states.data['b']
{2, 4}
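
To relate the per-column sets back to the original list of `(timestamp, dict)` pairs, a conversion in both directions could look like the sketch below. It assumes every value is a boolean; the function names `from_history` and `row_as_dict` are invented for this example.

```python
def from_history(history):
    """Build per-column sets from (timestamp, dict) pairs with boolean values."""
    column_sets = {}
    for timestamp, state in history:
        for column, value in state.items():
            rows = column_sets.setdefault(column, set())
            if value:
                rows.add(timestamp)  # remember only where the value is True
    return column_sets


def row_as_dict(column_sets, row):
    """Rebuild the original {column: bool} dict for one timestamp."""
    return {column: row in rows for column, rows in column_sets.items()}


history = [
    (1, {'a': False, 'b': False, 'c': False}),
    (2, {'a': False, 'b': True, 'c': False}),
    (3, {'a': False, 'b': False, 'c': False}),
    (4, {'a': False, 'b': True, 'c': False}),
]
column_sets = from_history(history)
print(column_sets['b'])             # {2, 4}
print(row_as_dict(column_sets, 3))  # {'a': False, 'b': False, 'c': False}
```

Note that with this layout a timestamp that was never recorded looks the same as one where every value is False, so the list of valid timestamps may need to be kept separately.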

answered Oct 23 '22 06:10

jme
