Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking for an efficient way to store history data

Tags:

python

storage

The data is a python dict representing a state of something that changes slowly over the time. Values change often, usually one or two items at a time. The keys can change too, but that's a rare event. After each change the new data set is remembered for future examination.

The result is a long sequence with increasing timestamps. A very simple example of "b" turned on and off and on again:

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

This sequence is very convenient to work with, but obviously quite inefficient. Almost the same data is copied over and over. The real dict has about 100 items. That's why I'm looking for a different way to store the data history both in memory and on a disk.

I'm pretty sure this has been addressed many times in the past. Is there any standard/recommended way for this problem? The solution doesn't have to be perfect. Good enough is good enough.


This is what I would do unless some kind soul shows a better approach. Storing just incremental changes is space efficient:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, INCREMENTAL, {'b':False}),
(timestamp4, INCREMENTAL, {'b':True}),

However the data is not easy to access, because it must be restored in several steps from last FULL state. To limit the drawback, every N-th record will be stored as FULL, all others as INCREMENTAL.

I would probably add this small improvement: adding a reference to the same state already recorded in order to prevent duplication:

(timestamp1, FULL, {'a':False, 'b':False, 'c':False}),
(timestamp2, INCREMENTAL, {'b':True}),
(timestamp3, SAME_AS, timestamp1),
(timestamp4, SAME_AS, timestamp2),
like image 508
VPfB Avatar asked Jul 19 '16 18:07

VPfB


People also ask

What should you recommend for historical data storage?

A Data Warehouse is a nice solution for this kind of situation. ETL will give you lots of options in dealing with data flows. Your basic concept of 'History' vs 'Active' is quite correct. Your history data will be more efficient and flexible if kept in a data warehouse with all their dimension and fact tables.

Where is historical data stored?

Data is maintained in Data Warehouse according to a schedule. As data gets older, the data record retention is reduced. Data Warehouse retains historical data based on the data marts and granularity of the data, as shown in the following table.

What is historical data in database?

Historical data, in a broad context, is collected data about past events and circumstances pertaining to a particular subject. By definition, historical data includes most data generated either manually or automatically within an enterprise.

Why is it important to keep historical data?

Historical data enables the tracking ofimprovement over time which gives key insights. These insights are essential for driving a business. Marketers are always on the run to better understand and segment the customers. Keeping historical data can help marketers understand iftheir customer segment is changing.


1 Answers

A more space-efficient approach is to keep a set for each "column" of data. That is, we keep a set for columns a, b, and c. The set keeps track of the timestamps for which the column's value is True. For instance, for the data:

(timestamp1, {'a':False, 'b':False, 'c':False}),
(timestamp2, {'a':False, 'b':True, 'c':False}),
(timestamp3, {'a':False, 'b':False, 'c':False}), 
(timestamp4, {'a':False, 'b':True, 'c':False}),

the set for column a will be empty, the set for column b will contain timestamps 2 and 4, and the set for column c will again be empty.

Note that this is more-or-less the approach one might take to store a sparse binary vector. Rather than store the entire vector, we just keep track of where the vector is 1. In fact, you might want to consider using a sparse matrix data type from SciPy.

Sets offer efficient (constant time) membership lookup, so this is also a time-efficient way of doing this.

To make the data easy to access you can write a small class which wraps the sets. For example:

class SparseStates(object):

    def __init__(self, columns):
        self.data = {col: set() for col in columns}

    def __getitem__(self, key):
        row, column = key
        return row in self.data[column]

    def turn_on(self, row, column):
        self.data[column].add(row)

Usage:

>>> states = SparseStates(['a', 'b', 'c'])
>>> states.turn_on(2, 'b')
>>> states.turn_on(4, 'b')
>>> states[2, 'a']
False
>>> states[2, 'b']
True
>>> states.data['a']
{}
>>> states.data['b']
{2, 4}
like image 59
jme Avatar answered Oct 23 '22 06:10

jme