Constraint database

I understand the intuition behind constraint programming, but I have never actually programmed with a constraint solver. That said, I think achieving what we would define as consistent data is a different situation.

Context:

We have a set of rules to implement on an ETL server. Each rule is one of the following kinds:

1. acting on a single row;
2. acting across rows, in one table or in several;
3. acting the same way between runs (the same constraint should hold over all data, or over just the last n runs).

The third kind differs from the second in scope: it is an inter-row rule that must hold over a well-defined number of runs. It might apply to a single run (one file), to the previous 1 to n runs, or to all files.

Technically, as we designed the ETL, it has no memory between two runs, i.e. between two files (though this is to be rethought).

To apply the third kind of rule, the ETL needs memory (I think we would end up backing up data in the ETL). The alternative is a job that periodically re-checks the whole database after some time window, in which case data landing in the database does not necessarily satisfy the third kind of rule in time.

Example:

While data flows in continuously, we apply constraints so that the whole database stays constrained. The next day we receive a backup or a correction covering, say, one month. For that time window we would like the constraints to be satisfied within this run only, without worrying about the whole database. For future runs, all data should be constrained as before, without worrying about past data. You can imagine other rules that would fit temporal logic.

For now, only the first kind of rule is implemented. My idea is to keep a small side database (of any kind: MySQL, PostgreSQL, MongoDB ...) that backs up all the data (only the constrained columns, probably as hashed values), with flags recording consistency with respect to the earlier kinds of rules.
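A minimal sketch of such a side store, assuming SQLite and SHA-256 purely for illustration. The table name `memory`, the column `rule1_ok` and the helpers `row_digest`, `remember` and `recall` are all hypothetical names, not part of any existing tool:

```python
import hashlib
import sqlite3

def row_digest(row, constrained_columns):
    """Hash only the constrained columns, in a fixed (sorted) order."""
    payload = "\x1f".join(f"{c}={row[c]}" for c in sorted(constrained_columns))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical side store: one line per received row, keyed by run.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE memory (run_id INTEGER, digest TEXT, rule1_ok INTEGER)")

def remember(run_id, row, constrained_columns, rule1_ok):
    """Record the digest of a row together with a consistency flag."""
    store.execute(
        "INSERT INTO memory VALUES (?, ?, ?)",
        (run_id, row_digest(row, constrained_columns), int(rule1_ok)),
    )

def recall(row, constrained_columns):
    """Return (run_id, rule1_ok) pairs for earlier rows with the same constrained values."""
    cur = store.execute(
        "SELECT run_id, rule1_ok FROM memory WHERE digest = ? ORDER BY run_id",
        (row_digest(row, constrained_columns),),
    )
    return cur.fetchall()
```

Note that hashed values only support equality-style rules (duplicates, matches between runs). A comparison such as column1 > column2 across runs would need the raw values.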

Question: Are there any existing solutions or design alternatives that would ease this process?

To illustrate in a cooked-up programming language, here is an example of a set of rules and the actions that follow:

run1 : WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
       BACK-UP 
       FLAG tableA.rule1
AFTER run1 : LOG ('WARN')

run2 : WHEN tableA.column1 > 0
       DO NOT BACK-UP 
       FLAG tableA.rule2
AFTER run2 : LOG ('ERROR')
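For readers who prefer something executable, the two pseudo-rules could be sketched in plain Python roughly as follows. The `Rule` dataclass, `apply_rules` and the dict-based rows are hypothetical. Logging happens per firing rather than after the run, and a row is backed up when at least one fired rule requests it; both are assumptions, since the pseudo-language does not say how conflicts resolve:

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl.rules")

@dataclass
class Rule:
    name: str                                     # flag set on the tableA row
    when: Callable[[dict, Optional[dict]], bool]  # predicate over (row_a, matching row_b)
    backup: bool                                  # BACK-UP / DO NOT BACK-UP
    level: int                                    # log level used when the rule fires

RULES = [
    # WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
    Rule("rule1", lambda a, b: b is not None and a["column1"] > b["column2"],
         True, logging.WARNING),
    # WHEN tableA.column1 > 0
    Rule("rule2", lambda a, b: a["column1"] > 0, False, logging.ERROR),
]

def apply_rules(table_a, table_b, rules=RULES):
    """Return (flags per tableA ID, list of tableA rows to back up)."""
    b_by_id = {row["ID"]: row for row in table_b}
    flags, backups = {}, []
    for a in table_a:
        b = b_by_id.get(a["ID"])  # the row with tableA.ID == tableB.ID, if any
        fired = [r for r in rules if r.when(a, b)]
        flags[a["ID"]] = [r.name for r in fired]
        for r in fired:
            log.log(r.level, "%s fired for ID=%s", r.name, a["ID"])
        if any(r.backup for r in fired):
            backups.append(a)
    return flags, backups
```

Keeping rules as data (a predicate plus the actions attached to it) makes it easy to add a scope field later, for instance "this run only" or "last n runs".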

Note: Constraint programming is in theory a paradigm for solving combinatorial problems, and in practice it can speed up both development and execution. I think my case is different from a constraint-solving problem: the primary purpose is not to optimize constraints before resolution, and probably not even to restrict data domains. Its main concern is to apply rules when data is received and to execute some basic actions (reject a line, accept a line, log ...).

I really hope this question is not too broad and that this is the right place for it.

asked Oct 02 '19 10:10

Curcuma_

2 Answers

I found a solution that does more than I was looking for in terms of checking data consistency. Apparently this is what is called test-driven data analysis (TDDA).

With this implementation we are tied to Python and Pandas, but fortunately not exclusively: it can also check data consistency in MySQL, PostgreSQL ... tables.

A plus I had not thought of is that rules can be inferred from sample data, which could help when setting rules up. This is why there are both tdda.constraints.verify_df and tdda.constraints.discover_df.

As far as I have read, it does not offer a way to check a weaker form of consistency over the last n files. I had in mind something we could call batch-file consistency, which only guarantees that a rule is satisfied over some set of runs (the last n runs) rather than over all data. The library acts on single files, so it needs higher-level wiring to apply a condition across n files that arrive successively.
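As a rough idea of what that higher-level wiring might look like, here is a minimal plain-Python sketch that remembers only the previous n runs and checks a rule across that window. `RunWindow` is a hypothetical helper, not part of tdda, and the uniqueness rule is just an example of an inter-row rule:

```python
from collections import deque

class RunWindow:
    """Remembers the constrained values of the previous `n` runs only."""

    def __init__(self, n):
        # A bounded deque drops the oldest run automatically.
        self.runs = deque(maxlen=n)

    def check_and_add(self, values):
        """Return the values of the new run that break uniqueness.

        A value is a violation if it already appeared in one of the
        previous n runs or earlier in the same run. The new run is then
        added to the window, pushing out the oldest one if needed.
        """
        seen = set().union(*self.runs)
        current = set()
        violations = []
        for v in values:
            if v in seen or v in current:
                violations.append(v)
            current.add(v)
        self.runs.append(current)
        return violations
```

Each incoming file would first go through the per-file checks, then through `check_and_add` with its constrained column. Anything older than n runs is simply forgotten, which is the weaker consistency described above.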

For more: https://tdda.readthedocs.io/en/latest/constraints.html#module-tdda.constraints

assertCSVFilesCorrect checks a set of CSV files in a directory; the same is possible for Pandas DataFrames, etc.

From the official documentation:

The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relation databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records.

PS: I am still open to other solutions. Let me know, as I imagine this is a use case for any ETL solution.

I have also opened a bounty to attract further answers.

answered Sep 26 '22 05:09

Curcuma_

You can also look into SQL transactions. A transaction consists of one or more statements submitted for execution by a single user or application. These statements can read or modify data in the database.

START TRANSACTION;
-- do the database work and check whether any constraint is violated
COMMIT;  -- or ROLLBACK; if a constraint was violated

You can specify certain constraints and use ROLLBACK if one of them is violated. The rollback can be coded explicitly by the developer, but it can also be triggered by the system (e.g. when an error occurs that the developer does not handle explicitly, or while a trigger is executing). Transactions must not interfere with each other; they have to be executed in an "isolated" manner. Several concurrent transactions must produce the same results in the data as those same transactions executed sequentially, in some (unspecified) order. Since most relational DBMSs provide ACID guarantees for transactions, their execution is reliable, so a committed transaction should not leave the database inconsistent with respect to the declared constraints.
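A minimal sketch of the idea using Python's built-in sqlite3 module, chosen only so the example is self-contained. The table layout, the CHECK constraint and `load_batch` are illustrative assumptions. The whole batch is inserted inside one transaction, and a single violating row rolls back the entire batch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK clause is the declared constraint: column1 must be positive.
conn.execute(
    "CREATE TABLE tableA (id INTEGER PRIMARY KEY, column1 INTEGER CHECK (column1 > 0))"
)

def load_batch(conn, rows):
    """Insert all rows of a batch, or none of them."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.executemany("INSERT INTO tableA (id, column1) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

ok = load_batch(conn, [(1, 10), (2, 20)])        # both rows satisfy the CHECK
rejected = load_batch(conn, [(3, 30), (4, -5)])  # second row violates it
count = conn.execute("SELECT COUNT(*) FROM tableA").fetchone()[0]
print(ok, rejected, count)
```

This covers rules that can be expressed as database constraints on a single batch. It does not by itself scope a rule to "the last n runs".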

Not sure if this is what you mean, but maybe it helps.

answered Sep 25 '22 05:09

Psychotechnopath

Tags: validation, database, constraints, etl
