I know the intuition behind constraint programming, although, so to speak, I have never really experienced programming with a constraint solver. I also think that achieving what we would define as consistent data is a different kind of problem.
Context:
We have a set of rules to implement on an ETL server. These rules are either:
- applied to a single run (one file),
- applied across the whole database (all data received so far), or
- applied across a well-defined number of runs.
The third case is different from the second in that it holds when the second case holds, but only for a well-defined number of runs: it might be applied to one single run (one file), to a window of 1 to n previous runs, or to all files.
Technically, as we conceived the ETL, it has no memory between two runs, i.e. between two files (though this is to be rethought).
To apply the third kind of rule, the ETL needs to have memory (I think we would end up backing up data inside the ETL), or we would have to re-check the whole database repeatedly (a job) after some time window. Either way, data ending up in the database would not necessarily fulfill the third kind of rule in time.
Example:
While data flows in continuously, we apply constraints so that the whole database stays constrained. But suppose the next day we receive a backup, or corrected data, covering say one month. For that time window we would like the constraints to be satisfied for this run only, without worrying about the whole database; for future runs, all data should again be constrained as before, without worrying about past data. You can imagine other rules that would fit temporal logic.
For now, we only have the first kind of rule implemented. The way I thought of it is to have a small side database (of any kind: MySQL, PostgreSQL, MongoDB, ...) that backs up all data (only the constrained columns, probably with hashed values), with flags indicating consistency according to the earlier kinds of rules.
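To make that concrete, here is a rough sketch of such a side database using sqlite; the schema, the per-rule flag columns, and the hashing scheme are my assumptions drawn from the description above, not an existing design:

```python
import hashlib
import sqlite3

conn = sqlite3.connect('consistency.db')
conn.execute('''CREATE TABLE IF NOT EXISTS backup (
    run_id   TEXT,     -- which run (file) the row came from
    row_hash TEXT,     -- hash of the constrained columns only
    rule1_ok INTEGER,  -- one consistency flag per rule of the first kind
    rule2_ok INTEGER
)''')

def backup_row(run_id, constrained_values, flags):
    """Store a hashed copy of the constrained columns plus the consistency flags."""
    digest = hashlib.sha256('|'.join(map(str, constrained_values)).encode()).hexdigest()
    conn.execute('INSERT INTO backup VALUES (?, ?, ?, ?)',
                 (run_id, digest, int(flags['rule1']), int(flags['rule2'])))
    conn.commit()

backup_row('run1', ['ID-42', 17], {'rule1': True, 'rule2': False})
```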
Question: are there any solutions or design alternatives that would ease this process?
To illustrate, in a cooked-up pseudo-language, here is an example of a set of rules and the actions that follow:
```
run1 : WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
       BACK-UP
       FLAG tableA.rule1
AFTER run1 : LOG('WARN')

run2 : WHEN tableA.column1 > 0
       DO NOT BACK-UP
       FLAG tableA.rule2
AFTER run2 : LOG('ERROR')
```
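For what it's worth, a minimal Python/pandas sketch of how rules of the first kind, and the actions that follow them, could be wired; the rule-tuple layout, the backup file names, and the flag-as-extra-column convention are my assumptions, not an existing API:

```python
import logging

logging.basicConfig(level=logging.INFO)

def rule1(tables):
    # WHEN tableA.ID == tableB.ID AND tableA.column1 > tableB.column2
    merged = tables["tableA"][["ID", "column1"]].merge(
        tables["tableB"][["ID", "column2"]], on="ID")
    bad_ids = merged.loc[merged["column1"] > merged["column2"], "ID"]
    return tables["tableA"]["ID"].isin(bad_ids)

def rule2(tables):
    # WHEN tableA.column1 > 0
    return tables["tableA"]["column1"] > 0

# (name, predicate, back up matching rows?, log level for the AFTER action)
RULES = [
    ("rule1", rule1, True, logging.WARNING),
    ("rule2", rule2, False, logging.ERROR),
]

def apply_rules(tables):
    """Evaluate every rule against one run (one set of files) and fire its actions."""
    for name, predicate, backup, level in RULES:
        mask = predicate(tables)
        tables["tableA"][name] = mask                            # FLAG tableA.<name>
        if backup and mask.any():                                # BACK-UP
            tables["tableA"][mask].to_csv(f"backup_{name}.csv", index=False)
        logging.log(level, "%s matched %d rows", name, int(mask.sum()))
```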
Note: while constraint programming is, in theory, a paradigm for solving combinatorial problems, and in practice can speed up problem development and execution, I think this is different from a constraint-solving problem. The primary purpose here is not to optimize constraints before resolution, and probably not even to limit data domains; the main concern is to apply rules on data reception and execute some basic actions (reject a line, accept a line, logging, ...).
I really hope this is not too broad a question and that this is the right place to ask.
I found a sophisticated solution that achieves more than I had hoped for, as far as checking data consistency is concerned. Apparently, this is what we would call test-driven data analysis (TDDA).
With this implementation we are bound to Python and Pandas, but fortunately not exclusively: we can also check the consistency of data in MySQL, PostgreSQL, ... tables.
A bonus I had not thought of is that we can infer rules from sample data, which is helpful for setting up the rules in the first place.
This is why there are tdda.constraints.verify_df and tdda.constraints.discover_df.
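A short example of that workflow on a toy DataFrame (the file name constraints.tdda is arbitrary):

```python
import pandas as pd
from tdda.constraints import discover_df, verify_df

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', 'nine']})

# Discover constraints from sample data and persist them as JSON.
constraints = discover_df(df)
with open('constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Later, e.g. on every new run, verify a dataset against the saved constraints.
v = verify_df(df, 'constraints.tdda')
print('Passes:', v.passes)
print('Failures:', v.failures)
```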
As far as I have read, it does not offer a solution for checking a (weaker) consistency on the last n files, something we could call batch-file consistency: a rule that only has to be satisfied for some set of runs (the last n runs) and not for all data. It only acts on single files; it would need higher-level wiring to be able to condition on the n files that arrive successively, as in the sketch below.
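A minimal sketch of that higher-level wiring, assuming the runs land as CSV files in one directory and reusing verify_df; the function name, directory layout, and window size are my assumptions:

```python
import glob
import os

import pandas as pd
from tdda.constraints import verify_df

def verify_last_n_runs(directory, constraints_path, n=5):
    """Check the constraints over the last n run files taken together,
    instead of over the whole database."""
    files = sorted(glob.glob(os.path.join(directory, '*.csv')),
                   key=os.path.getmtime)[-n:]
    window = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    return verify_df(window, constraints_path)

v = verify_last_n_runs('runs/', 'constraints.tdda', n=5)
print('Passes:', v.passes, 'Failures:', v.failures)
```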
For more: https://tdda.readthedocs.io/en/latest/constraints.html#module-tdda.constraints
assertCSVFilesCorrect checks a set of files in a directory; the same is possible for Pandas DataFrames, etc.
From the official documentation:
The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relational databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records.
PS: I am still open to other solutions; let me know, as I imagine this is a use case for any ETL solution. I have also opened a bounty to further enrich the answers.
You can also look into SQL transactions. A transaction consists of one or more statements that a single user or application asks to be executed; they can read or even modify data in a database.
```
START TRANSACTION;
-- do DB stuff, check whether constraints are violated
COMMIT;
```
You can specify certain constraints and use ROLLBACK if one of them is violated. The rollback can be explicitly coded by the developer, but it can also be triggered by the system (e.g. when an error appears that is not handled explicitly by the developer, or when executing a trigger). Transactions must not get in each other's way; they have to be executed in an "isolated" manner: several concurrent transactions must produce the same results in the data as the same transactions executed sequentially, in some (unspecified) order. Since all modern DBMSs guarantee the ACID properties for transactions, their execution is reliable, so the state of your database shouldn't contain any inconsistencies.
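For illustration, a minimal sketch of this pattern in Python with sqlite3; the table and the CHECK constraint are made up, and any DBMS with transactions works the same way:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# A CHECK constraint: the DBMS itself rejects rows that violate it.
conn.execute('CREATE TABLE tableA ('
             'ID INTEGER PRIMARY KEY, '
             'column1 INTEGER CHECK (column1 > 0))')

rows = [(1, 5), (2, -3)]  # the second row violates the CHECK constraint
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.executemany('INSERT INTO tableA (ID, column1) VALUES (?, ?)', rows)
except sqlite3.IntegrityError as e:
    print('Whole batch rolled back:', e)

# Nothing was committed: the valid first row was rolled back along with the bad one.
print(conn.execute('SELECT COUNT(*) FROM tableA').fetchone()[0])  # prints 0
```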
Not sure if this is what you mean, but maybe it helps.