
Modelling Metadata about a Mathematical Calculation in Neo4j

I am new to the forum and just getting started with Neo4j. Apologies for my long-winded question and the background information, but I think it helps to explain what I am trying to understand.

I often work on Business Intelligence and Data Warehouse projects for companies. When we create Business Intelligence requirements we typically need to list the business metrics that we are interested in (things like Sales Revenue, Profit Ratio, Total Expenses) and document how these metrics are calculated using data attributes from our underlying systems. Typically we document most of this work in Excel in the form of data requirements spreadsheets: a list of business metrics, followed by a stack of columns with a description, source data attributes, calculation etc. What I am trying to do (as a personal side project) is develop an application that we can use to document this type of metadata instead. I have read a few of the Neo4j books and online articles and I think that Neo4j is well suited to this use case. Right now I am trying to document a basic data model to help me get started.

At first I came up with something fairly straightforward, as shown in the image on the left below, starting from the point that:

Sales Revenue = Unit_Price * Count_Units_Sold

First Attempt at Modelling Metrics & Attributes

However, I quickly realised that the calculation itself is very important to me and that I might at a later point want to capture more information about it, such as adding different versions of a calculation or adding notes to further describe it. I modified the model to make the "calculation itself" a separate node, as per the image on the right above.

However, when I start to look at more complex metrics I am still not sure how best to represent the details of the calculation. If I take the example below, I would model it as follows.

Salary = Salary_Amount + Overtime_Amount - Tax_Amount

More Complex Example

Now this clearly represents the data attributes (3 of them) that are used in the calculation, but I don't know how to represent the calculation itself, e.g. to define that the calculation is done by first adding Salary_Amount to Overtime_Amount and then subtracting Tax_Amount. When I have more complex calculations involving division and multiplication, which need to be performed in a particular order, this will get even more complex. Essentially I want to be able to infer from the model that the calculation is as follows:

Salary = Salary_Amount + Overtime_Amount - Tax_Amount

As opposed to:

Salary = Salary_Amount * Tax_Amount / Overtime_Amount

Or:

Salary = Tax_Amount * Overtime_Amount - Salary_Amount

I am looking for some way to define the Calculation node whereby I can apply an ordering to the way the data attributes are used. It might be that I should just store the calculation as a text string in a property of the Calculation node, but I can't help but think that this could cause me pain down the road and limit my ability to get useful information from the graph when multiple data attributes are used in different calculations.

Note: I did see this question on the forum that is on a similar topic, but it didn't receive many responses, so even though my question is similar I thought that providing some more background information might bring some further insights.

Thanks a lot, Michael


I am editing this question after reviewing the answers by @ChristopheWillemsen and @stdob--.

Firstly, thanks a lot to both contributors. The answers and reference material were really helpful and both covered my requirements. Initially I had leaned towards Reverse Polish Notation, as per the answer from @stdob--, because it offered a neat way to handle grouped operations (e.g. parentheses in my mathematical formulas). However, after trying to model my data in both ways I found that I had additional requirements that I did not cover in my first post: capturing logical expressions such as If, Where and Having. Basically I want to be able to capture ETL-type transformation rules, which go beyond pure mathematical expressions, and I think that the solution by @ChristopheWillemsen will support this.

Here is how I have modelled my basic formulas using this approach:

Basic Calc following Method 1

However, I also have more complex logic that I want to model. These are ETL-type rules that would typically be captured as pseudocode or in the form of SQL when defining business requirements for a data warehouse or BI project. Below is an example where I am defining the logic for how an ETL process could calculate the New Claims Count metric for an insurance company.

New Claims Count Calculation

This is how I have then modelled it, extending the solution that @ChristopheWillemsen provided in the first answer below.

New Claims Count Modelled

Could you take a look at this and see if this is an appropriate way to model this? From a requirements point of view I will want to be able to:

  • Reconstruct the logic so that I can present it back to end users.
  • Answer questions such as which metrics this attribute is needed for.
  • Carry out what-if analysis (e.g. if an attribute value changes, what is the impact on metrics that use this attribute).

Does this look like an appropriate approach to model this type of information? Any suggestions or improvements would be welcome.
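To make the second and third requirements concrete, here is a minimal sketch in plain Python rather than Neo4j. It assumes a hypothetical USES mapping from each metric to the attributes or metrics it is calculated from (the Profit_Ratio inputs are invented for illustration) and walks that mapping in reverse to find everything affected by a change to one attribute. In the graph itself this would presumably be a variable-length traversal over the relationships that link attributes to calculations and metrics.

```python
from collections import defaultdict

# Hypothetical dependency edges: metric -> the inputs it is calculated from.
# Profit_Ratio's inputs are an assumption made only for this illustration.
USES = {
    "Sales_Revenue": ["Unit_Price", "Count_Units_Sold"],
    "Salary": ["Salary_Amount", "Overtime_Amount", "Tax_Amount"],
    "Profit_Ratio": ["Sales_Revenue", "Total_Expenses"],
}


def build_reverse_index(uses):
    """Map each input name to the set of metrics that use it directly."""
    used_by = defaultdict(set)
    for metric, inputs in uses.items():
        for name in inputs:
            used_by[name].add(metric)
    return used_by


def impacted_metrics(attribute, uses):
    """Return every metric that depends on `attribute`, directly or indirectly."""
    used_by = build_reverse_index(uses)
    seen = set()
    stack = [attribute]
    while stack:
        current = stack.pop()
        for metric in used_by.get(current, ()):
            if metric not in seen:
                seen.add(metric)
                stack.append(metric)  # a metric can feed other metrics
    return seen


print(sorted(impacted_metrics("Unit_Price", USES)))
```

Because Sales_Revenue is itself an input to the hypothetical Profit_Ratio, a change to Unit_Price reaches both metrics.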

n4nite asked Feb 03 '17 10:02


2 Answers

This is a very interesting use case and to me it comes close to what we call Rules Engines.

I posted a use case about it on the Neo4j blog: https://neo4j.com/blog/uncommon-use-cases-graph-databases/

Of course there are multiple ways of achieving what you want and I will share one way I have in mind.

I would treat a calculation as an ordered list of Operation nodes, where the kind of each operation is defined by an additional label. For example, you would have an Operation node with the additional label Addition, and its next operation could be an Operation node with the label Subtraction.

A simple model could be represented like this:

enter image description here

Your Operation nodes would then reference the incoming value they are using.
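As a rough illustration of the ordered-operations idea, outside Neo4j, the sketch below treats each step as a pair of (operation label, attribute it consumes) and applies the steps in order. The names OPERATIONS, run_chain and the sample values are hypothetical and only meant to show how the order of the chain determines the result.

```python
import operator

# One entry per hypothetical Operation label used in the model.
OPERATIONS = {
    "Addition": operator.add,
    "Subtraction": operator.sub,
    "Multiplication": operator.mul,
    "Division": operator.truediv,
}


def run_chain(start_attribute, steps, values):
    """Apply the steps in order, starting from the value of start_attribute.

    Each step is (label, attribute): the label picks the operation and the
    attribute is the incoming value that the operation references.
    """
    result = values[start_attribute]
    for label, attribute in steps:
        result = OPERATIONS[label](result, values[attribute])
    return result


# Made-up sample values for the attributes in the question.
values = {"Salary_Amount": 3000, "Overtime_Amount": 500, "Tax_Amount": 700}

# Salary = Salary_Amount + Overtime_Amount - Tax_Amount
salary_steps = [("Addition", "Overtime_Amount"), ("Subtraction", "Tax_Amount")]
salary = run_chain("Salary_Amount", salary_steps, values)
print(salary)
```

The same three attributes with a different chain, for example starting from Tax_Amount and multiplying by Overtime_Amount, give a different result, which is the distinction the question asks the model to capture.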

In a more complex situation, you may want to represent groups of operations, which can define a mathematical grouping between parentheses. Again, a model can be done like this:

enter image description here

The possibilities are almost infinite.

Note that in computer science, this technique is also known as the Specification Pattern: https://www.martinfowler.com/apsupp/spec.pdf

Christophe Willemsen answered Nov 15 '22 11:11


The first option is to write the expression in Reverse Polish Notation, and store it in an ordered tree:

Salary_Amount * Tax_Amount / Overtime_Amount
=>
Salary_Amount Tax_Amount * Overtime_Amount /

enter image description here
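For reference, a Reverse Polish expression like the one above can be evaluated with a simple stack. The following is a minimal sketch, assuming space-separated tokens and only the four binary operators; evaluate_rpn and the sample values are hypothetical names chosen for illustration.

```python
import operator

BINARY_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate_rpn(expression, values):
    """Evaluate a space-separated RPN expression over named attributes."""
    stack = []
    for token in expression.split():
        if token in BINARY_OPERATORS:
            # The most recently pushed operand is the right-hand side.
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPERATORS[token](left, right))
        else:
            stack.append(values[token])  # attribute name: look up its value
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


sample = {"Salary_Amount": 1000, "Tax_Amount": 0.49, "Overtime_Amount": 100}
print(evaluate_rpn("Salary_Amount Tax_Amount * Overtime_Amount /", sample))
```

No parentheses are needed because the token order alone fixes the order of evaluation.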


The second option that comes to mind: keep the formula as text, and pass the formula and the parameter values to a scripting language to evaluate it, for example with eval in JavaScript.
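As a sketch of the second option, here is one way to evaluate a stored formula string. Note that this swaps in Python instead of JavaScript, and a small restricted evaluator built on the standard ast module instead of a plain eval, so that a stored formula cannot run arbitrary code. The function name evaluate_formula and the sample values are assumptions for illustration.

```python
import ast
import operator

# Only these binary operators are permitted in a formula.
ALLOWED_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, values):
    """Evaluate an arithmetic formula string using attribute values by name."""

    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPERATORS:
            return ALLOWED_OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return values[node.id]  # attribute name: look up its value
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value  # numeric literal
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(formula, mode="eval").body)


sample = {"Salary_Amount": 3000, "Overtime_Amount": 500, "Tax_Amount": 700}
print(evaluate_formula("Salary_Amount + Overtime_Amount - Tax_Amount", sample))
```

Parentheses and operator precedence are handled by the parser, so the stored text stays readable, at the cost that the graph itself no longer records which attributes a formula uses unless those relationships are maintained separately.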


I also recommend reading this article: Spreadsheets Are Graphs Too


Update: an idea for how to use Cypher and the APOC library to calculate formulas:

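// Parameters are written as $name; the older {name} form was removed in Neo4j 4.0.
WITH "$Salary_Amount * $Tax_Amount / $Overtime_Amount" AS Formula
// apoc.cypher.run executes the generated statement with the given parameter map.
CALL apoc.cypher.run("RETURN " + Formula + " AS value", {
  Salary_Amount: 1000,
  Tax_Amount: 0.49,
  Overtime_Amount: 100
}) YIELD value AS result
RETURN result.value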
stdob-- answered Nov 15 '22 10:11