 

Dealing with a massive number of rules (conditions and constraints) in CEP systems

I'm working on an application that accepts 100k+ unique inputs - for simplicity, let's assume each input is a floating-point value (a, b, c, etc.), though they could also be strings. The application has many rules/events/triggers that are related to these inputs.

Example1:

Rule[(a > b) and (c <= d)] --> [execute EventX]

Definition1: The above rule says: when the value of input 'a' is greater than that of 'b', and the value of input 'c' is less than or equal to that of 'd', execute EventX.

Example2:

Rule[x != x.prev] --> [execute EventZ]

Definition2: The above rule says: if, after a value update, the current value of input 'x' is not equal to its previous value (i.e. the value has changed), execute EventZ.

What I'm finding is that as the number of inputs increases, and as the number of rules increases, the total amount of computation required to evaluate the relevant rules and determine whether an event should be triggered is getting out of control - throwing faster and more hardware at the problem isn't scaling as expected.

At the moment, upon every input update, the associated input/variable is looked up in a hash table that maps the variable to the rules that contain it. Each of those rules is then evaluated, and if the result is true or actionable, the relevant event is triggered asynchronously.
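To make the current approach concrete, here is a rough sketch of that lookup (class and method names are simplified stand-ins, not the actual implementation):

    from collections import defaultdict

    class RuleIndex:
        """Sketch of the per-variable lookup: each variable maps to the
        rules that mention it; an update re-evaluates only those rules."""
        def __init__(self):
            self.rules_by_var = defaultdict(list)   # variable name -> [rule, ...]
            self.values = {}                        # current input values

        def add_rule(self, rule, variables):
            for v in variables:
                self.rules_by_var[v].append(rule)

        def on_update(self, var, value, trigger):
            self.values[var] = value
            for rule in self.rules_by_var[var]:
                if rule.evaluate(self.values):      # re-check every rule touching var
                    trigger(rule.event)             # dispatched asynchronously in practice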

This problem is in the realm of Complex Event Processing (CEP). Unfortunately, most of the architectures in this realm use the same deadhead approach described above, perhaps with a few refinements relating to the frequency with which things are evaluated/re-evaluated. The idea is to have a system that can react in near real-time. Distributing the rules and inputs over multiple nodes doesn't seem to work well either, as in some situations a minority of inputs appears in over 95% of the active rules.

I was hoping some fellow SO'ers out there know of better methods, data structures, or algorithms to solve this problem.

A simple idea I have in mind is that one could construct a list of dependent logical inferences.

If there are two rules that are like so:

Rule[a < b] --> [exec E1]
Rule[b >= a] --> [exec E2]

Then an update to either 'a' or 'b' should not require evaluating both rules independently, since a single comparison of 'a' and 'b' decides both (see the sketch below). But I find building such logical inference structures to be highly complex and error-prone, and difficult to test completely and rigorously.
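For that particular pair, a rough sketch of the shared evaluation (fire() is just a stand-in for the event dispatch):

    # a < b implies b >= a, so one three-way comparison settles both conditions.
    def on_update(a, b, fire):
        if a < b:
            fire("E1")     # Rule[a < b] holds
            fire("E2")     # implied: a < b also means b >= a
        elif a == b:
            fire("E2")     # Rule[b >= a] holds, Rule[a < b] does not
        # if a > b, neither rule fires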

Inputs could be representative of things such as stock prices, temperature sensors etc.

One further note: some of the inputs are temporally constrained, meaning that a rule may require a variable to be in a specific state for some amount of time (e.g. the price of MSFT > $20 for the last 30 seconds). Currently this is accomplished with a 'representing variable' (facade) whose value is either 0/1 or false/true; the facade's value is determined by a separate mechanism, usually the result of another rule being triggered.
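Roughly, such a facade behaves like this (a simplified sketch, not the actual implementation; names are illustrative):

    import time

    class DurationFacade:
        """Exposes a boolean input that becomes True only after an
        underlying condition has held continuously for hold_seconds."""
        def __init__(self, condition, hold_seconds, clock=time.monotonic):
            self.condition = condition        # e.g. lambda price: price > 20.0
            self.hold_seconds = hold_seconds  # e.g. 30.0
            self.clock = clock
            self.since = None                 # when the condition last became true
            self.value = False                # the 0/1 input the rules actually see

        def on_update(self, raw_value):
            now = self.clock()
            if self.condition(raw_value):
                if self.since is None:
                    self.since = now
                # Note: only re-checked on updates; a timer would be needed
                # to flip exactly at the 30-second mark between updates.
                self.value = (now - self.since) >= self.hold_seconds
            else:
                self.since = None
                self.value = False
            return self.value

    # msft_over_20_for_30s = DurationFacade(lambda p: p > 20.0, 30.0)
    # msft_over_20_for_30s.on_update(latest_price)  # feed the 0/1 result into the rules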

Also it should be noted that due to the near real-time constraints and the amount of data being produced per second, options using a DB with triggers and stored procedures are out of the question.

asked Apr 16 '12 by Xander Tulip

1 Answer

A couple ideas.

If the conditions for your rules are conjunctions, maintain for each unsatisfied condition one unsatisfied term, and put the rule only on the check list for that term. If that term becomes satisfied, scan the other terms to determine either that the condition is now triggered or that there's another unsatisfied term to watch. (I think I learned about this trick in the context of SAT solvers.)
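A minimal sketch of this watched-term idea, assuming illustrative Rule/Term interfaces (terms, key, holds, fire):

    from collections import defaultdict

    class WatchedRuleIndex:
        """Keep one watched *unsatisfied* term per conjunction, so updates
        to terms that are already satisfied cost nothing."""
        def __init__(self):
            self.watchers = defaultdict(list)    # term key -> rules parked on it

        def _watch_or_fire(self, rule, values):
            for term in rule.terms:
                if not term.holds(values):
                    self.watchers[term.key].append(rule)   # park rule on this term
                    return
            rule.fire()                                    # every term holds

        def add_rule(self, rule, values):
            self._watch_or_fire(rule, values)

        def on_term_satisfied(self, term_key, values):
            # The watched term just flipped to satisfied: each parked rule
            # either finds another unsatisfied term to watch, or fires.
            for rule in self.watchers.pop(term_key, []):
                self._watch_or_fire(rule, values)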

If you have terms like 10 <= x <= 50, use an interval tree instead of a hash to quickly locate the satisfied terms that are about to become unsatisfied by an update to x, and the unsatisfied terms that are about to become satisfied. (O(log n) per search, plus O(1) per result returned.) There are multidimensional generalizations if considering variables one at a time produces too many spurious hits, but they would be rather more expensive to maintain.
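A minimal sketch, assuming the third-party Python intervaltree package (any balanced interval structure would do); the term names are illustrative:

    from intervaltree import IntervalTree

    terms_for_x = IntervalTree()
    terms_for_x[10:50.001] = "term_A"   # 10 <= x <= 50 (closed upper bound fudged)
    terms_for_x[0:20] = "term_B"        # 0 <= x < 20

    def on_update_x(old_value, new_value):
        previously = {iv.data for iv in terms_for_x[old_value]}
        now = {iv.data for iv in terms_for_x[new_value]}
        newly_satisfied = now - previously     # flipped unsatisfied -> satisfied
        newly_unsatisfied = previously - now   # flipped satisfied -> unsatisfied
        return newly_satisfied, newly_unsatisfied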

If your terms aren't of that form (e.g., a <= b), make derived inputs (b - a) and use your hash strategy to keep them up to date.
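A minimal sketch of that rewrite, with illustrative plumbing around the same per-variable lookup:

    # A relational term a <= b becomes a threshold on a derived input
    # d_ab = b - a (a <= b  <=>  d_ab >= 0), so the existing indexing applies.
    values = {}
    derived = {"d_ab": ("a", "b")}            # d_ab = values["b"] - values["a"]

    def on_update(var, value, push):
        values[var] = value
        for name, (lo, hi) in derived.items():
            if var in (lo, hi) and lo in values and hi in values:
                # Recompute the derived input and route it through the normal
                # index, where the original term is now just d_ab >= 0.
                push(name, values[hi] - values[lo])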

answered Sep 28 '22 by junction