Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pattern for continuous rule matching

I have a continuous stream of messages which are analyzed. The analysis returns different variables, like author, topic, sentiment, word-count and a set of distinct words. Users in the system are able to define rules, which should trigger an alert, when matched. The rules should be stored in a sql-database. A rule is a conjunction of single criteria from the message analysis, i.e. word-count > 15 && topic = 'StackOverflow' && sentiment > 2.0 && word-set contains 'great'. Each allowed rule-criteria is provided at the end of the message analysis, after which the rule validation will be triggered and which is implemented in Java.

Every message has to be checked for all the rules defined by all users in the system, which takes up a lot of computation power (there are currently 10+ messages/second and there will be 10.000+ rules to check). Is there a common pattern to speed up the matching process, maybe so that the rules can be checked in parallel, except one-by-one? Is it possible to do this in pure SQL, how would a schema for rules of different types look like?

like image 975
Thomas Avatar asked Jan 09 '13 16:01

Thomas


2 Answers

Your considerations are likely to be more than just the throughput of the matching. You need to maintain the rules, for instance.

But, let's assume a static set of rules and messages that contain all the fields needed to satisfy all of the rules. Using SQL, the structure would start with a message table. This table would have an insert trigger. The insert trigger would be responsible for matching to the rules. What is the best way to do this?

With 10+ messages per second, your processing will be inherently parallel, even when each match is single threaded. I'm not sure how much effort you need to parallelize the match. Parallelism in databases generally comes within SQL statements rather than between them.

There are all sorts of solutions. You could encode the rules as code in a giant stored procedure, for instance. This would be a nightmare to maintain, might exceed the length limits of stored procedures, and could be painfully slow.

Another crazy idea. Store the matching messages for a rule in a table, for that rule, and have a constraint only load the ones that match. Your process then looks like a zillion insert statements.

More seriously, you will go further with code such as:

select *
from rules
where . . . 

The result set would have matching rules. The where clause could be something like:

select *
from rules r
where @wordcount > coalesce(r.wordcount, 0) and
      @topic = coalesce(r.topic, @topic) and
      . . .

That is, every possible comparison for all the rules would be in the where clause. And, the rules would be pre-processed to identify which clauses they need.

You can even dispense with the external variables, and access the query directly:

select *
from rules r cross join inserted i
where i.wordcount > coalesce(r.wordcount, 0) and
      i.topic = coalesce(r.topic, @topic) and
      . . .

So, yes, this is feasible in SQL. And, you can do the matching in parallel. You just have to do work to get your rules in a format suitable for database comparisons.

like image 93
Gordon Linoff Avatar answered Oct 09 '22 06:10

Gordon Linoff


I've solved a similar problem in C# though not using SQL.

I stored the rules as serialized XML in the database, for purposes of portability.

At application startup, or when the rule table changed (forcing the rules cache to be flushed) I loaded all the rules from the database and deserialized them into their appropriate classes.

Then as data came in on each app server I executed the rules against the incoming data and for passing rules executed the appropriate action. (At the time I was executing the action in proc on the app server, but now I'd dump it into a queue.)

This has the advantage of spreading out the computation across your app cluster and not keeping it all sucking up cycles on the database machine.

like image 32
cfeduke Avatar answered Oct 09 '22 05:10

cfeduke