Normally you write a query and get all the records (entities) that match it. I need to do the reverse.
Let's say I have 1M customers with a couple dozen denormalized properties:
public class Customer {
    public string Name { get; set; }
    public string Email { get; set; }
    public string Phone { get; set; }
    public DateTime Birthday { get; set; }
    public DateTime LastEmailed { get; set; }
    public DateTime LastCalled { get; set; }
    public int AgeInYears { get { return DateTime.UtcNow.Year - Birthday.Year; } }
    public int SalesTerritoryId { get; set; }
    // etc.
}
And I have 10k users who want to set up custom filters and be notified when any new customer matches the rules they defined.
Some of these rules are evaluated when the customer is created or updated.
Other rules will run periodically.
On a daily basis there will be millions of saves to customers and 5-10k custom filters to be checked against each new/updated customer.
I realize I could use Expression Trees for the users' filters, but then I end up doing something like this:
public class CustomerRule : IRule {
    public bool IsMatch(Customer customer) {
        // Expression Tree Stuff
    }
    public bool DoAction(Customer customer) {
        // Notification Stuff
    }
}

public class CustomerService {
    public void SaveOrUpdate(Customer customer) {
        IList<IRule> rules = GetRules();
        // this isn't going to handle 1M save/updates * 10k rules very well
        foreach (var rule in rules) {
            if (rule.IsMatch(customer)) {
                rule.DoAction(customer);
            }
        }
    }
}
I know others have solved this problem, but I'm having a hard time figuring out exactly what to look for. General guidance is appreciated; specific patterns, code, tools, etc. are even better. We primarily use C#, but can go outside the .NET world if need be.
I'd mention a different point than the other answers. You claim in your code that
// this isn't going to handle 1M save/updates * 10k rules very well
But did you really verify this? Consider this code:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Linq.Expressions;
using System.Threading;
using System.Threading.Tasks;

public class Program {
    static List<Func<Customer, bool>> _rules = new List<Func<Customer, bool>>();

    static void Main(string[] args) {
        foreach (var i in Enumerable.Range(0, 10000)) {
            // generate a simple expression, but joined with OR conditions because
            // in this case (on random data) it will have to check them all:
            // c => c.Name == X || c.Email == Y || c.LastEmailed > Z || territories.Contains(c.SalesTerritoryId)
            var customer = Expression.Parameter(typeof(Customer), "c");
            var name = Expression.Constant(RandomString(10));
            var email = Expression.Constant(RandomString(12));
            var lastEmailed = Expression.Constant(DateTime.Now.AddYears(-20));
            var salesTerritories = Expression.Constant(Enumerable.Range(0, 5).Select(c => random.Next()).ToArray());
            var exp = Expression.OrElse(Expression.OrElse(Expression.OrElse(
                Expression.Equal(Expression.PropertyOrField(customer, "Name"), name),
                Expression.Equal(Expression.PropertyOrField(customer, "Email"), email)),
                Expression.GreaterThan(Expression.PropertyOrField(customer, "LastEmailed"), lastEmailed)),
                Expression.Call(typeof(Enumerable), "Contains", new Type[] { typeof(int) }, salesTerritories, Expression.PropertyOrField(customer, "SalesTerritoryId")));
            // compile
            var l = Expression.Lambda<Func<Customer, bool>>(exp, customer).Compile();
            _rules.Add(l);
        }

        var customers = new List<Customer>();
        // generate 1M customers
        foreach (var i in Enumerable.Range(0, 1_000_000)) {
            var cust = new Customer();
            cust.Name = RandomString(10);
            cust.Email = RandomString(10);
            cust.Phone = RandomString(10);
            cust.Birthday = DateTime.Now.AddYears(random.Next(-70, -10));
            cust.LastEmailed = DateTime.Now.AddDays(random.Next(-70, -10));
            cust.LastCalled = DateTime.Now.AddYears(random.Next(-70, -10));
            cust.SalesTerritoryId = random.Next();
            customers.Add(cust);
        }

        Console.WriteLine($"Started. Customers {customers.Count}, rules: {_rules.Count}");
        int matches = 0;
        var w = Stopwatch.StartNew();
        // just loop
        Parallel.ForEach(customers, c => {
            foreach (var rule in _rules) {
                if (rule(c))
                    Interlocked.Increment(ref matches);
            }
        });
        w.Stop();
        Console.WriteLine($"matches {matches}, elapsed {w.ElapsedMilliseconds}ms");
        Console.ReadKey();
    }

    private static readonly Random random = new Random();

    public static string RandomString(int length) {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new string(Enumerable.Repeat(chars, length)
            .Select(s => s[random.Next(s.Length)]).ToArray());
    }
}
public class Customer {
    public string Name { get; set; }
    public string Email { get; set; }
    public string Phone { get; set; }
    public DateTime Birthday { get; set; }
    public DateTime LastEmailed { get; set; }
    public DateTime LastCalled { get; set; }
    public int AgeInYears {
        get { return DateTime.UtcNow.Year - Birthday.Year; }
    }
    public int SalesTerritoryId { get; set; }
}
Here I generate 10K rules in the form of expressions. They are simple, but not trivial: 4 conditions joined with OR, with strings, dates, and Contains. Then I generate 1M customer updates (the number of customers in your database is irrelevant - we only work with updates) and just run a loop. Guess how long it takes on my regular (non-server) PC? 4 minutes.
So all your rules for all customer updates for the whole day can be checked in just 4 minutes (on a proper server it should be at least 2x faster than that, probably more). Checking a single update against 10K rules takes a few milliseconds. Given that, you will most likely have bottlenecks anywhere else, not here. You can apply a couple of trivial optimizations on top of that, if you'd like:
Collapse identical rules. No need to check an "is birthday today" rule separately for every user who defined it.
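For that first point, here is a minimal sketch of what collapsing could look like, assuming each user's filter is available as an Expression<Func<Customer, bool>> and that the expression's ToString() is an acceptable equality key (a real implementation would want a proper expression comparer; CollapsedRule and RuleCollapser are names made up for this sketch):

using System;
using System.Collections.Generic;
using System.Linq.Expressions;

public class CollapsedRule {
    public Func<Customer, bool> Predicate;                  // compiled once per distinct filter
    public List<int> SubscriberUserIds = new List<int>();   // every user who defined an identical filter
}

public static class RuleCollapser {
    // Groups syntactically identical filters so each distinct predicate is compiled and evaluated only once.
    public static List<CollapsedRule> Collapse(IEnumerable<(int userId, Expression<Func<Customer, bool>> filter)> userRules) {
        var byKey = new Dictionary<string, CollapsedRule>();
        foreach (var (userId, filter) in userRules) {
            var key = filter.ToString(); // cheap canonical key; identical expressions render identically
            if (!byKey.TryGetValue(key, out var rule)) {
                rule = new CollapsedRule { Predicate = filter.Compile() };
                byKey[key] = rule;
            }
            rule.SubscriberUserIds.Add(userId);
        }
        return new List<CollapsedRule>(byKey.Values);
    }
}

The idea is that 10K user-defined filters often boil down to far fewer distinct predicates, and when one matches you notify all of its subscribers at once.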
Store the properties which are used in each rule, and also note which columns were updated in the Customer. Don't run rules which do not use any of the columns updated in that Customer.
But actually that might even slow you down, not speed things up, so everything should be measured.
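A sketch of that column-based filtering, assuming the rules are expression trees (so the referenced properties can be collected with an ExpressionVisitor) and that the save path can tell you which columns changed (the changedProperties set below is an assumption):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Collects the Customer property names referenced by a filter expression.
class PropertyCollector : ExpressionVisitor {
    public readonly HashSet<string> Properties = new HashSet<string>();
    protected override Expression VisitMember(MemberExpression node) {
        if (node.Member.DeclaringType == typeof(Customer))
            Properties.Add(node.Member.Name);
        return base.VisitMember(node);
    }
}

class RuleIndex {
    // property name -> rules that reference it
    readonly Dictionary<string, List<Func<Customer, bool>>> _byProperty = new Dictionary<string, List<Func<Customer, bool>>>();

    public void Add(Expression<Func<Customer, bool>> filter) {
        var collector = new PropertyCollector();
        collector.Visit(filter);
        var compiled = filter.Compile();
        foreach (var prop in collector.Properties) {
            if (!_byProperty.TryGetValue(prop, out var list))
                _byProperty[prop] = list = new List<Func<Customer, bool>>();
            list.Add(compiled);
        }
    }

    // Only evaluate rules that touch at least one changed column.
    public IEnumerable<Func<Customer, bool>> RulesFor(ISet<string> changedProperties) =>
        changedProperties
            .SelectMany(p => _byProperty.TryGetValue(p, out var list) ? list : Enumerable.Empty<Func<Customer, bool>>())
            .Distinct();
}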
Don't send notifications from the same code which does the rule check. Put them into a queue and let other processes/threads handle them. Checking rules is strictly CPU-bound work, and sending notifications (I assume, in your case) is IO-bound, so you might actually be able to do that on one machine, in one process. You also don't want to spam a given user with notifications at this rate; you will most likely send them in batches, at most one batch per minute I think, so this won't be too costly.
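A minimal in-process sketch of that split, using System.Threading.Channels; the Notification shape and SendBatchAsync are placeholders for whatever your notification side actually does:

using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public class Notification {   // hypothetical payload
    public int UserId;
    public string Message;
}

public class NotificationPump {
    private readonly Channel<Notification> _channel = Channel.CreateUnbounded<Notification>();

    // Called from the CPU-bound rule-checking loop: enqueue and return immediately.
    public void Enqueue(Notification n) => _channel.Writer.TryWrite(n);

    // Runs on its own task: drains the queue and sends notifications in batches (IO-bound).
    public async Task RunAsync() {
        var batch = new List<Notification>();
        while (await _channel.Reader.WaitToReadAsync()) {
            while (_channel.Reader.TryRead(out var n))
                batch.Add(n);
            await SendBatchAsync(batch);                    // assumed: whatever actually emails/pushes
            batch.Clear();
            await Task.Delay(TimeSpan.FromMinutes(1));      // at most one batch per minute
        }
    }

    private Task SendBatchAsync(IReadOnlyList<Notification> batch) => Task.CompletedTask; // stub
}

The same shape works with an external queue instead of an in-process channel; the important part is that the rule-checking loop never waits on IO.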
As for the customer updates themselves - you might store them in some queue (like RabbitMQ), use database notifications (like PostgreSQL's pg_notify), or just poll the database every minute to get all updates for that period. Again, the performance of the different approaches should be measured.
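For the simplest of those options (polling), the loop could look roughly like this; the loadUpdatedSince callback stands in for an assumed SELECT ... FROM customers WHERE updated_at >= @since query:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class UpdatePoller {
    private DateTime _lastPoll = DateTime.UtcNow;

    // loadUpdatedSince: assumed data access returning customers changed since a timestamp
    // checkRules: the rule-checking code from above
    public async Task RunAsync(Func<DateTime, Task<IReadOnlyList<Customer>>> loadUpdatedSince,
                               Action<Customer> checkRules) {
        while (true) {
            var since = _lastPoll;
            _lastPoll = DateTime.UtcNow;
            foreach (var customer in await loadUpdatedSince(since))
                checkRules(customer);
            await Task.Delay(TimeSpan.FromMinutes(1));
        }
    }
}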
In addition to that, this kind of task is easily parallelizable across multiple machines, so if you ever hit 100M customers - no problem, you can just add one more server (or maybe one will still be fine).