Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster comparison for many DateTime's

I have been presented a task that I'm struggling with, in terms of efficiency. I have a database that can have hundreds of thousands of transactions & people. My goal is to find people who commonly have transactions near one another (Person X has had a transaction within 10 minutes of person Y, on 5 separate occasions).

I'm struggling to wrap my head around an efficient way to approach this. The simplest approach is :

foreach(var doc in db.Transactions.OrderBy(d => d.TransactionID))
{
    foreach(var doc2 in db.Transactions.Where(d => d.TransactionID > doc.TransactionID))
    {
        if(doc2.DateCreated.IsBetween(doc.DateCreated,minutes))
        {
           // hit found  
        }
    }
} 

(TransactionID is a bigint identity). Once I have my list of hits, it would be easy to count occurrences. But this is obviously pretty poor. The run time is enter image description here which will be very slow at 1M+ transactions. I've researched some algorithms, but I can't find any that apply to my situation. Can anyone offer guidance on where to begin speeding this up?

like image 935
Jonesopolis Avatar asked Mar 18 '14 12:03

Jonesopolis


2 Answers

Few tips:

  1. Do it on the database side (e.g. stored proc) - loading and processing 1M+ records will create overhead even if the algorithm is improved.
  2. Separate all your data into buckets of size 10 minutes (assuming 10 minutes is your detection threshold). Then for each bucket you'd only need to check the adjacent ones, which should reduce the volume of comparison operations.
  3. Make sure you operate on e.g. epoch time to avoid complex datetime operations.
like image 54
decPL Avatar answered Oct 11 '22 00:10

decPL


In addition to decPL's tips, you may want to set up a Data Warehouse with transaction data, that can then be analyzed, for instance, at night. This would mean you store data about your data in a separate database, and that database is then scanned for patterns using known algorithms. This is the way services such as Amazon come up with those "People who bought this, also bought..." suggestions.

The data in the warehouse can be optimized for fast processing, so it doesn't need to follow the same format as your actual ("source") database. The output of the analytical process (Reporting) can also be in a format that is easy to process afterward, possibly using LINQ like you do in your question.

For more info see http://en.wikipedia.org/wiki/Data_warehouse and http://www.1keydata.com/datawarehousing/datawarehouse.html.

like image 3
Roy Dictus Avatar answered Oct 10 '22 23:10

Roy Dictus