Report Generation Design Patterns in Rails?

I am building several reports in an application and have come across a few ways of building them, and I wanted to get your take on the best/common ways to build reports that are both scalable and as close to real-time as possible.

First, some conditions/limits/goals:

  1. The report should be able to handle being real time (using node.js or ajax polling)
  2. The report should update in an optimized way
    • If the report is about page views, and you're getting thousands a second, it might not be best to update the report every page view, but maybe every 10 or 100.
    • But it should still be close to real-time (so daily/hourly cron is not an acceptable alternative).
  3. The report shouldn't recalculate things it has already calculated.
    • If it has counts, it increments a counter.
    • If it has averages, maybe it can somehow update the average without grabbing all the records it's averaging every second and recalculating (not sure how to do this yet; see the running-average sketch after this list).
    • If it has counts/averages for a date range (today, last_week, last_month, etc.), and it's real-time, it shouldn't have to recalculate those averages every second/request; it should somehow do only the most minimal operation.
  4. If the report is about a record and the record's "lifecycle" is complete (say a Project, and the project lasted 6 months, had a bunch of activity, but now it's over), the report should be permanently saved so subsequent retrievals just pull a pre-computed document.

The reports don't need to be searchable, so once the data is in a document, we're just displaying the document. The client basically gets a JSON tree representing all the stats, charts, etc., so it can be rendered however needed in JavaScript.

My question arises because I am trying to figure out a way to do real-time reporting on huge datasets.

Say I am reporting on overall user signups and activity on a site. The site has 1 million users, and there are on average 1000 page views per second. There is a User model and a PageView model, let's say, where User has_many :page_views. Say I have these stats:

report = {
  :users => {
    :counts => {
      :all        => user_count,
      :active     => active_user_count,
      :inactive   => inactive_user_count
    },
    :averages => {
      :daily      => average_user_registrations_per_day,
      :weekly     => average_user_registrations_per_week,
      :monthly    => average_user_registrations_per_month,
    }
  },
  :page_views => {
    :counts => {
      :all        => user_page_view_count,
      :active     => active_user_page_view_count,
      :inactive   => inactive_user_page_view_count
    },
    :averages => {
      :daily      => average_user_page_views_per_day,
      :weekly     => average_user_page_views_per_week,
      :monthly    => average_user_page_views_per_month,
    }
  },
}

Things I have tried:

1. User and PageView are both ActiveRecord objects, so everything is via SQL.

I grab all of the users in chunks, something like this:

class User < ActiveRecord::Base
  class << self
    def report
      result = {}
      User.find_in_batches(:include => :page_views) do |users|
        # some calculations
        # result[:users]...
        users.each do |user|
          # result[:users][:counts][:active]...
          # some more calculations
        end
      end
      result
    end
  end
end

2. Both records are MongoMapper::Document objects

Map-reduce is really slow to calculate on the spot, and I haven't yet spent the time to figure out how to make this work real-time-esque (checking out hummingbird). Basically I do the same thing: chunk the records, add the result to a hash, and that's it.
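
For the counter goal above, translated to Mongo terms: the usual alternative to on-the-fly map-reduce is a pre-aggregated stats document that gets bumped atomically with $inc on each event (or every Nth event). This is only a sketch against the old Ruby driver API; the Stat document and the field names are made up:

class Stat
  include MongoMapper::Document
end

# $inc never re-reads anything; :upsert creates the document on first hit.
Stat.collection.update(
  { :_id => 'site_stats' },
  { '$inc' => { 'page_views.counts.all' => 1 } },
  :upsert => true
)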

3. Each calculation is its own SQL/NoSQL query

This is kind of the approach the Rails statistics gem takes. The only thing I don't like about it is the number of queries it could make (I haven't benchmarked whether making 30 queries per request per report is better than chunking all the objects into memory and sorting in straight Ruby). A sketch of what I mean is below.
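
For illustration, the counts portion of the report done in this style (a sketch assuming a boolean active column on users; this is my guess at the shape, not the statistics gem's actual code):

# Each statistic is one aggregate query; no records are loaded into Ruby.
def self.report_counts
  {
    :users => {
      :counts => {
        :all      => User.count,
        :active   => User.where(:active => true).count,
        :inactive => User.where(:active => false).count
      }
    },
    :page_views => {
      :counts => { :all => PageView.count }
    }
  }
end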

Question

The question, I guess, is: from your experience, what's the best way to do real-time reporting on large datasets? With chunking/sorting the records in memory on every request (what I'm doing now, which I can somewhat optimize with an hourly cron, but that's not real-time), the reports take about a second to generate (complex date formulas and such), sometimes longer.

Besides traditional optimizations (a better date implementation, SQL/NoSQL best practices), where can I find practical, tried-and-true articles on building reports? I can build reports no problem; the issue is how to make them fast, real-time, optimized, and right. I haven't really found anything.

asked Jan 18 '11 by Lance


1 Answer

The easiest way to build near real-time reports for your use case is to use caching.

So in the report method, you use the Rails cache:

class User < ActiveRecord::Base
  class << self
    def report
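      # Recompute at most once per expiry window; all other requests read the cached hash.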
      Rails.cache.fetch('users_report', expires_in: 10.seconds) do
        result = {}
        User.find_in_batches(:include => :page_views) do |users|
          # some calculations
          # result[:users]...
          users.each do |user|
            # result[:users][:counts][:active]...
            # some more calculations
          end
        end
        result
      end
    end
  end
end
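
If many clients poll at once, there can be a dog-pile when the cache entry expires, since generation takes about a second. fetch's race_condition_ttl option smooths this out: one request recomputes while the others briefly keep getting the stale value. For example:

# While one request recomputes after expiry, concurrent requests keep
# reading the slightly stale report for up to race_condition_ttl seconds.
Rails.cache.fetch('users_report',
                  expires_in: 10.seconds,
                  race_condition_ttl: 2.seconds) do
  # ... same report computation as above ...
end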

And on the client side you just request this report with AJAX polling. That way generating the reports won't be a bottleneck: generating one takes ~1 second, and many clients can easily get the latest cached result.

For a better user experience you can store the delta between two consecutive reports and advance the report on the client side using that delta as a prediction, like this:

let nextPredictedReport = null;
let currentReport = null;
let predictionTimer = null;

const startDrawingPredicted = () => {
  const step = 500;
  let timePassed = 0;
  // Restart the prediction loop on each poll so intervals don't stack up.
  clearInterval(predictionTimer);
  predictionTimer = setInterval(() => {
    timePassed += step;
    const predictedReport = calcDeltaReport(currentReport, nextPredictedReport, timePassed);
    drawReport(predictedReport);
  }, step);
};

setInterval(() => {
  doReportAjaxRequest().then((response) => {
    drawReport(response.report);
    currentReport = response.report;
    nextPredictedReport = response.next_report;
    startDrawingPredicted();
  });
}, 10000);

That's just an example of the approach: calcDeltaReport, drawReport, and doReportAjaxRequest are yours to implement, and this solution might have issues, as it's just an idea :)

answered Nov 09 '22 by mpospelov