I am building several reports in an application and have come across a few ways of building them, and I wanted to get your take on the best/common ways to build reports that are both scalable and as real-time as possible.
First, some conditions/limits/goals:
Once a project is over (e.g. the project lasted 6 months, had a bunch of activity, but now it's over), the report should be permanently saved so subsequent retrievals just pull a pre-computed document. The reports don't need to be searchable, so once the data is in a document, we're just displaying the document. The client gets basically a JSON tree representing all the stats, charts, etc., so it can be rendered however in JavaScript.
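To make the "permanently saved" part concrete, what I have in mind is roughly one document per finished project holding the computed JSON tree (the SavedReport name and keys below are just for illustration, not something I have yet):

class SavedReport
  include MongoMapper::Document

  key :project_id, ObjectId
  key :data,       Hash      # the pre-computed JSON tree the client renders
  key :finalized,  Boolean, :default => false

  timestamps!
end

# Once a project is over, compute the report one last time and freeze it:
# SavedReport.create(:project_id => project.id, :data => project.report, :finalized => true)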
My question arises because I am trying to figure out a way to do real-time reporting on huge datasets.
Say I am reporting about overall user signup and activity on a site. The site has 1 million users, and there are on average 1000 page views per second. There is a User model and a PageView model, let's say, where User has_many :page_views. Say I have these stats:
report = {
  :users => {
    :counts => {
      :all => user_count,
      :active => active_user_count,
      :inactive => inactive_user_count
    },
    :averages => {
      :daily => average_user_registrations_per_day,
      :weekly => average_user_registrations_per_week,
      :monthly => average_user_registrations_per_month
    }
  },
  :page_views => {
    :counts => {
      :all => user_page_view_count,
      :active => active_user_page_view_count,
      :inactive => inactive_user_page_view_count
    },
    :averages => {
      :daily => average_user_page_view_registrations_per_day,
      :weekly => average_user_page_view_registrations_per_week,
      :monthly => average_user_page_view_registrations_per_month
    }
  }
}
Things I have tried:
User and PageView are both ActiveRecord objects, so everything is via SQL. I grab all of the users in chunks, something like this:
class User < ActiveRecord::Base
  class << self
    def report
      result = {}
      User.find_in_batches(:include => :page_views) do |users|
        # some calculations
        # result[:users]...
        users.each do |user|
          # result[:users][:counts][:active]...
          # some more calculations
        end
      end
      result
    end
  end
end
User and PageView are both MongoMapper::Document objects. Map-reduce is really slow to calculate on the spot, and I haven't yet spent the time to figure out how to make this work real-time-esque (checking out hummingbird). Basically I do the same thing: chunk the records, add the result to a hash, and that's it.
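For reference, the map-reduce I mean is roughly this (a sketch only, assuming the 1.x mongo driver API underneath MongoMapper; the field and collection names are made up):

# JavaScript map/reduce functions that MongoDB runs server-side.
map = <<-JS
  function() {
    emit(this.user_id, 1);
  }
JS

reduce = <<-JS
  function(key, values) {
    var total = 0;
    values.forEach(function(v) { total += v; });
    return total;
  }
JS

# Counts page views per user into an output collection; recomputing this on
# every request is the part that's too slow to be real-time.
PageView.collection.map_reduce(map, reduce, :out => 'page_view_counts_by_user')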
That is kind of the approach the Rails statistics gem takes, too. The only thing I don't like about it is the number of queries it could possibly make (I haven't benchmarked whether making 30 queries per request per report is better than chunking all the objects into memory and sorting in straight Ruby).
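For comparison, the query-per-stat style looks something like this (a sketch; the active flag and the method name are just for illustration):

def report_via_queries
  {
    :users => {
      :counts => {
        :all      => User.count,
        :active   => User.count(:conditions => { :active => true }),
        :inactive => User.count(:conditions => { :active => false })
      }
    },
    :page_views => {
      :counts => {
        :all => PageView.count
      }
    }
    # ...and so on, one aggregate query per stat, which is where the
    # ~30 queries per report come from.
  }
end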
The question I guess is, what's the best way, from your experience, to do real-time reporting on large datasets? With chunking/sorting the records in memory on every request (what I'm doing now, which I can somewhat optimize with an hourly cron, but it's not real-time), the reports take about a second to generate (complex date formulas and such), sometimes longer.
Besides traditional optimizations (better date implementation, SQL/NoSQL best practices), where can I find some practical and tried-and-true articles on building reports? I can build reports no problem; the issue is, how do you make them fast, real-time, optimized, and right? I haven't found anything, really.
The easiest way to build near real-time reports for your use case is to use caching.
So in the report method, you need to use the Rails cache:
class User < ActiveRecord::Base
  class << self
    def report
      Rails.cache.fetch('users_report', expires_in: 10.seconds) do
        result = {}
        User.find_in_batches(:include => :page_views) do |users|
          # some calculations
          # result[:users]...
          users.each do |user|
            # result[:users][:counts][:active]...
            # some more calculations
          end
        end
        result
      end
    end
  end
end
And on the client side you just request this report with Ajax polling. That way generating the reports won't be a bottleneck: generating them takes ~1 second, and many clients can easily get the latest result.
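For example, the endpoint the client polls can just render the cached report (the controller and action names here are hypothetical):

class ReportsController < ApplicationController
  # Returns the cached report hash as JSON; only the first request after the
  # cache expires pays the ~1 second generation cost.
  def users
    render :json => User.report
  end
end

Because the cache entry expires every 10 seconds, at most one request per window regenerates the report; everyone else reads the cached hash.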
For a better user experience you can store the delta between two reports and increment the report on the client side using this delta prediction, like this:
let nextPredictedReport = null;
let currentReport = null;
let predictionTimer = null;

const startDrawingPredicted = () => {
  const step = 500;
  let timePassed = 0;
  // Reset the previous prediction loop so timers don't stack up on every poll.
  clearInterval(predictionTimer);
  predictionTimer = setInterval(() => {
    timePassed += step;
    const predictedReport = calcDeltaReport(currentReport, nextPredictedReport, timePassed);
    drawReport(predictedReport);
  }, step);
};

setInterval(() => {
  doReportAjaxRequest().then((response) => {
    drawReport(response.report);
    currentReport = response.report;
    nextPredictedReport = response.next_report;
    startDrawingPredicted();
  });
}, 10000);
That's just an example of the approach; calcDeltaReport and drawReport should be implemented on your own, and this solution might have issues, as it's just an idea :)