
How can I create on-demand reports once they become too slow for our DB?

Our App / Data

We have a Python app whose Users participate in Transactions; transactions have Commissions, Fees, etc., have Contacts that receive EmailMessages, and have Activities that take place on them (documents uploaded, status changes, etc.).

Our Reports

We generate spreadsheet reports for our customers that detail such info as the number of documents uploaded to transactions, sums of various types of commissions earned, fees charged, activities, etc. These reports, in some cases, provide statistics for a customer's account, for each month in a given year (each month in its own row in the spreadsheet).

Our Problem

We've reached a point with our web app where certain spreadsheet reports that we generate are taking minutes to produce (all waiting on Postgres), despite efforts to optimize queries, add indexes, and despite that we use only SSDs and have enough RAM to fit the database into memory. Essentially, we've reached a scale where some basic reports are becoming too expensive to run as simple aggregation queries against our production database.

Solutions I Am Considering

  1. Denormalize statistics into existing tables in Postgres
  2. Cache statistics in Memcached
  3. Reduce / simplify queries by moving some of the crunching into Python
  4. Run expensive reports in a queue and notify admins when they're ready
  5. Store statistics in separate reporting tables (star schema, etc.)
  6. Sharding

I already use options 1-4 above to an extent, but I would like to explore more options. Also, I'd like to stop using option 4 entirely, if possible, and I'm not too keen on implementing option 5 (vs simply using something like Redshift). Option 6 is a great option in some cases, but it's not something we're prepared to take on at the moment.
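Since options 1 and 2 are already in partial use, here is a minimal sketch of the cache-aside pattern behind option 2. A plain dict stands in for Memcached so the snippet is self-contained; in the real app a client such as pymemcache would be used with the same get/set pattern, and the key format and function names here are illustrative, not taken from the actual codebase.

```python
# Cache-aside sketch for option 2 (cache statistics in Memcached).
# A dict stands in for the Memcached client; swap in pymemcache in practice.
cache = {}

def monthly_commission_total(account_id, year, month, compute_fn):
    """Return the cached statistic, computing and caching it on a miss."""
    key = f"commissions:{account_id}:{year}-{month:02d}"
    value = cache.get(key)
    if value is None:
        value = compute_fn(account_id, year, month)  # the expensive Postgres query
        cache[key] = value                           # serve from cache next time
    return value
```

The first call for a given account/month pays the full query cost; subsequent calls return the cached value until it is invalidated or expires.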

Where Should I Look?

I started actually looking into Redshift, but something that threw me for a loop this morning was reading (here) that "It is not a real-time analysis engine." Does this also mean "it is not useful for generating reports within a single web request", or is it more likely that this blog is stating that it's not useful for real-time apps (online games, etc.)?

I've also looked at Quicksight, but it appears to be more for building business dashboards for ourselves, rather than generating reports for our users.

How would you approach this problem, given the above information? Is Redshift the obvious answer and my above concerns about not good for real-time are moot? Is there some other service or tool or methodology that would make more sense for you in a situation like this?

asked Dec 26 '17 by orokusaki
1 Answer

It definitely means Redshift is not good for real-time loading and reporting. Redshift is a column-based database, so writes to it are (relatively) expensive, whereas reads are lightning fast compared to a row-based database (like MySQL).

What this means is that Redshift is ideal for queries that need to read a large amount of data, but you should load to Redshift in batches.

I have used Redshift many times for something like your use case. The production data is cloned into Redshift multiple times per day (say every 30 minutes, incrementally; there are many vendors that could do this for you). Whenever a report is needed, the query hits Redshift instead of a production database. Not only will the query run faster, it won't lock up your production db.
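The incremental part of that cloning is usually done with a watermark: each batch copies only rows changed since the previous run. Here is a minimal sketch of that idea in pure Python; the `updated_at` column, row shapes, and function names are illustrative assumptions, and a real pipeline would COPY the batch into Redshift (typically via S3) rather than return it.

```python
# Watermark-based incremental loading sketch.
# Each run picks up only rows modified after the last successful load.
from datetime import datetime

def rows_since(rows, watermark):
    """Select rows modified after the last successful load."""
    return [r for r in rows if r["updated_at"] > watermark]

def run_batch(rows, watermark):
    batch = rows_since(rows, watermark)
    # In a real pipeline: stage `batch` to S3, then COPY into Redshift.
    new_watermark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_watermark
```

Persisting `new_watermark` after each successful run is what makes the load resumable and idempotent across the 30-minute cycles.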

Additionally, if the query return times are still not fast enough for your liking, you can set up data pipelines to create summary tables. Instead of querying raw transaction data for every report, you could hit these summary tables.

e.g.

SELECT date(transaction_date) as day, count(1) as transactions
FROM transactions
GROUP BY day 
ORDER BY day

could just become

SELECT day, transactions
FROM transactions_summary_by_day
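The pipeline step that populates such a summary table can be sketched in pure Python, to make the precomputation explicit. In Redshift this would be a scheduled `INSERT ... SELECT` (or table rebuild) rather than application code, and the row shape below is an assumption for illustration.

```python
# Summary-table precomputation sketch: the Python analogue of
# GROUP BY date(transaction_date) from the query above.
from collections import Counter

def build_daily_summary(transactions):
    """Roll raw transactions up into (day, transaction_count) rows."""
    counts = Counter(t["transaction_date"].date() for t in transactions)
    return sorted(counts.items())
```

Reports then read these small rollup rows instead of scanning the raw transaction data.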

The tradeoff is latency. Since you are not writing to Redshift constantly, any report pulled from Redshift will only have data as of the most recent write batch. Maybe that's 30 minutes, maybe that's 1 day; it depends on what you set up. Summary tables add further latency on top of that, since they only reflect the Redshift data that had been loaded the last time the pipeline ran.

If your users need true "real-time" reporting this might be a dealbreaker. But if they work in terms of days or weeks, having an hour or so of latency is worth it to get quick-loading reports.

answered Nov 11 '22 by ScottieB