Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope so I'll be using redis to keep some of that data to make decisions, expunging them when it makes sense. I was going to persist all of the entities, including events in Postgres for "at rest" storage for now.

In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.

My question is:

Is it even normal to use something like Redshift as a backend for web applications or does it typically play more of a passive role? If not is it realistic to think I can scale the Postgres enough for the event data to start with only that? Also if not, does the "window of data and archival" method make sense?

EDIT Here are some things I've seen before writing the post:

  • Some say "yes go for it" regarding the should I use Redshift for this question.
  • Some say "eh not performant enough for most web apps" and support the front it with a postgres database camp.
like image 864
alph486 Avatar asked Nov 17 '14 19:11

alph486


People also ask

Can Redshift be used as a database?

Redshift can also be used for traditional data warehousing. But solutions like the S3 data lake would likely be better suited for that. Redshift can be used to perform operations on data in S3, and save the output in S3 or Redshift.

Can I use Redshift as OLTP?

It is common to connect an application framework like Django to Amazon Redshift. This is useful when using Redshift data in your application, i.e. in an OLTP scenario. Since Amazon Redshift is an OLAP database, it may not handle these queries well.

What can Redshift be used for?

What is AWS Redshift Used for? AWS Redshift is a data warehouse product built by Amazon Web Services. It's used for large scale data storage and analysis, and is frequently used to perform large database migrations.

Is Redshift good for real time data?

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse.


2 Answers

Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.

It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.

You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.

If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.

A few other good reasons you probably shouldn't use Redshift directly:

  • OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.

  • Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.

  • Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.

Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.


Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.

like image 128
Craig Ringer Avatar answered Oct 09 '22 22:10

Craig Ringer


In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:

Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel OLAP mostly-read workload.

That can result in all kinds of concurrency-related failures that would not have been failures on typical DB that support read-committed isolation by default.

like image 30
wrschneider Avatar answered Oct 09 '22 22:10

wrschneider