I have an application in which many companies post information. The data from each company is self-contained; there is no overlap between companies.
Performance-wise, is it better to give each company its own database, give each one its own schema within a shared database, or keep everything in shared tables? It's a web-based application with persistent connections.
My thoughts:
I'd recommend searching for info on the PostgreSQL mailing lists about multi-tenanted design. There's been lots of discussion there, and the answer boils down to "it depends". There are trade-offs every way between guaranteed isolation, performance, and maintainability.
A common approach is to use a single database with one schema (namespace) per customer, each containing the same table structure, plus a shared or common schema for data that's the same across all of them. A PostgreSQL schema is like a MySQL "database" in that you can query across different schemas, but they're isolated by default. With each customer's data in its own schema you can use the search_path setting, usually via ALTER USER customername SET search_path = customerschema, sharedschema, to ensure each customer sees their data and only their data.
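A minimal sketch of that setup, assuming hypothetical names customerschema, sharedschema, and a customername login role:

    -- One schema per customer, plus a shared schema for common data.
    CREATE SCHEMA sharedschema;
    CREATE SCHEMA customerschema;

    -- Resolve unqualified table names against the customer's own schema
    -- first, then the shared schema.
    ALTER USER customername SET search_path = customerschema, sharedschema;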
For additional protection, you should REVOKE ALL ON SCHEMA customerschema FROM public, then GRANT ALL ON SCHEMA customerschema TO thecustomer, so they're the only one with any access to it, doing the same for each of their tables.
Your connection pool can then log in with a fixed user account that has no granted access to any customer schema but has the right to SET ROLE to become any customer. (Do that by giving it membership of each customer role with NOINHERIT set, so rights have to be explicitly claimed via SET ROLE.) On check-out, the connection should immediately SET ROLE to the customer it's currently operating as. That lets you avoid the overhead of making new connections for each customer while maintaining strong protection against programmer error leading to access to the wrong customer's data. So long as the pool does a DISCARD ALL and/or a RESET ROLE before handing connections out to the next client, that's going to give you very strong isolation without the frustration of individual connections per user.
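A sketch of what that might look like, assuming a hypothetical pooluser login role and one role per customer:

    -- The pool's login role owns nothing and inherits nothing; it can only
    -- claim a customer's rights explicitly via SET ROLE.
    CREATE ROLE pooluser LOGIN NOINHERIT;
    GRANT customername TO pooluser;  -- repeat for each customer role

    -- On connection check-out, before running any queries for a customer:
    SET ROLE customername;
    -- Note: per-role settings from ALTER USER ... SET search_path apply at
    -- login, not on SET ROLE, so also SET search_path here if needed.

    -- ... run that customer's queries ...

    -- On check-in, before the connection is reused by anyone else:
    DISCARD ALL;  -- resets the role along with prepared statements, temp tables, etc.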
If your web app environment doesn't have a decent connection pool built in (say, you're using PHP with persistent connections) then you really need to put a good connection pool in place between Pg and the web server anyway, because too many connections to the backend will hurt your performance. PgBouncer and PgPool-II are the best options, and handily can take care of doing the DISCARD ALL and RESET ROLE for you during connection hand-off.
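For example, PgBouncer's server_reset_query setting controls what runs when a server connection is released back to the pool. A minimal pgbouncer.ini sketch, where the host, port, and file paths are placeholders:

    [databases]
    appdb = host=127.0.0.1 port=5432 dbname=appdb

    [pgbouncer]
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = session
    ; Run when a connection is handed back, so the next client
    ; starts with a clean session and the default role.
    server_reset_query = DISCARD ALL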
The main downside of this approach is the overhead of maintaining that many tables, since your base set of non-shared tables is cloned for each customer. It adds up as customer numbers grow, to the point where examining the sheer number of tables during autovacuum runs starts to get expensive and any operation that scales with the total number of tables in the DB slows down. This is more of an issue if you're thinking of having many thousands or tens of thousands of customers in the same DB, but I strongly recommend you do some scaling tests with this design using dummy data before committing to it.
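One way to run such a test is to generate a few thousand dummy schemas and then watch how autovacuum and catalog-heavy operations behave. A rough sketch, assuming a hypothetical base table sharedschema.orders to clone:

    -- Create 5,000 dummy customer schemas, each with a clone of one base table.
    DO $$
    BEGIN
        FOR i IN 1..5000 LOOP
            EXECUTE format('CREATE SCHEMA customer_%s', i);
            EXECUTE format(
                'CREATE TABLE customer_%s.orders (LIKE sharedschema.orders INCLUDING ALL)',
                i);
        END LOOP;
    END
    $$;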
The ideal approach is likely to be single tables with automatic row-level security controlling tuple visibility, but unfortunately that's something PostgreSQL doesn't have yet. It looks like it's on the way thanks to the SEPostgreSQL work adding suitable infrastructure and APIs, but it's not in 9.1.