
What are the minimum hardware requirements & setup for a PostgreSQL cluster with a Rails frontend, with a lot of small writes and long reads?

Background:

Trying to figure out the minimum requirements for an app I'm building. I'm fairly fluent in MySQL and PostgreSQL for a developer, but I'm not a DBA, hence my question. I'm building a mobile app that will talk to a remote API, and I need to figure out the requirements for that API. At this point this is a hobby project and the mobile app is going to be free, so I don't have a big budget and need to estimate my requirements as closely as possible.

Application requirements:

The remote API is built in Rails, providing web & JSON interfaces, and stores its data in a PostgreSQL cluster. The mobile app will send a lot of short writes: about 1 per minute × 20,000 app installations. Most of the reads are going to be report-style, so longer reads that don't happen very often, maybe once or twice a day per user. So the DB mostly needs to be optimized for writes. Read actions can always be redirected to a replica cluster/server, as at this point they don't need to be real-time; a one-day delay is fine.
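For concreteness, here is a minimal sketch of what the per-minute write endpoint could look like in Rails; the controller, model, and header names are assumptions for illustration, not something from the actual app:

    # Hypothetical Rails endpoint for the per-minute write.
    # A device authenticates with an API token and posts a tiny JSON payload.
    class Api::PointsController < ApplicationController
      skip_before_action :verify_authenticity_token # JSON API; no CSRF token

      def create
        device = Device.find_by(api_token: request.headers["X-Api-Token"])
        return head :unauthorized unless device

        point = device.points.create(
          recorded_at: params[:timestamp],
          latitude:    params[:lat],
          longitude:   params[:lon]
        )
        point.persisted? ? head(:created) : head(:unprocessable_entity)
      end
    end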

More details, in response to questions in the comments:

1) Writes are very small: I'll be sending back some kind of auth token (like an API key) and very little data; we are talking less than 1 KB per write: a timestamp and GPS coordinates, maybe something else eventually, but I doubt it. I don't envision any large data like pictures or anything like that. It's going to be similar to a running/jogging/biking tracking app.

2) Scaling up? Hmm. 200,000–400,000 apps max, if it really takes off within the first 2 years.

3) Data is fairly critical. The whole point is to be able to run accurate reports once the data is collected. There are two options, however, to mitigate the issue:

  • I can estimate it based on Google Maps data and the last known points (right before data is lost, and right after the connection is reestablished).
  • Data is first saved on the phone in SQLite storage, and once a day (or at app startup) it's synced with the server and verified. Once verification/synchronization is successful, data on the phone can be rotated (e.g., anything older than 1 month can be wiped off the phone); see the batch-sync sketch just below this list.
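Since the phone already buffers points in SQLite, the daily sync can arrive as one batched request instead of ~1,440 tiny writes. A minimal sketch of such an endpoint, with all names hypothetical (insert_all is the Rails 6+ bulk-insert API, which issues a single multi-row INSERT):

    # Hypothetical batch-sync endpoint for the once-a-day upload.
    class Api::SyncsController < ApplicationController
      skip_before_action :verify_authenticity_token

      def create
        device = Device.find_by(api_token: request.headers["X-Api-Token"])
        return head :unauthorized unless device

        rows = params.require(:points).map do |p|
          { device_id:   device.id,
            recorded_at: p[:timestamp],
            latitude:    p[:lat],
            longitude:   p[:lon] }
        end

        Point.insert_all(rows) # one round trip, one multi-row INSERT
        head :created
      end
    end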

Actual Question Details

So my question is to someone who has dealt with apps on this scale: what was your initial PostgreSQL setup like, both cluster-configuration-wise and hardware-(cloud-)wise, and how easy/difficult was it to scale?


To prevent irrelevant suggestions and answers:

NoSQL alternatives

I considered NoSQL alternatives like CouchDB, MongoDB, etc. Riak actually came out the winner, considering that it's easy to manage for a one-man team and I only need 3 DB servers to have a working replicating cluster. But after mapping out my app, I figured that NoSQL is not a good fit for this app; it belongs in the realm of RDBMSs.

Other SQL options

Considering my non-existent budget, I didn't even consider SQL Server, Oracle, and such. MySQL is the only other real alternative, but I need hstore, and replication is easier to implement right now in PostgreSQL, IMHO.
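For illustration, here is roughly how that hstore requirement could look in a Rails migration; the table and column names are assumptions, with fixed columns for the known fields plus an hstore column for the "maybe something else eventually" data:

    # Hypothetical migration: enable hstore and create the points table.
    class CreatePoints < ActiveRecord::Migration[6.0]
      def change
        enable_extension "hstore" # extension must be available in the PostgreSQL install

        create_table :points do |t|
          t.references :device,      null: false
          t.datetime   :recorded_at, null: false
          t.decimal    :latitude,    precision: 9, scale: 6
          t.decimal    :longitude,   precision: 9, scale: 6
          t.hstore     :extra # schemaless key/value bag for future fields
          t.timestamps
        end
      end
    end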

asked Aug 15 '13 by konung

1 Answer

This is good news:

Data is first saved on the phone in SQLite storage...

So we're not having to cope with bursts of small writes; we can batch updates together. What's more, we can reject them occasionally and the app can try again later. Also good: we can probably rent monthly rather than hourly (cheaper!).

That means our limit is purely down to maximum sustainable disk I/O. Now, you mention "the cloud", which complicates things. Cheap disk I/O is typically poor for (any type of) database, and the good stuff is expensive.

Some back-of-envelope calculations...

20,000 apps @ 1 kB/min ~ 20 MB/min ~ 333 kB/sec
200,000 apps @ 1 kB/min ~ 200 MB/min ~ 3.3 MB/sec

Now, you'll be writing to the WAL (transaction log) first and then to the tables, and you need to allow for your reporting, but that's not much at all. If your disk requirements increase much, then you might well be better off with a couple of managed, real machines with their own disks.

So: script yourself a PostgreSQL server setup. I find Ansible fairly easy to get going with. Add some test scripts to simulate different numbers of requests and batch sizes. You should be able to spin up a VM, run a batch of tests, and get some real figures in a couple of hours per provider.
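As a starting point for those test scripts, here is a hypothetical load generator in Ruby that posts concurrent batched uploads and reports throughput; the endpoint URL, auth header, device count, and batch size are all assumptions to be tuned per test run:

    # Hypothetical load generator: simulate DEVICES phones each uploading
    # a day's batch of points, then report sustained rows/sec.
    require "net/http"
    require "json"
    require "uri"

    API     = URI("https://api.example.test/api/syncs") # assumed endpoint
    DEVICES = 200    # concurrent simulated phones
    BATCH   = 1_440  # one point per minute for a day

    def batch_payload
      now = Time.now.to_i
      { points: Array.new(BATCH) { |i|
          { timestamp: now - i * 60,
            lat: 40.0 + rand / 100,
            lon: -83.0 + rand / 100 }
        } }.to_json
    end

    started = Time.now
    threads = Array.new(DEVICES) do |n|
      Thread.new do
        Net::HTTP.start(API.host, API.port, use_ssl: API.scheme == "https") do |http|
          req = Net::HTTP::Post.new(API, { "Content-Type" => "application/json",
                                           "X-Api-Token"  => "device-#{n}" })
          req.body = batch_payload
          http.request(req)
        end
      end
    end
    threads.each(&:join)

    elapsed = Time.now - started
    puts format("%d uploads (%d rows) in %.1fs => %.1f rows/sec",
                DEVICES, DEVICES * BATCH, elapsed, DEVICES * BATCH / elapsed)

Running it against each candidate VM/provider gives comparable real numbers for the back-of-envelope figures above.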

answered Nov 10 '22 by Richard Huxton