 

database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will, for the most part, operate as follows:

  1. Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60 MB in flat-file form, but grows steeply in the DB once suitable indexes are added.
  2. Various statistics will be generated from the current day's data.
  3. Reports from these statistics will be generated and stored.
  4. Current data set will get copied into a partitioned history table.
  5. Throughout the day, the current data set (which was copied, not moved) can be queried by end users; their queries are unlikely to filter on constants, but rather on relationships between fields.
  6. Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
  7. Before the next day's upload, the current data table is truncated.

This will essentially be version 2 of our existing system.

Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. The history table in #4 is currently not partitioned, as 5.0 doesn't support partitioning. In order to get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we're writing each day to an unindexed history_queue table and then, on the weekends during our slowest time, writing the queue to the history table. The problem is that any historical queries generated during the week can then be several days behind. We can't reduce the indexes on the historical table or its queries become unusable.
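For context, the partitioned history table we're picturing would, in MySQL 5.1, look roughly like the sketch below - the column names are made up (the real table has ~30 columns), and partitioning by month means queries and the queue flush only have to touch the relevant partitions:

    -- Hypothetical history table, range-partitioned by date (MySQL 5.1 syntax).
    CREATE TABLE history (
        record_date DATE        NOT NULL,
        account     VARCHAR(30) NOT NULL,
        status      VARCHAR(10) NOT NULL,
        loaded_at   DATETIME    NOT NULL,
        KEY idx_history_account (account, record_date)
    ) ENGINE=MyISAM
    PARTITION BY RANGE (TO_DAYS(record_date)) (
        PARTITION p200801 VALUES LESS THAN (TO_DAYS('2008-02-01')),
        PARTITION p200802 VALUES LESS THAN (TO_DAYS('2008-03-01')),
        PARTITION p200803 VALUES LESS THAN (TO_DAYS('2008-04-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );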

We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release, but we're strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering whether anybody has any advice relevant to this situation. Most of the research out there revolves around web-site usage. Indexing is really our main beef with MySQL, and it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
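For anyone unfamiliar with those two features, here is roughly what they look like in PostgreSQL; the table and column names are invented purely for illustration:

    -- Partial index: only rows matching the predicate are indexed,
    -- which keeps the index small when the interesting subset is small.
    CREATE INDEX idx_current_flagged
        ON current_data (account)
        WHERE status = 'FLAGGED';

    -- Expression (function-based) index: lets a query such as
    --   WHERE lower(customer_name) = 'acme corp'
    -- use an index instead of a sequential scan.
    CREATE INDEX idx_current_name_lower
        ON current_data (lower(customer_name));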

I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?

Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.

NOTE on MyISAM vs. InnoDB for us: we were running InnoDB and found it MUCH slower, like 3-4 times slower. BUT we were also much newer to MySQL then, and frankly I'm not sure we had the DB tuned appropriately for InnoDB.

We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.

In regards to 5.1: I've heard the concerns about stability issues in 5.1. Generally I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up, given the chance to re-engineer the project.

In regards to PostgreSQL gotchas: COUNT(*) without any WHERE clause is a pretty rare case for us, so I don't anticipate that being an issue. COPY FROM isn't nearly as flexible as LOAD DATA INFILE, but an intermediate loading table will fix that. My biggest concern is the lack of INSERT IGNORE. We've often used it when building a processing table so that we could avoid inserting the same record twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used just infrequently enough for the lack of it to be tolerable.
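The usual PostgreSQL workaround for the missing INSERT IGNORE, as far as I can tell, is an INSERT ... SELECT guarded by NOT EXISTS against the target table. A sketch, assuming a staging table loaded via COPY and a key column k (all names and paths here are hypothetical):

    -- Load the day's flat file into an unindexed staging table first.
    COPY staging FROM '/path/to/daily_upload.csv' WITH CSV;

    -- Poor man's INSERT IGNORE: only insert rows whose key isn't
    -- already present in the processing table.
    INSERT INTO processing (k, v)
    SELECT s.k, s.v
    FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM processing p WHERE p.k = s.k);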

asked Feb 28 '23 by rfusca


2 Answers

My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only about 60 GB, covering roughly 21 million rows, the largest table having 16 million rows. There are an additional ~15 million rows waiting to come into the pipe, but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "Job" facility to schedule queries that would regenerate data on a daily basis suitable for use in analytics.

Running simple aggregates over the large 16-million-record table, the first thing I noticed was how sensitive it is to the amount of RAM available. An increase in RAM at one point allowed a year's worth of aggregates without resorting to sequential table scans.

If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
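To give a rough idea of what that tuning touches, the handful of postgresql.conf settings that matter most look something like the following; the numbers are placeholders only, and the right values depend entirely on your RAM and workload:

    # postgresql.conf - illustrative values, not recommendations
    shared_buffers       = 1GB      # default is tiny; give it a decent chunk of RAM
    effective_cache_size = 6GB      # planner hint: how much the OS will cache
    work_mem             = 64MB     # per-sort / per-hash memory for big aggregates
    maintenance_work_mem = 512MB    # speeds up CREATE INDEX and bulk maintenance
    checkpoint_segments  = 32       # fewer, larger checkpoints during bulk loads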

Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also require a little effort, but once set you'll find them to be effective. Partial indexes are a nice feature for isolating those records that have "edge-case" data in them; I highly recommend this feature if you are looking for exceptions in a sea of similar data.
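The workhorse for that query-planning effort is EXPLAIN ANALYZE, which tells you whether a query is really using the indexes you built (the table and columns below are only illustrative):

    -- Compare the planner's estimates with actual run times; a Seq Scan
    -- on a large table where you expected an Index Scan means the index
    -- or the query needs rethinking.
    EXPLAIN ANALYZE
    SELECT account, count(*)
    FROM current_data
    WHERE status = 'FLAGGED'
    GROUP BY account;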

Lastly, use the tablespace feature to relocate the data onto a fast drive array.
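In SQL terms that is simply (the path and names are illustrative):

    -- Create a tablespace on the fast array and move the big table onto it.
    CREATE TABLESPACE fast_disks LOCATION '/mnt/fast_array/pgdata';
    ALTER TABLE history SET TABLESPACE fast_disks;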

answered Mar 03 '23 by Avery Payne


In my practical experience I have to say that PostgreSQL had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases, in some instances 2x-3x faster); from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement there too; I haven't tested it so far.

Regarding indexes, I would recommend dropping them and only creating them again after filling the database with your data; it is much faster.
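Something along these lines (the index, table, and file names are made up):

    DROP INDEX idx_history_account;                -- drop before the bulk load
    COPY history FROM '/data/history_dump.csv' WITH CSV;
    CREATE INDEX idx_history_account               -- rebuild once, afterwards
        ON history (account, record_date);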

Further, tune the crap out of your PostgreSQL settings; there is so much gain from it. The default settings are at least sensible now; in pre-8.2 times pg was optimized for running on a PDA.

In some cases, especially if you have complicated queries, it can help to deactivate nested loops in your settings, which forces pg to use better-performing approaches for your queries.
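That setting is enable_nestloop; it can be switched off just for the session that runs the problem query:

    SET enable_nestloop = off;   -- discourage nested-loop joins for this session
    -- ... run the complicated report query here ...
    RESET enable_nestloop;       -- back to the default afterwards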

Ah, yes, did I say that you should go for PostgreSQL?

(An alternative would be Firebird, which is not as flexible, but in my experience it sometimes performs much better than MySQL and PostgreSQL.)

answered Mar 03 '23 by Mauli