Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi-real-time basis and feed it into a data warehouse. (Someone is bound to ask what semi-real-time means: as frequently as I reasonably can, but I will be pragmatic. As a benchmark, let's say we are hoping for every 15 minutes.)
How much data? At peak times we are talking approximately 80-100k rows per minute hitting the OLTP side; off-peak this drops significantly, to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables, so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Why this approach?
Alternatives considered ....
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, you will get much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the highest key value from the previous extraction (0 for the first extraction). This incremental, decoupled approach avoids introducing trigger latency in the insertion datapath (be it custom triggers or modified Slony) and, depending on your setup, could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should also SET the key column to DEFAULT, i.e. the next sequence value, so that the row gets a new key and is picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
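As a rough illustration of the incremental SELECT described above, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs on its own; against PostgreSQL you would use a PostgreSQL driver, but the query shape is the same. The table name orders, its columns, and the function name extract_new_rows are hypothetical and not taken from the question.

```python
import sqlite3


def extract_new_rows(conn, last_max_key):
    """Fetch rows whose key is greater than last_max_key.

    Returns the rows plus the new high-water mark to store for the next run.
    """
    rows = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id",
        (last_max_key,),
    ).fetchall()
    # If nothing new arrived, keep the old high-water mark.
    new_max_key = rows[-1][0] if rows else last_max_key
    return rows, new_max_key


# Stand-in OLTP table (hypothetical schema) with a sequential integer key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO orders (payload) VALUES (?)", [("a",), ("b",)])

# First extraction starts from 0, as suggested in the answer.
first_batch, watermark = extract_new_rows(conn, 0)

# More rows arrive; the next extraction only picks up the new ones.
conn.execute("INSERT INTO orders (payload) VALUES ('c')")
second_batch, watermark = extract_new_rows(conn, watermark)
```

One caveat worth checking on a busy system: sequence values are assigned before commit, so with concurrent transactions a row with a lower key can become visible after a higher key has already been extracted. A small overlap window, or extracting only up to a key known to be fully committed, is one way to guard against that.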
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure that log lines are written sequentially and do not overlap or overwrite each other when multiple backends write to the log (check the Postgres source). The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres), I would not necessarily use the WALs either. The exception is if you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly, or even statement logging, since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine). Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (mostly in terms of thrashing the filesystem cache) from shipping WALs to a different machine as from running the incremental SELECT.
If you can set up a 'checksum table' that contains only the IDs and a checksum of each row, you can do a quick select not only of the new records but also of the changed and deleted records.
The checksum could be CRC32 or any other checksum function you like.
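A minimal sketch of how such a checksum comparison might work, assuming the current rows and the previously stored checksums are both available as dictionaries keyed by ID. The function names, the row layout, and the use of repr() to serialize a row are illustrative assumptions; in practice the comparison would more likely be done in SQL against the checksum table itself.

```python
import zlib


def checksum(row):
    """CRC32 over a simple text serialization of the row (an assumed format)."""
    return zlib.crc32(repr(row).encode("utf-8"))


def diff_against_checksums(current_rows, stored_checksums):
    """Classify IDs as new, changed, or deleted since the last extraction.

    current_rows:     dict mapping id -> row tuple (the live table)
    stored_checksums: dict mapping id -> checksum (the 'checksum table')
    """
    current = {row_id: checksum(row) for row_id, row in current_rows.items()}
    new = sorted(current.keys() - stored_checksums.keys())
    deleted = sorted(stored_checksums.keys() - current.keys())
    changed = sorted(
        row_id
        for row_id in current.keys() & stored_checksums.keys()
        if current[row_id] != stored_checksums[row_id]
    )
    # 'current' becomes the stored checksum table for the next run.
    return new, changed, deleted, current


# State recorded at the previous extraction (hypothetical data).
stored = {
    1: checksum(("alice", 10)),
    2: checksum(("bob", 20)),
    3: checksum(("carol", 30)),
}

# Live table now: row 2 was updated, row 3 deleted, row 4 inserted.
live = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}

new_ids, changed_ids, deleted_ids, stored = diff_against_checksums(live, stored)
```

Note that CRC32 is only 32 bits wide, so on very large tables a changed row could occasionally produce the same checksum as before; a wider hash reduces that risk at the cost of more storage.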