 

Bulk insert of hundreds of millions of records

What is the fastest way to insert 237 million records into a table that has rules (for distributing data across child tables)?

I have tried or considered:

  1. Insert statements.
  2. Transactional inserts (BEGIN and COMMIT).
  3. The COPY FROM command.
  4. http://pgbulkload.projects.postgresql.org/

Inserts are too slow (four days) and COPY FROM ignores rules (and has other issues).

Example data:

station_id,taken,amount,category_id,flag
1,'1984-07-1',0,4,
1,'1984-07-2',0,4,
1,'1984-07-3',0,4,
1,'1984-07-4',0,4,T

Table structure (with one rule included):

CREATE TABLE climate.measurement
(
  id bigserial NOT NULL,
  station_id integer NOT NULL,
  taken date NOT NULL,
  amount numeric(8,2) NOT NULL,
  category_id smallint NOT NULL,
  flag character varying(1) NOT NULL DEFAULT ' '::character varying
)
WITH (
  OIDS=FALSE
);
ALTER TABLE climate.measurement OWNER TO postgres;

CREATE OR REPLACE RULE i_measurement_01_001 AS
    ON INSERT TO climate.measurement
    WHERE date_part('month'::text, new.taken)::integer = 1 AND new.category_id = 1
    DO INSTEAD
      INSERT INTO climate.measurement_01_001 (id, station_id, taken, amount, category_id, flag)
      VALUES (new.id, new.station_id, new.taken, new.amount, new.category_id, new.flag);

The data was originally in MySQL, but must be switched to PostgreSQL for performance reasons (and to leverage the PL/R extension).

Thank you!

Asked May 16 '10 by Dave Jarvis

2 Answers

Split your input into separate files outside the database and upload each one using COPY, rather than relying on the rule to distribute them. If the rule you give is any example, that's a trivial text transformation to apply. Also, splitting up front will let you load the split files in parallel if your disk system is up to it.

Seriously, don't rely on the rule to do this distribution for a bulk load. It's practically always the case that bulk load and transactional load need different approaches, unless you're prepared to brute-force one or the other (and, usually, wait).

For instance, your rule uses date_part() to extract the month from the date, so in order to determine the child table, Postgres needs to analyse the date string, convert it to a timestamp, and then convert the timestamp back to a calendar date just to get the month field back out again. But if you're writing something to do this upfront, you can just do substr($date,5,2) (or equivalent): which do you think will be faster?
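Purely as an illustration (not part of the original answer), here is the same comparison in SQL; note that SQL's substr() is 1-based, so the 0-based substr($date,5,2) above becomes position 6:

-- Illustration only: both expressions pick out the month of a sample date,
-- one by parsing the date, the other by slicing the text.
SELECT substr('1984-07-04', 6, 2)                  AS month_from_text,  -- '07'
       date_part('month', DATE '1984-07-04')::int  AS month_from_date;  -- 7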

It's also an opportunity to clean up the data format so COPY will accept it. Note that you can specify the columns with the COPY command: if you weren't doing that with that schema and example file, you'd get errors due to the extra "id" column on the front. ("COPY FROM ... WITH CSV HEADER" may have figured that out, but maybe not: the HEADER option may just make it skip the first line.)
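For example, a sketch only, with a hypothetical file path; it assumes the split step has already cleaned up the quoting and the empty flag fields, since CSV treats an unquoted empty field as NULL and flag is NOT NULL:

-- Load a pre-split file straight into its child table, bypassing the rules
-- on the parent. The explicit column list skips "id" so the bigserial
-- default fills it in; HEADER skips the first line of the sample file.
COPY climate.measurement_01_001 (station_id, taken, amount, category_id, flag)
    FROM '/tmp/split/measurement_01_001.csv'
    WITH CSV HEADER;

Several of these COPY commands, one per child table, can then be run from separate connections to load the split files in parallel.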

I've just loaded about 280e6 rows into a PostgreSQL instance myself in a few hours, so it's certainly not impossible. For this initial load, I've turned fsync=off; the plan is to load the backlog and then turn it back on again for regular daily loads. I had to set checkpoint_segments=40 to avoid getting checkpoint warnings in the logs. This is just being loaded onto my dev machine: I'm using a dedicated disk for the database, which is different from the disk used for the xlogs (i.e. I created a tablespace on the big disk and created the database inside that tablespace). The instance has shared_buffers set to 1GB and checkpoint_completion_target set to 0.5. I tried loading some of the partitions in parallel and it didn't provide much improvement, so I suspect the slow disk is the bottleneck rather than the DB itself.
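For reference, a sketch of that setup (the tablespace path and database name are invented; the postgresql.conf lines need a restart or reload, and fsync=off should only stay on for the initial backlog load):

-- In postgresql.conf:
--   fsync = off
--   checkpoint_segments = 40
--   shared_buffers = 1GB
--   checkpoint_completion_target = 0.5

-- Put the database on the dedicated data disk via a tablespace:
CREATE TABLESPACE bulk_disk LOCATION '/mnt/bigdisk/pg_tblspc';
CREATE DATABASE climate_db TABLESPACE bulk_disk;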

Just another 1.7e9 rows to go... should be finished tomorrow sometime I hope.

Answered by araqnid


  1. Create the parent table without any indexes, only the columns and their types (CREATE TABLE some_data (c_1 int, c_2 varchar, ...)).
  2. Create a sequence for numbering the new data tables.
  3. Take a new id from the sequence.
  4. Create a new table for the real data with the LIKE keyword (CREATE TABLE some_data_X (LIKE some_data)).
  5. Insert the real data into some_data_X with COPY in binary format.
  6. Create indexes, time constraints, etc. (put your cores to work by using multiple connections to PostgreSQL).
  7. Make the new table inherit the parent table.
  8. Now it's ready to SELECT! This way I have achieved 400,000-500,000 inserts per second, including index creation, on a 10-column table (2 Xeons, 24 cores, 24 GB of memory, SSD). A sketch of these steps in SQL follows below.
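A minimal sketch of those steps, with hypothetical table and column names (assume the sequence hands out 7 as the next table number):

-- 1. Parent table: columns and types only, no indexes.
CREATE TABLE some_data (c_1 int, c_2 varchar);

-- 2./3. Sequence that numbers the child tables; take the next value.
CREATE SEQUENCE some_data_seq;
SELECT nextval('some_data_seq');

-- 4. Child table with the same structure, created via LIKE.
CREATE TABLE some_data_7 (LIKE some_data);

-- 5. Bulk-load the real data; the binary file must have been produced by
--    COPY ... TO ... WITH BINARY with the same column layout (plain CSV
--    works here as well).
COPY some_data_7 FROM '/tmp/some_data_7.bin' WITH BINARY;

-- 6. Indexes and constraints only after loading; run several of these from
--    separate connections to keep more cores busy.
CREATE INDEX some_data_7_c_1_idx ON some_data_7 (c_1);

-- 7. Attach the loaded table to the parent.
ALTER TABLE some_data_7 INHERIT some_data;

-- 8. Query through the parent.
SELECT count(*) FROM some_data;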

Bonus: in a separate thread, remove old data (the some_data_X with the smallest X): a huge circular buffer with indexing!
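A hypothetical version of that cleanup (again with invented names), detaching and dropping the oldest child so the set of partitions behaves like a circular buffer:

-- Run from a separate connection: drop the partition with the smallest X.
ALTER TABLE some_data_3 NO INHERIT some_data;
DROP TABLE some_data_3;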

Answered by Oliver Twist