VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as below (simplified example; the real table has over 60 fields):

CREATE TABLE "fact_table" (
  "pk_a" bigint                 NOT NULL ENCODE lzo,
  "pk_b" bigint                 NOT NULL ENCODE delta,
  "d_1"  bigint                 NOT NULL ENCODE runlength,
  "d_2"  bigint                 NOT NULL ENCODE lzo,
  "d_3"  character varying(255) NOT NULL ENCODE lzo,
  "f_1"  bigint                 NOT NULL ENCODE bytedict,
  "f_2"  bigint                     NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );

The table is distributed by a high-cardinality dimension.

The table is sorted by a pair of fields that increment in time order.

The table contains over 2 billion rows and uses ~350 GB of disk space; both figures are per node.


Our hourly housekeeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
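For illustration, the UPDATE-based version of that pass looked roughly like the sketch below; the staging tables and the columns touched are placeholders, not our actual schema:

UPDATE fact_table
SET    f_1 = s.f_1,
       f_2 = s.f_2
FROM   staging_updates s           -- hypothetical staging table of changed rows
WHERE  fact_table.pk_a = s.pk_a
AND    fact_table.pk_b = s.pk_b;

INSERT INTO fact_table             -- the ~100k new rows
SELECT * FROM staging_inserts;     -- hypothetical staging table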

Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes over 6 hours

We can see from SELECT * FROM svv_vacuum_progress; that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.
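The progress view also reports the current phase and an estimated time remaining, which is where the 6-hour merge shows up. Using the view's documented columns:

SELECT table_name, status, time_remaining_estimate
FROM   svv_vacuum_progress;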


Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all the records from (1) or (2) up to the end of the table


We have tried DELETE and INSERT rather than UPDATE, and that DML step is now significantly quicker. But the VACUUM still merges all 2 billion rows:

DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds

INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds

VACUUM fact_table;
-- 23645 seconds

In fact, the VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.
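For reference, the VACUUM above is the default full vacuum. Redshift also accepts narrower variants, sketched below, though neither one on its own leaves the table both fully sorted and fully reclaimed:

VACUUM DELETE ONLY fact_table; -- reclaims space from deleted rows, skips sorting
VACUUM SORT ONLY fact_table;   -- sorts unsorted rows, skips reclaiming deleted space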


The Question

Does anyone have any advice on how to avoid this immense VACUUM overhead, and only merge the last 0.1% of the table?

Asked May 02 '14 by MatBailie


1 Answer

How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during VACUUM and we've never experienced any performance problems with doing that. Basically, it doesn't matter how long it takes because we just keep running business as usual (BAU).

I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance-sensitive, but we find query times to be within normal variations until the table is more than, say, 90% unsorted.
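As a sketch of how to check that, SVV_TABLE_INFO exposes an unsorted percentage per table:

SELECT "table", unsorted, tbl_rows
FROM   svv_table_info
WHERE  "table" = 'fact_table';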

If you find that there's a meaningful performance difference, have you considered using separate recent and history tables (combined inside a UNION ALL view if needed)? That way you can VACUUM the small "recent" table quickly.
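A minimal sketch of that layout, with illustrative table and view names (X is whatever roll-forward cut-off suits your sort key):

CREATE TABLE fact_table_history (LIKE fact_table);
CREATE TABLE fact_table_recent  (LIKE fact_table);

CREATE VIEW fact_table_all AS
SELECT * FROM fact_table_history
UNION ALL
SELECT * FROM fact_table_recent;

-- Hourly housekeeping and VACUUM then touch only the small table:
VACUUM fact_table_recent;

-- Periodically roll settled rows into history:
INSERT INTO fact_table_history SELECT * FROM fact_table_recent WHERE pk_a <= X;
DELETE FROM fact_table_recent WHERE pk_a <= X;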

Answered Nov 15 '22 by Joe Harris