I have a table as below (simplified example, we have over 60 fields):
CREATE TABLE "fact_table" (
"pk_a" bigint NOT NULL ENCODE lzo,
"pk_b" bigint NOT NULL ENCODE delta,
"d_1" bigint NOT NULL ENCODE runlength,
"d_2" bigint NOT NULL ENCODE lzo,
"d_3" character varying(255) NOT NULL ENCODE lzo,
"f_1" bigint NOT NULL ENCODE bytedict,
"f_2" bigint NULL ENCODE delta32k
)
DISTSTYLE KEY
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );
The table is distributed by a high-cardinality dimension.
The table is sorted by a pair of fields that increment in time order.
The table contains over 2 billion rows, and uses ~350GB of disk space, both "per node".
Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort
step takes seconds
- The merge
step takes over 6 hours
We can see from SELECT * FROM svv_vacuum_progress;
that all 2billion rows are being merged. Even though the first 99.9% are completely unaffected.
Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all the records from (1) or (2) up to the end of the table
We have tried DELETE and INSERT
rather than UPDATE
and that DML step is now significantly quicker. But the VACUUM
still merges all 2billion rows.
DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds
INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds
VACUUM fact_table;
-- 23645 seconds
In fact, the VACUUM
merges all 2 billion records even if we just trim the last 746 rows off the end of the table.
The Question
Does anyone have any advice on how to avoid this immense VACUUM
overhead, and only MERGE
on the last 0.1% of the table?
How often are you VACUUMing the table? How does the long duration effect you? our load processing continues to run during VACUUM and we've never experienced any performance problems with doing that. Basically it doesn't matter how long it takes because we just keep running BAU.
I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance sensitive but we find the query times to be within normal variations until the table is more than, say, 90% unsorted.
If you find that there's a meaningful performance difference, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With