We have a product using a PostgreSQL database server that is deployed at a couple of hundred client sites. Some of them have accumulated tens of gigabytes of data over the years. So in the next version we will introduce automated cleanup procedures that gradually archive and DELETE old records during nightly batch jobs.
If I understand correctly, autovacuum will kick in, analyze the tables, and reorganize the tuples, so performance will be like it was when fewer records existed.
The actual disk space will not be released, though, since that only happens with a VACUUM FULL, which is not triggered by autovacuum.
So I was thinking about an automated process that would do this.
I found the bloat view that is used by the Nagios check_postgres plugin at http://wiki.postgresql.org/wiki/Show_database_bloat.
Is this view any good? Am I correct in understanding that if the tbloat is > 2, it could use a VACUUM FULL? And if the ibloat is too high, it could use a REINDEX?
Any comments on the following job to run as a daily batch job?
vacuumdb -Z mydatabase    #vacuum with analyze-only
select tablename from bloatview order by tbloat desc limit 1
vacuumdb -f -t tablename mydatabase
select tablename, iname from bloatview order by ibloat desc limit 1
reindexdb -t tablename -i iname mydatabase
Of course, I still need to wrap it in a nice Perl script and run it from crontab (we're using Ubuntu 12). Or does PostgreSQL have some sort of scheduler I could do this with?
Or is this total overkill and is there a much simpler procedure?
You probably don't need it. It is good to do this once, after the first archiving job, so you get your disk space back, but after that your daily archiving job and autovacuum will prevent dead-tuple bloat.
Also, instead of vacuum full it is often better to run cluster table_name using index_name; analyze table_name;. This reorders the rows according to an index, so related table rows are stored physically close together on disk, which can limit disk seeking (important on classic disk drives, largely irrelevant on SSDs) and the number of reads for your typical queries. And remember that both vacuum full and cluster will make your tables unusable while they run.
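As a minimal sketch, assuming a hypothetical orders table with an index orders_created_at_idx on its timestamp column:

cluster orders using orders_created_at_idx;  -- hypothetical names; rewrites the table in index order (takes an exclusive lock)
analyze orders;                              -- refresh planner statistics after the rewrite

After this, typical range queries on the indexed column touch fewer pages, since the matching rows sit next to each other on disk.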
OK, I worked my way through it.
I simplified/reworked the view and split it into the following two:
CREATE OR REPLACE VIEW
bloat_datawidth AS
SELECT
ns.nspname AS schemaname,
tbl.oid AS relid,
tbl.relname,
CASE
WHEN every(avg_width IS NOT NULL)
THEN SUM((1-null_frac)*avg_width) + MAX(null_frac) * 24
ELSE NULL
END AS datawidth
FROM
pg_attribute att
JOIN
pg_class tbl
ON
att.attrelid = tbl.oid
JOIN
pg_namespace ns
ON
ns.oid = tbl.relnamespace
LEFT JOIN
pg_stats s
ON
s.schemaname=ns.nspname
AND s.tablename = tbl.relname
AND s.inherited=false
AND s.attname=att.attname
WHERE
att.attnum > 0
AND tbl.relkind='r'
GROUP BY
1,2,3;
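As a quick sanity check, this first view can be queried on its own; a minimal sketch:

select schemaname, relname, datawidth from bloat_datawidth order by datawidth desc nulls last;

Note that datawidth comes out NULL whenever any column of a table has no pg_stats entry (for example, when the table has never been analyzed), because of the every(avg_width IS NOT NULL) guard.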
And
CREATE OR REPLACE VIEW
bloat_tables AS
SELECT
bdw.schemaname,
bdw.relname,
bdw.datawidth,
cc.reltuples::bigint,
cc.relpages::bigint,
ceil(cc.reltuples*bdw.datawidth/current_setting('block_size')::NUMERIC)::bigint AS expectedpages,
100 - (cc.reltuples*100*bdw.datawidth)/(current_setting('block_size')::NUMERIC*cc.relpages) AS bloatpct
FROM
bloat_datawidth bdw
JOIN
pg_class cc
ON
cc.oid = bdw.relid
AND cc.relpages > 1
AND bdw.datawidth IS NOT NULL;
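To preview which tables the nightly job would pick, the same expressions the script filters on can be inspected directly; a minimal sketch:

select relname, bloatpct, relpages - expectedpages as wastedpages
from bloat_tables
where schemaname = 'public'
order by wastedpages desc;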
And the cron job:
#!/bin/bash
MIN_BLOAT=65
MIN_WASTED_PAGES=100
LOG_FILE=/var/log/postgresql/bloat.log
DATABASE=unity-stationmaster
SCHEMA=public
if [[ "$(id -un)" != "postgres" ]]
then
echo "You need to be user postgres to run this script."
exit 1
fi
TABLENAME=$(psql "$DATABASE" -t -A -c "select relname from bloat_tables where bloatpct > $MIN_BLOAT and relpages-expectedpages > $MIN_WASTED_PAGES and schemaname = '$SCHEMA' order by relpages-expectedpages desc limit 1")
if [[ -z "$TABLENAME" ]]
then
echo "No bloated tables." >> $LOG_FILE
exit 0
fi
vacuumdb -v -f -t "$TABLENAME" "$DATABASE" >> "$LOG_FILE"