I have the following tables:
CREATE TABLE views (
view_id bigint NOT NULL,
usr_id bigint,
ip inet,
referer_id bigint,
country_id integer,
validated smallint,
completed smallint,
value numeric
);
ALTER TABLE ONLY views
ADD CONSTRAINT "Views_pkey" PRIMARY KEY (view_id);
CREATE TABLE country (
country_id integer NOT NULL,
country character varying(2)
);
ALTER TABLE ONLY country
ADD CONSTRAINT country_pkey PRIMARY KEY (country_id);
CREATE TABLE file_id_view_id (
file_id bigint,
view_id bigint,
created_ts timestamp without time zone
);
CREATE TABLE file_owner (
file_id bigint NOT NULL,
owner_id bigint
);
ALTER TABLE ONLY file_owner
ADD CONSTRAINT owner_table_pkey PRIMARY KEY (file_id);
CREATE TABLE referer (
referer_id bigint NOT NULL,
referer character varying(255)
);
ALTER TABLE ONLY referer
ADD CONSTRAINT referer_pkey PRIMARY KEY (referer_id);
The views and file_id_view_id tables have approximately 340M rows each; each hour they both grow by 600K rows.
The file_owner table has 75K rows and grows by 100 rows per hour.
The country table has 233 rows and rarely changes.
The referer table has 6494 rows and rarely changes.
My goal is to be able to perform a query such as:
SELECT Count(ft.*) AS total_views,
( Count(ft.*) - SUM(ft.valid) ) AS invalid_views,
SUM(ft.valid) AS valid_views,
SUM(ft.values) AS VALUES,
ft.day AS day,
( CASE
WHEN r.referer IS NULL THEN 'Unknown'
ELSE r.referer
END ) AS referer,
( CASE
WHEN c.country IS NULL THEN 'Unknown'
ELSE c.country
END ) AS country
FROM country c
right join (referer r
right join (SELECT v.validated AS valid,
v.value AS VALUES,
vf.day AS day,
vf.view_id AS view_id,
v.referer_id AS referer_id,
v.country_id AS country_id
FROM VIEWS v,
(SELECT view_id,
fivi.created_ts :: timestamp :: DATE AS
day
FROM file_id_view_id fivi
join (SELECT file_id
FROM file_owner
WHERE owner_id = 75
GROUP BY file_id) fo
ON ( fo.file_id = fivi.file_id )
WHERE ( fivi.created_ts BETWEEN
'2015-11-01' AND '2015-12-01' )
GROUP BY view_id,
day) vf
WHERE v.view_id = vf.view_id) ft
ON ( ft.referer_id = r.referer_id ))
ON ( ft.country_id = c.country_id )
GROUP BY day,
referer,
country;
To produce:
total_views | invalid_views | valid_views | values | day | referer | country
------------+---------------+-------------+--------+------------+-----------------+---------
When running this query with EXPLAIN ANALYZE, the following is produced:
GroupAggregate (cost=38893491.99..40443007.61 rows=182295955 width=52) (actual time=183725.696..205882.889 rows=172 loops=1)
Group Key: ((fivi.created_ts)::date), r.referer, c.country
-> Sort (cost=38893491.99..38984639.97 rows=182295955 width=52) (actual time=183725.655..200899.098 rows=8390217 loops=1)
Sort Key: ((fivi.created_ts)::date), r.referer, c.country
Sort Method: external merge Disk: 420192kB
-> Hash Left Join (cost=16340128.88..24989809.75 rows=182295955 width=52) (actual time=23399.900..104337.332 rows=8390217 loops=1)
Hash Cond: (v.country_id = c.country_id)
-> Hash Left Join (cost=16340125.36..24800637.72 rows=182295955 width=49) (actual time=23399.782..102534.655 rows=8390217 loops=1)
Hash Cond: (v.referer_id = r.referer_id)
-> Merge Join (cost=16340033.52..24051874.62 rows=182295955 width=29) (actual time=23397.410..99955.000 rows=8390217 loops=1)
Merge Cond: (fivi.view_id = v.view_id)
-> Group (cost=16340033.41..16716038.36 rows=182295955 width=16) (actual time=23397.298..30454.444 rows=8390217 loops=1)
Group Key: fivi.view_id, ((fivi.created_ts)::date)
-> Sort (cost=16340033.41..16434985.73 rows=189904653 width=16) (actual time=23397.294..28165.729 rows=8390217 loops=1)
Sort Key: fivi.view_id, ((fivi.created_ts)::date)
Sort Method: external merge Disk: 180392kB
-> Nested Loop (cost=6530.43..8799350.01 rows=189904653 width=16) (actual time=63.123..15131.956 rows=8390217 loops=1)
-> HashAggregate (cost=6530.31..6659.62 rows=43104 width=8) (actual time=62.983..90.331 rows=43887 loops=1)
Group Key: file_owner.file_id
-> Bitmap Heap Scan on file_owner (cost=342.90..6508.76 rows=43104 width=8) (actual time=5.407..50.779 rows=43887 loops=1)
Recheck Cond: (owner_id = 75)
Heap Blocks: exact=5904
-> Bitmap Index Scan on owner_id_index (cost=0.00..340.74 rows=43104 width=0) (actual time=4.327..4.327 rows=45576 loops=1)
Index Cond: (owner_id = 75)
-> Index Scan using file_id_view_id_indexing on file_id_view_id fivi (cost=0.11..188.56 rows=4406 width=24) (actual time=0.122..0.306 rows=191 loops=43887)
Index Cond: (file_id = file_owner.file_id)
Filter: ((created_ts >= '2015-11-01 00:00:00'::timestamp without time zone) AND (created_ts <= '2015-12-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 184
-> Index Scan using "Views_pkey" on views v (cost=0.11..5981433.17 rows=338958763 width=25) (actual time=0.088..46804.757 rows=213018702 loops=1)
-> Hash (cost=68.77..68.77 rows=6591 width=28) (actual time=2.344..2.344 rows=6495 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 410kB
-> Seq Scan on referer r (cost=0.00..68.77 rows=6591 width=28) (actual time=0.006..1.156 rows=6495 loops=1)
-> Hash (cost=2.70..2.70 rows=233 width=7) (actual time=0.078..0.078 rows=233 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on country c (cost=0.00..2.70 rows=233 width=7) (actual time=0.005..0.042 rows=233 loops=1)
Planning time: 1.015 ms
Execution time: 206034.660 ms
(37 rows)
Plan on explain.depesz.com: http://explain.depesz.com/s/OiN
206s run time.
Some things to note:
PostgreSQL version 9.4.
I have adjusted the config as follows:
The following indexes currently exist:
The previous query used an owner_id that was picked conservatively; some queries may result in 1/3 of the file_id_view_id table being joined with views.
Changing the data structure is a last resort. At this stage such a change must be due to serious concerns.
The db can be considered read-only if need be; data is written on an hourly basis, and Postgres is given plenty of breathing room after each write. At the moment, during a 600K-row hourly write, the db returns in 1100s (this is due to other factors alongside the insert cost). There is plenty of room to add additional indexes if that would increase read speed; read speed is the priority.
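For example (a sketch only, with illustrative index names; whether these help depends on the plans they actually produce), a couple of composite indexes could let more of the work be satisfied from the indexes alone:

-- Illustrative only: cover the file_id lookup, the date filter and the
-- view_id that is joined to views, so the lookup can often be an index-only scan.
CREATE INDEX CONCURRENTLY file_id_view_id_file_created_view_idx
    ON file_id_view_id (file_id, created_ts, view_id);

-- Illustrative only: a wider index on views so the per-view attributes can
-- be read without walking the 340M-row table through "Views_pkey".
CREATE INDEX CONCURRENTLY views_view_id_attrs_idx
    ON views (view_id, validated, value, referer_id, country_id);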
The hardware specifications are:
CPU: http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
RAM: 128GB
STORAGE: 1.5TB PCIE SSD
How can I optimize either my database or my query so that I can retrieve the information I need from the db in a reasonable time frame?
What can I do to optimize my current design?
I believe Postgres and the hardware it is running on have the capability to perform much better than they currently do.
UPDATE
I have tried:
Does anyone have any experience restructuring tables this big? Is it feasible? Will it take days or hours (estimating, of course)?
I am considering denormalizing the database, since it will really only be referenced in this way. My only concern is: if 100M rows were to be fetched from a table with an indexed owner_id, would it be fast enough, or would I still face the same performance issues? I would hate to go one way and then have to backtrack.
The other solution I am looking into is @ivan.panasuik's suggestion: group all of a day's data into another table, since once the day has passed that information is constant and does not need to be changed or updated. However, I am unsure how to implement this smoothly - should I have queries run through the data while the inserts are on hold and catch the days up as fast as possible, and from then on have a trigger set?
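As a rough sketch of that rollup idea (the table name, columns and load query below are my own assumptions, not part of the existing schema), it could look like this, backfilling one completed day at a time:

-- Hypothetical daily rollup: one row per owner / day / referer / country.
CREATE TABLE daily_views (
    owner_id    bigint,
    day         date,
    referer_id  bigint,
    country_id  integer,
    total_views bigint,
    valid_views bigint,
    total_value numeric
);
CREATE INDEX daily_views_owner_day_idx ON daily_views (owner_id, day);

-- Backfill a single completed day (run from a scheduled job or a trigger-driven process):
INSERT INTO daily_views
SELECT fo.owner_id,
       fivi.created_ts::date AS day,
       v.referer_id,
       v.country_id,
       count(*)         AS total_views,
       sum(v.validated) AS valid_views,
       sum(v.value)     AS total_value
FROM   file_id_view_id fivi
JOIN   file_owner fo ON fo.file_id = fivi.file_id
JOIN   views v       ON v.view_id  = fivi.view_id
WHERE  fivi.created_ts >= '2015-11-01'
AND    fivi.created_ts <  '2015-11-02'
GROUP  BY 1, 2, 3, 4;

Reports for past days would then read only from daily_views; how to keep it current (a trigger vs. catching days up while inserts are on hold) is the open question above.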
The speed of your database generally isn't down to your hardware; it's how well you use the intelligence and features of the engine itself.
Try to avoid subselects, especially when dealing with a lot of data; these often can't be optimised by the query planner. In most cases you should be able to convert simple subselects into JOINs, or even into separate database lookups beforehand if needed.
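Applied to the query in the question, that could mean flattening the nested subselects into plain joins, roughly like this (a sketch only, untested; it drops the inner GROUP BY view_id, day, so the counts can differ if file_id_view_id holds the same view_id more than once for one owner and day):

SELECT count(*)                       AS total_views,
       count(*) - sum(v.validated)    AS invalid_views,
       sum(v.validated)               AS valid_views,
       sum(v.value)                   AS values,
       fivi.created_ts::date          AS day,
       coalesce(r.referer, 'Unknown') AS referer,
       coalesce(c.country, 'Unknown') AS country
FROM   file_owner fo
JOIN   file_id_view_id fivi ON fivi.file_id = fo.file_id
JOIN   views v              ON v.view_id    = fivi.view_id
LEFT   JOIN referer r ON r.referer_id = v.referer_id
LEFT   JOIN country c ON c.country_id = v.country_id
WHERE  fo.owner_id = 75
AND    fivi.created_ts BETWEEN '2015-11-01' AND '2015-12-01'
GROUP  BY day, referer, country;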
Partition your tables - PostgreSQL doesn't do this natively (sort of), but if you very frequently access only the recent data you can remove a lot of work by moving the archive data out of the way.
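On 9.4 that means inheritance-based partitioning (declarative partitioning only arrived in later releases); a minimal sketch, with an illustrative monthly child:

-- One child table per month; the CHECK constraint is what lets the planner
-- skip irrelevant months when constraint_exclusion is enabled.
CREATE TABLE file_id_view_id_2015_11 (
    CHECK (created_ts >= '2015-11-01' AND created_ts < '2015-12-01')
) INHERITS (file_id_view_id);

CREATE INDEX ON file_id_view_id_2015_11 (file_id, created_ts);

-- 'partition' is the default setting, but it must not have been turned off:
SET constraint_exclusion = partition;

New rows then have to be routed into the right child, either by a trigger on the parent or by loading each hourly batch directly into the current month's child.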
Consider a data warehousing strategy - when you're dealing with that amount of data you should consider storing a copy of the data in a denormalised form that is very quick to retrieve, because the nasty JOINs have already been taken care of. We do this with Redshift (a derivative of PostgreSQL) so that we do not need to do any JOINs when running reports.
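In plain PostgreSQL, one cheap way to approximate this is a materialized view that bakes the joins in once and is refreshed after each hourly load (the view name and the exact column set below are assumptions):

CREATE MATERIALIZED VIEW report_views AS
SELECT fo.owner_id,
       fivi.created_ts::date          AS day,
       coalesce(r.referer, 'Unknown') AS referer,
       coalesce(c.country, 'Unknown') AS country,
       count(*)                       AS total_views,
       sum(v.validated)               AS valid_views,
       sum(v.value)                   AS total_value
FROM   file_id_view_id fivi
JOIN   file_owner fo ON fo.file_id = fivi.file_id
JOIN   views v       ON v.view_id  = fivi.view_id
LEFT   JOIN referer r ON r.referer_id = v.referer_id
LEFT   JOIN country c ON c.country_id = v.country_id
GROUP  BY 1, 2, 3, 4;

CREATE INDEX ON report_views (owner_id, day);

-- After each hourly write:
REFRESH MATERIALIZED VIEW report_views;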
It's very hard to predict the effect of an optimization without trying it, so try these one by one. And good luck.