 

How do I increase the speed of my Postgres select statement?

I have the following tables:

CREATE TABLE views (
    view_id bigint NOT NULL,
    usr_id bigint,
    ip inet,
    referer_id bigint,
    country_id integer,
    validated smallint,
    completed smallint,
    value numeric
);

ALTER TABLE ONLY views
    ADD CONSTRAINT "Views_pkey" PRIMARY KEY (view_id);

CREATE TABLE country (
    country_id integer NOT NULL,
    country character varying(2)
);

ALTER TABLE ONLY country
    ADD CONSTRAINT country_pkey PRIMARY KEY (country_id);

CREATE TABLE file_id_view_id (
    file_id bigint,
    view_id bigint,
    created_ts timestamp without time zone
);

CREATE TABLE file_owner (
    file_id bigint NOT NULL,
    owner_id bigint
);

ALTER TABLE ONLY file_owner
        ADD CONSTRAINT owner_table_pkey PRIMARY KEY (file_id);

CREATE TABLE referer (
    referer_id bigint NOT NULL,
    referer character varying(255)
);

ALTER TABLE ONLY referer
    ADD CONSTRAINT referer_pkey PRIMARY KEY (referer_id);

The views and file_id_view_id tables have approximately 340M rows each, and each grows by 600K rows per hour.

The file_owner table has 75K rows and will increase hourly by 100 rows.

The country table has 233 rows and rarely changes.

The referer table has 6494 rows and rarely changes.

My goal is to be able to perform a query such as:

SELECT Count(ft.*)                     AS total_views,
       ( Count(ft.*) - SUM(ft.valid) ) AS invalid_views,
       SUM(ft.valid)                   AS valid_views,
       SUM(ft.values)                  AS VALUES,
       ft.day                          AS day,
       ( CASE
           WHEN r.referer IS NULL THEN 'Unknown'
           ELSE r.referer
         END )                         AS referer,
       ( CASE
           WHEN c.country IS NULL THEN 'Unknown'
           ELSE c.country
         END )                         AS country
FROM   country c
       RIGHT JOIN ( referer r
                    RIGHT JOIN (SELECT v.validated  AS valid,
                                       v.value      AS VALUES,
                                       vf.day       AS day,
                                       vf.view_id   AS view_id,
                                       v.referer_id AS referer_id,
                                       v.country_id AS country_id
                                FROM   views v,
                                       (SELECT view_id,
                                               fivi.created_ts :: timestamp :: DATE AS day
                                        FROM   file_id_view_id fivi
                                               JOIN (SELECT file_id
                                                     FROM   file_owner
                                                     WHERE  owner_id = 75
                                                     GROUP  BY file_id) fo
                                                 ON ( fo.file_id = fivi.file_id )
                                        WHERE  fivi.created_ts BETWEEN '2015-11-01' AND '2015-12-01'
                                        GROUP  BY view_id,
                                                  day) vf
                                WHERE  v.view_id = vf.view_id) ft
                      ON ( ft.referer_id = r.referer_id ) )
         ON ( ft.country_id = c.country_id )
GROUP  BY day,
          referer,
          country;

To produce:

total_views | invalid_views | valid_views | values |    day     |     referer     | country 
------------+---------------+-------------+--------+------------+-----------------+---------

When running such query with EXPLAIN ANALYZE the following is produced:

GroupAggregate  (cost=38893491.99..40443007.61 rows=182295955 width=52) (actual time=183725.696..205882.889 rows=172 loops=1)
  Group Key: ((fivi.created_ts)::date), r.referer, c.country
  ->  Sort  (cost=38893491.99..38984639.97 rows=182295955 width=52) (actual time=183725.655..200899.098 rows=8390217 loops=1)
        Sort Key: ((fivi.created_ts)::date), r.referer, c.country
        Sort Method: external merge  Disk: 420192kB
        ->  Hash Left Join  (cost=16340128.88..24989809.75 rows=182295955 width=52) (actual time=23399.900..104337.332 rows=8390217 loops=1)
              Hash Cond: (v.country_id = c.country_id)
              ->  Hash Left Join  (cost=16340125.36..24800637.72 rows=182295955 width=49) (actual time=23399.782..102534.655 rows=8390217 loops=1)
                    Hash Cond: (v.referer_id = r.referer_id)
                    ->  Merge Join  (cost=16340033.52..24051874.62 rows=182295955 width=29) (actual time=23397.410..99955.000 rows=8390217 loops=1)
                          Merge Cond: (fivi.view_id = v.view_id)
                          ->  Group  (cost=16340033.41..16716038.36 rows=182295955 width=16) (actual time=23397.298..30454.444 rows=8390217 loops=1)
                                Group Key: fivi.view_id, ((fivi.created_ts)::date)
                                ->  Sort  (cost=16340033.41..16434985.73 rows=189904653 width=16) (actual time=23397.294..28165.729 rows=8390217 loops=1)
                                      Sort Key: fivi.view_id, ((fivi.created_ts)::date)
                                      Sort Method: external merge  Disk: 180392kB
                                      ->  Nested Loop  (cost=6530.43..8799350.01 rows=189904653 width=16) (actual time=63.123..15131.956 rows=8390217 loops=1)
                                            ->  HashAggregate  (cost=6530.31..6659.62 rows=43104 width=8) (actual time=62.983..90.331 rows=43887 loops=1)
                                                  Group Key: file_owner.file_id
                                                  ->  Bitmap Heap Scan on file_owner  (cost=342.90..6508.76 rows=43104 width=8) (actual time=5.407..50.779 rows=43887 loops=1)
                                                        Recheck Cond: (owner_id = 75)
                                                        Heap Blocks: exact=5904
                                                        ->  Bitmap Index Scan on owner_id_index  (cost=0.00..340.74 rows=43104 width=0) (actual time=4.327..4.327 rows=45576 loops=1)
                                                              Index Cond: (owner_id = 75)
                                            ->  Index Scan using file_id_view_id_indexing on file_id_view_id fivi  (cost=0.11..188.56 rows=4406 width=24) (actual time=0.122..0.306 rows=191 loops=43887)
                                                  Index Cond: (file_id = file_owner.file_id)
                                                  Filter: ((created_ts >= '2015-11-01 00:00:00'::timestamp without time zone) AND (created_ts <= '2015-12-01 00:00:00'::timestamp without time zone))
                                                  Rows Removed by Filter: 184
                          ->  Index Scan using "Views_pkey" on views v  (cost=0.11..5981433.17 rows=338958763 width=25) (actual time=0.088..46804.757 rows=213018702 loops=1)
                    ->  Hash  (cost=68.77..68.77 rows=6591 width=28) (actual time=2.344..2.344 rows=6495 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 410kB
                          ->  Seq Scan on referer r  (cost=0.00..68.77 rows=6591 width=28) (actual time=0.006..1.156 rows=6495 loops=1)
              ->  Hash  (cost=2.70..2.70 rows=233 width=7) (actual time=0.078..0.078 rows=233 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 10kB
                    ->  Seq Scan on country c  (cost=0.00..2.70 rows=233 width=7) (actual time=0.005..0.042 rows=233 loops=1)
Planning time: 1.015 ms
Execution time: 206034.660 ms
(37 rows)

Plan on explain.depesz.com: http://explain.depesz.com/s/OiN

206s run time.

Some things to note:

PostgreSQL version 9.4

I have adjusted the config as follows:

  1. shared_buffers = 30GB
  2. work_mem = 32MB
  3. random_page_cost = 2.0
  4. cpu_tuple_cost = 0.0030
  5. cpu_index_tuple_cost = 0.0010
  6. cpu_operator_cost = 0.0005
  7. effective_cache_size = 52GB

The following indexes currently exist:

  1. CREATE INDEX country_index ON country USING btree (country);
  2. CREATE INDEX created_ts_index ON file_id_view_id USING btree (created_ts);
  3. CREATE INDEX file_id_created_ts_index ON file_id_view_id USING btree (created_ts, file_id);
  4. CREATE INDEX file_id_view_id_indexing ON file_id_view_id USING btree (file_id);
  5. CREATE INDEX owner_id_file_id_index ON file_owner USING btree (file_id, owner_id);
  6. CREATE INDEX owner_id_index ON file_owner USING btree (owner_id);
  7. CREATE INDEX referer_index ON referer USING btree (referer);

The previous query used an owner id that was picked conservatively; some queries may result in 1/3 of the file_id_view_id table being joined with views.

Changing the data structure is a last resort. At this stage such a change must be due to serious concerns.

The db can be considered read-only if need be; data is written on an hourly basis, and plenty of breathing room is given to Postgres after each write. At the moment, during a 600K-row hourly write, the db returns in 1100s (this is due to other factors alongside the insert cost). There is plenty of room to add additional indexes if it would increase read speeds; read speed is the priority.

The hardware specifications are:

CPU: http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz

RAM: 128GB

STORAGE: 1.5TB PCIE SSD

How can I optimize either my database or query so that I can retrieve information I need out of the db in a reasonable time frame?

What can I do to optimize my current design?

I believe Postgres and the hardware it is running on have the capability to perform much better than they currently do.

UPDATE

I have tried:

  1. Analyzing the tables; this did not affect performance.
  2. Increasing work_mem; this resulted in a speed increase to 116s (see the snippet after this list).
  3. Relying on Postgres' query planner by avoiding sub-selects; this negatively affected performance.
  4. Separating db lookups beforehand; this had seemingly no positive or negative effect.
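
For item 2, a minimal sketch of how work_mem can be raised for just the reporting session rather than globally. The 512MB value is an assumption based on the ~420MB on-disk sort shown in the plan above, and since work_mem is allocated per sort/hash node it should be sized with concurrent sessions in mind:

SET work_mem = '512MB';   -- session-level only; 512MB is an illustrative guess

-- ... run the reporting query here ...

RESET work_mem;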

Does anyone have any experience restructuring tables this big? Is it feasible? Will it take days or hours (estimating, of course)?

I am considering de-normalizing the database, since it really will only be queried in this way. My only concern is: if 100M rows were to be fetched from a table with an indexed owner_id, would it be fast enough, or would I still face the same performance issues? I would hate to go one way and then have to backtrack.

The other solution I am looking into is @ivan.panasiuk's suggestion: group all of a day's data into another table, since once the day has passed that information is constant and does not need to be changed or updated. However, I am unsure how to implement this smoothly. Should I have queries run through the data while the inserts are on hold and catch the days up as fast as possible, and from then on keep the table updated with a trigger?

asked Dec 14 '15 by Daniel B



2 Answers

The speed of your database generally isn't determined by your hardware; it's determined by how well you use the intelligence and features of the engine itself.

  1. Try to avoid subselects - especially when dealing with a lot of data. These often can't be optimised by the query planner. In most cases you should be able to convert simple subselects into JOINs, or even separate database lookups beforehand if needed (see the first sketch after this list).

  2. Partition your tables - PostgreSQL doesn't do this natively (in 9.4 it is done through table inheritance), but if you very frequently access only the recent data you could remove a lot of work by moving the archive data out of the way (see the second sketch after this list).

  3. Consider a data warehousing strategy - when you're dealing with that amount of data you should consider storing a copy of the data in a denormalised way that is very quick to retrieve, because the nasty JOINs have already been taken care of. We do this with Redshift (a derivative of PostgreSQL) so that we do not need to do any JOINs when running reports.
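
For point 1, a possible flattened form of the original query with the subselects removed, written as plain joins. This sketch assumes each view_id appears at most once in file_id_view_id inside the date range; the original inner GROUP BY view_id, day would otherwise deduplicate rows that this version does not:

SELECT count(*)                        AS total_views,
       count(*) - sum(v.validated)     AS invalid_views,
       sum(v.validated)                AS valid_views,
       sum(v.value)                    AS "values",
       fivi.created_ts::date           AS day,
       coalesce(r.referer, 'Unknown')  AS referer,
       coalesce(c.country, 'Unknown')  AS country
FROM   file_owner fo
       JOIN file_id_view_id fivi ON fivi.file_id = fo.file_id
       JOIN views v              ON v.view_id = fivi.view_id
       LEFT JOIN referer r       ON r.referer_id = v.referer_id
       LEFT JOIN country c       ON c.country_id = v.country_id
WHERE  fo.owner_id = 75
  AND  fivi.created_ts BETWEEN '2015-11-01' AND '2015-12-01'
GROUP  BY fivi.created_ts::date, r.referer, c.country;

For point 2, PostgreSQL 9.4 implements partitioning through table inheritance, CHECK constraints, and constraint_exclusion = partition. A rough monthly-partition sketch for file_id_view_id follows; the table, index, and function names are illustrative, and existing rows would still have to be migrated into the children:

-- One child table per month; the CHECK constraint lets the planner skip
-- partitions outside the queried date range when constraint_exclusion = partition.
CREATE TABLE file_id_view_id_2015_11 (
    CHECK (created_ts >= DATE '2015-11-01' AND created_ts < DATE '2015-12-01')
) INHERITS (file_id_view_id);

CREATE INDEX ON file_id_view_id_2015_11 (file_id, created_ts);

-- Route new rows into the correct child; extend with one branch per month.
CREATE OR REPLACE FUNCTION file_id_view_id_insert_trigger()
RETURNS trigger AS $$
BEGIN
    IF NEW.created_ts >= DATE '2015-11-01'
       AND NEW.created_ts < DATE '2015-12-01' THEN
        INSERT INTO file_id_view_id_2015_11 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'No partition for created_ts %', NEW.created_ts;
    END IF;
    RETURN NULL;  -- suppress the insert into the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER file_id_view_id_insert
    BEFORE INSERT ON file_id_view_id
    FOR EACH ROW EXECUTE PROCEDURE file_id_view_id_insert_trigger();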

answered by Elliot Chance


  1. Remove ( Count(ft.*) - SUM(ft.valid) ) AS invalid_views, since you already have these values and can calculate it later, when displaying the results.
  2. Add an index on file_owner.file_id and check that every single field used in the query has an index (fields you use in conditions: WHERE, GROUP BY, etc.).
  3. I didn't analyse the query further, but it seems you should split the query into a few smaller (and faster) queries and connect them using temp tables or a stored procedure.
  4. Assuming that the result for yesterday is something that will not change, you could run the query with the condition day = today() and avoid grouping by day. The results for all previous days can be saved in a separate table; I see that most of the time is spent in the grouping (see the sketch after this list).
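
For point 4, a rough sketch of what such a pre-aggregated table and its daily fill could look like; the table and column names are illustrative, and the job would run once per day after the previous day's hourly inserts have finished:

CREATE TABLE daily_owner_stats (
    day         date,
    owner_id    bigint,
    referer_id  bigint,
    country_id  integer,
    total_views bigint,
    valid_views bigint,
    total_value numeric
);

-- Fill yesterday's slice once it can no longer change.
INSERT INTO daily_owner_stats
SELECT fivi.created_ts::date,
       fo.owner_id,
       v.referer_id,
       v.country_id,
       count(*),
       sum(v.validated),
       sum(v.value)
FROM   file_owner fo
       JOIN file_id_view_id fivi ON fivi.file_id = fo.file_id
       JOIN views v              ON v.view_id = fivi.view_id
WHERE  fivi.created_ts >= current_date - 1
  AND  fivi.created_ts <  current_date
GROUP  BY 1, 2, 3, 4;

Reports for past days then aggregate this small table (joining referer and country for the labels) instead of scanning the 340M-row tables.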

It's very hard to predict the effect of an optimization without trying it, so try them one by one. And good luck.

answered by ivan.panasiuk