I have the following tables:
CREATE TABLE views (
view_id bigint NOT NULL,
usr_id bigint,
ip inet,
referer_id bigint,
country_id integer,
validated smallint,
completed smallint,
value numeric
);
ALTER TABLE ONLY views
ADD CONSTRAINT "Views_pkey" PRIMARY KEY (view_id);
CREATE TABLE country (
country_id integer NOT NULL,
country character varying(2)
);
ALTER TABLE ONLY country
ADD CONSTRAINT country_pkey PRIMARY KEY (country_id);
CREATE TABLE file_id_view_id (
file_id bigint,
view_id bigint,
created_ts timestamp without time zone
);
CREATE TABLE file_owner (
file_id bigint NOT NULL,
owner_id bigint
);
ALTER TABLE ONLY file_owner
ADD CONSTRAINT owner_table_pkey PRIMARY KEY (file_id);
CREATE TABLE referer (
referer_id bigint NOT NULL,
referer character varying(255)
);
ALTER TABLE ONLY referer
ADD CONSTRAINT referer_pkey PRIMARY KEY (referer_id);
The views and file_id_view_id tables have approximately 340M rows each; each hour they both grow by 600K rows.
The file_owner table has 75K rows and grows by 100 rows per hour.
The country table has 233 rows and rarely changes.
The referer table has 6494 rows and rarely changes.
My goal is to be able to perform a query such as:
SELECT Count(ft.*) AS total_views,
( Count(ft.*) - SUM(ft.valid) ) AS invalid_views,
SUM(ft.valid) AS valid_views,
SUM(ft.values) AS VALUES,
ft.day AS day,
( CASE
WHEN r.referer IS NULL THEN 'Unknown'
ELSE r.referer
END ) AS referer,
( CASE
WHEN c.country IS NULL THEN 'Unknown'
ELSE c.country
END ) AS country
FROM country c
right join (referer r
right join (SELECT v.validated AS valid,
v.value AS VALUES,
vf.day AS day,
vf.view_id AS view_id,
v.referer_id AS referer_id,
v.country_id AS country_id
FROM VIEWS v,
(SELECT view_id,
fivi.created_ts :: timestamp :: DATE AS
day
FROM file_id_view_id fivi
join (SELECT file_id
FROM file_owner
WHERE owner_id = 75
GROUP BY file_id) fo
ON ( fo.file_id = fivi.file_id )
WHERE ( fivi.created_ts BETWEEN
'2015-11-01' AND '2015-12-01' )
GROUP BY view_id,
day) vf
WHERE v.view_id = vf.view_id) ft
ON ( ft.referer_id = r.referer_id ))
ON ( ft.country_id = c.country_id )
GROUP BY day,
referer,
country;
To produce:
total_views | invalid_views | valid_views | values | day | referer | country
------------+---------------+-------------+--------+------------+-----------------+---------
When running this query with EXPLAIN ANALYZE, the following is produced:
GroupAggregate (cost=38893491.99..40443007.61 rows=182295955 width=52) (actual time=183725.696..205882.889 rows=172 loops=1)
Group Key: ((fivi.created_ts)::date), r.referer, c.country
-> Sort (cost=38893491.99..38984639.97 rows=182295955 width=52) (actual time=183725.655..200899.098 rows=8390217 loops=1)
Sort Key: ((fivi.created_ts)::date), r.referer, c.country
Sort Method: external merge Disk: 420192kB
-> Hash Left Join (cost=16340128.88..24989809.75 rows=182295955 width=52) (actual time=23399.900..104337.332 rows=8390217 loops=1)
Hash Cond: (v.country_id = c.country_id)
-> Hash Left Join (cost=16340125.36..24800637.72 rows=182295955 width=49) (actual time=23399.782..102534.655 rows=8390217 loops=1)
Hash Cond: (v.referer_id = r.referer_id)
-> Merge Join (cost=16340033.52..24051874.62 rows=182295955 width=29) (actual time=23397.410..99955.000 rows=8390217 loops=1)
Merge Cond: (fivi.view_id = v.view_id)
-> Group (cost=16340033.41..16716038.36 rows=182295955 width=16) (actual time=23397.298..30454.444 rows=8390217 loops=1)
Group Key: fivi.view_id, ((fivi.created_ts)::date)
-> Sort (cost=16340033.41..16434985.73 rows=189904653 width=16) (actual time=23397.294..28165.729 rows=8390217 loops=1)
Sort Key: fivi.view_id, ((fivi.created_ts)::date)
Sort Method: external merge Disk: 180392kB
-> Nested Loop (cost=6530.43..8799350.01 rows=189904653 width=16) (actual time=63.123..15131.956 rows=8390217 loops=1)
-> HashAggregate (cost=6530.31..6659.62 rows=43104 width=8) (actual time=62.983..90.331 rows=43887 loops=1)
Group Key: file_owner.file_id
-> Bitmap Heap Scan on file_owner (cost=342.90..6508.76 rows=43104 width=8) (actual time=5.407..50.779 rows=43887 loops=1)
Recheck Cond: (owner_id = 75)
Heap Blocks: exact=5904
-> Bitmap Index Scan on owner_id_index (cost=0.00..340.74 rows=43104 width=0) (actual time=4.327..4.327 rows=45576 loops=1)
Index Cond: (owner_id = 75)
-> Index Scan using file_id_view_id_indexing on file_id_view_id fivi (cost=0.11..188.56 rows=4406 width=24) (actual time=0.122..0.306 rows=191 loops=43887)
Index Cond: (file_id = file_owner.file_id)
Filter: ((created_ts >= '2015-11-01 00:00:00'::timestamp without time zone) AND (created_ts <= '2015-12-01 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 184
-> Index Scan using "Views_pkey" on views v (cost=0.11..5981433.17 rows=338958763 width=25) (actual time=0.088..46804.757 rows=213018702 loops=1)
-> Hash (cost=68.77..68.77 rows=6591 width=28) (actual time=2.344..2.344 rows=6495 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 410kB
-> Seq Scan on referer r (cost=0.00..68.77 rows=6591 width=28) (actual time=0.006..1.156 rows=6495 loops=1)
-> Hash (cost=2.70..2.70 rows=233 width=7) (actual time=0.078..0.078 rows=233 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> Seq Scan on country c (cost=0.00..2.70 rows=233 width=7) (actual time=0.005..0.042 rows=233 loops=1)
Planning time: 1.015 ms
Execution time: 206034.660 ms
(37 rows)
Plan on explain.depesz.com: http://explain.depesz.com/s/OiN
206s run time.
Some things to note:
PostgreSQL version 9.4.
I have adjusted the config as follows:
The following indexes currently exist:
The previous query used an owner_id that was picked conservatively; some queries may result in 1/3 of the file_id_view_id table being joined with views.
Changing the data structure is a last resort. At this stage such a change must be due to serious concerns.
The db can be considered read-only if need be; data is written on an hourly basis, and Postgres is given plenty of breathing room after each write. At the moment, during a 600K-row hourly write, the db returns in 1100s (this is due to other factors alongside the insert cost). There is plenty of room to add additional indexes if that would increase read speed; read speed is the priority.
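For example (a sketch only, with illustrative index names; whether these help depends on the plans they actually produce), a couple of composite indexes could let more of the work be satisfied from the indexes alone:

-- Illustrative only: cover the file_id lookup, the date filter and the
-- view_id that is joined to views, so the lookup can often be an index-only scan.
CREATE INDEX CONCURRENTLY file_id_view_id_file_created_view_idx
    ON file_id_view_id (file_id, created_ts, view_id);

-- Illustrative only: a wider index on views so the per-view attributes can
-- be read without walking the 340M-row table through "Views_pkey".
CREATE INDEX CONCURRENTLY views_view_id_attrs_idx
    ON views (view_id, validated, value, referer_id, country_id);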
The hardware specifications are:
CPU: http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
RAM: 128GB
STORAGE: 1.5TB PCIE SSD
How can I optimize either my database or my query so that I can retrieve the information I need from the db in a reasonable time frame?
What can I do to optimize my current design?
I believe Postgres and the hardware it is running on have the capability to perform much better than they currently do.
UPDATE
I have tried:
Does anyone have any experience restructuring tables this big? Is it feasible? Will it take days or hours (estimating, of course)?
I am considering denormalizing the database, since it will really only be referenced in this way. My only concern is: if 100M rows were to be fetched from a table with an indexed owner_id, would it be fast enough, or would I still face the same performance issues? I would hate to go one way and then have to backtrack.
The other solution I am looking into is @ivan.panasuik's suggestion: group all of a day's data into another table, since once the day has passed that information is constant and does not need to be changed or updated. However, I am unsure how to implement this smoothly - should I have queries run through the data while the inserts are on hold and catch the days up as fast as possible, and from then on have a trigger set?
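As a rough sketch of that rollup idea (the table name, columns and load query below are my own assumptions, not part of the existing schema), it could look like this, backfilling one completed day at a time:

-- Hypothetical daily rollup: one row per owner / day / referer / country.
CREATE TABLE daily_views (
    owner_id    bigint,
    day         date,
    referer_id  bigint,
    country_id  integer,
    total_views bigint,
    valid_views bigint,
    total_value numeric
);
CREATE INDEX daily_views_owner_day_idx ON daily_views (owner_id, day);

-- Backfill a single completed day (run from a scheduled job or a trigger-driven process):
INSERT INTO daily_views
SELECT fo.owner_id,
       fivi.created_ts::date AS day,
       v.referer_id,
       v.country_id,
       count(*)         AS total_views,
       sum(v.validated) AS valid_views,
       sum(v.value)     AS total_value
FROM   file_id_view_id fivi
JOIN   file_owner fo ON fo.file_id = fivi.file_id
JOIN   views v       ON v.view_id  = fivi.view_id
WHERE  fivi.created_ts >= '2015-11-01'
AND    fivi.created_ts <  '2015-11-02'
GROUP  BY 1, 2, 3, 4;

Reports for past days would then read only from daily_views; how to keep it current (a trigger vs. catching days up while inserts are on hold) is the open question above.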
The speed of your database generally isn't down to your hardware; it's how well you use the intelligence and features of the engine itself.
Try to avoid subselects, especially when dealing with a lot of data; these often can't be optimised by the query planner. In most cases you should be able to convert simple subselects into JOINs, or even into separate database lookups beforehand if needed.
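Applied to the query in the question, that could mean flattening the nested subselects into plain joins, roughly like this (a sketch only, untested; it drops the inner GROUP BY view_id, day, so the counts can differ if file_id_view_id holds the same view_id more than once for one owner and day):

SELECT count(*)                       AS total_views,
       count(*) - sum(v.validated)    AS invalid_views,
       sum(v.validated)               AS valid_views,
       sum(v.value)                   AS values,
       fivi.created_ts::date          AS day,
       coalesce(r.referer, 'Unknown') AS referer,
       coalesce(c.country, 'Unknown') AS country
FROM   file_owner fo
JOIN   file_id_view_id fivi ON fivi.file_id = fo.file_id
JOIN   views v              ON v.view_id    = fivi.view_id
LEFT   JOIN referer r ON r.referer_id = v.referer_id
LEFT   JOIN country c ON c.country_id = v.country_id
WHERE  fo.owner_id = 75
AND    fivi.created_ts BETWEEN '2015-11-01' AND '2015-12-01'
GROUP  BY day, referer, country;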
Partition your tables - PostgreSQL doesn't do this natively (sort of), but if you very frequently access only the recent data you can remove a lot of work by moving the archive data out of the way.
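On 9.4 that means inheritance-based partitioning (declarative partitioning only arrived in later releases); a minimal sketch, with an illustrative monthly child:

-- One child table per month; the CHECK constraint is what lets the planner
-- skip irrelevant months when constraint_exclusion is enabled.
CREATE TABLE file_id_view_id_2015_11 (
    CHECK (created_ts >= '2015-11-01' AND created_ts < '2015-12-01')
) INHERITS (file_id_view_id);

CREATE INDEX ON file_id_view_id_2015_11 (file_id, created_ts);

-- 'partition' is the default setting, but it must not have been turned off:
SET constraint_exclusion = partition;

New rows then have to be routed into the right child, either by a trigger on the parent or by loading each hourly batch directly into the current month's child.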
Consider a data warehousing strategy - when you're dealing with that amount of data you should consider storing a copy of the data in a denormalised form that is very quick to retrieve, because the nasty JOINs have already been taken care of. We do this with Redshift (a derivative of PostgreSQL) so that we do not need to do any JOINs when running reports.
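In plain PostgreSQL, one cheap way to approximate this is a materialized view that bakes the joins in once and is refreshed after each hourly load (the view name and the exact column set below are assumptions):

CREATE MATERIALIZED VIEW report_views AS
SELECT fo.owner_id,
       fivi.created_ts::date          AS day,
       coalesce(r.referer, 'Unknown') AS referer,
       coalesce(c.country, 'Unknown') AS country,
       count(*)                       AS total_views,
       sum(v.validated)               AS valid_views,
       sum(v.value)                   AS total_value
FROM   file_id_view_id fivi
JOIN   file_owner fo ON fo.file_id = fivi.file_id
JOIN   views v       ON v.view_id  = fivi.view_id
LEFT   JOIN referer r ON r.referer_id = v.referer_id
LEFT   JOIN country c ON c.country_id = v.country_id
GROUP  BY 1, 2, 3, 4;

CREATE INDEX ON report_views (owner_id, day);

-- After each hourly write:
REFRESH MATERIALIZED VIEW report_views;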
It's very hard to predict the effect of an optimization without trying it, so try these one by one. And good luck.