
PostgreSQL query taking too long

I have a database with a few hundred million rows. I'm running the following query:

SELECT *
  FROM "Payments" AS p
  INNER JOIN "PaymentOrders" AS po ON po."Id" = p."PaymentOrderId"
  INNER JOIN "Users" AS u ON u."Id" = po."UserId"
  INNER JOIN "Roles" AS r ON u."RoleId" = r."Id"
 WHERE r."Name" = 'Moses'
 LIMIT 1000

When the WHERE clause finds a match in the database, I get the result in several milliseconds, but if I modify the query and specify a non-existent r."Name" in the WHERE clause, it takes far too long to complete. I guess that PostgreSQL is doing a sequential scan on the Payments table (which contains the most rows), comparing each row one by one.
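
To check, I can look at the actual plan for the slow case with something like this ('NonExistentRole' below is just a placeholder for any role name that isn't in the Roles table):

-- ANALYZE runs the query and reports actual times; BUFFERS shows how much data
-- was read from disk versus found in shared_buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM "Payments" AS p
  INNER JOIN "PaymentOrders" AS po ON po."Id" = p."PaymentOrderId"
  INNER JOIN "Users" AS u ON u."Id" = po."UserId"
  INNER JOIN "Roles" AS r ON u."RoleId" = r."Id"
 WHERE r."Name" = 'NonExistentRole'
 LIMIT 1000;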

Isn't PostgreSQL smart enough to check first whether the Roles table contains any row with the Name 'Moses'?

The Roles table contains only 15 rows, while Payments contains ~350 million.

I'm running PostgreSQL 9.2.1.

BTW, this same query on the same schema/data takes 0.024ms to complete on MS SQL Server.

I'll update the question and post EXPLAIN ANALYSE data in a few hours.


Here are the EXPLAIN ANALYZE results: http://explain.depesz.com/s/7e7


And here's the server configuration:

version PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
client_encoding UNICODE
effective_cache_size    4500MB
fsync   on
lc_collate  English_United States.1252
lc_ctype    English_United States.1252
listen_addresses    *
log_destination stderr
log_line_prefix %t 
logging_collector   on
max_connections 100
max_stack_depth 2MB
port    5432
search_path dbo, "$user", public
server_encoding UTF8
shared_buffers  1500MB
TimeZone    Asia/Tbilisi
wal_buffers 16MB
work_mem    10MB

I'm running PostgreSQL on an i5 CPU (4 cores, 3.3 GHz), 8 GB of RAM and a Crucial m4 128 GB SSD.


UPDATE: This looks like a bug in the query planner. On the recommendation of Erwin Brandstetter, I reported it to the PostgreSQL bugs mailing list.

Davita asked Nov 15 '12



1 Answer

As suggested a couple of times on the thread on the PostgreSQL community performance list, you can work around this issue by forcing an optimization barrier using a CTE (PostgreSQL materializes WITH queries, so the LIMIT outside the CTE does not influence how the joins inside it are planned), like this:

WITH x AS
(
  SELECT *
    FROM "Payments" AS p
    JOIN "PaymentOrders" AS po ON po."Id" = p."PaymentOrderId"
    JOIN "Users" AS u ON u."Id" = po."UserId"
    JOIN "Roles" AS r ON u."RoleId" = r."Id"
   WHERE r."Name" = 'Moses'
)
SELECT * FROM x
  LIMIT 1000;

You may also get a good plan for your original query if you set a higher statistics target for "Roles"."Name" and then ANALYZE. For example:

ALTER TABLE "Roles"
  ALTER COLUMN "Name" SET STATISTICS 1000;
ANALYZE "Roles";

If it expects fewer matching rows to exist in the table, as it is likely to do with more fine-grained statistics, it will assume that it needs to read a higher percentage of the table to find them on a sequential scan. This may cause it to prefer using the index instead of sequentially scanning the table.
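
If you want to see what the planner actually knows about that column after ANALYZE, you can peek at the pg_stats view; a quick sketch, where n_distinct and most_common_vals are the values of interest for a 15-row lookup table:

-- Statistics gathered by ANALYZE for "Roles"."Name".
SELECT schemaname, tablename, attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE tablename = 'Roles'
   AND attname = 'Name';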

You might also get a better plan for the original query by adjusting some of the planner's costing constants and caching assumptions. Things you could try in a single session with the SET command (see the sketch after this list):

  • Reduce random_page_cost. This is largely based on how heavily cached your data is. Given a table with hundreds of millions of rows you probably don't want to go below 2; although if the active data set in your database is heavily cached you can reduce it all the way down to the setting for seq_page_cost, and you may want to reduce both of them by an order of magnitude.

  • Make sure that effective_cache_size is set to the sum of shared_buffers and whatever your OS is caching. This doesn't allocate any memory; it just tells the optimizer how likely index pages are to remain in cache during heavy access. A higher setting makes indexes look better when compared to sequential scans.

  • Increase cpu_tuple_cost to somewhere in the range of 0.03 to 0.05. I have found the default of 0.01 to be too low. I often get better plans by increasing it, and have never seen a value in the range I suggested cause worse plans to be chosen.

  • Make sure that your work_mem setting is reasonable. In most environments where I've run PostgreSQL, that is in the 16MB to 64MB range. This will allow better use of hash tables, bitmap index scans, sorts, etc., and can completely change your plans, almost always for the better. Be careful about raising it if you have a large number of connections: each connection can allocate this much memory per sort or hash node of the query it is running. The "rule of thumb" is to figure you will hit peaks around this setting times max_connections. This is one of the reasons it is wise to limit the actual number of database connections using a connection pool.
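
For example, a session-level experiment along these lines; the values below are only illustrative starting points for hardware like yours, not tuned recommendations:

-- Session-level only; nothing here touches postgresql.conf.
SET random_page_cost = 2.0;        -- assumes the active data set is fairly well cached
SET effective_cache_size = '6GB';  -- roughly shared_buffers plus OS cache on an 8 GB machine
SET cpu_tuple_cost = 0.03;
SET work_mem = '32MB';
-- Then re-run EXPLAIN ANALYZE on the original query to see whether the plan changes.

If a combination of settings helps, you can undo the experiment with RESET ALL or simply start a new session.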

If you find a good combination of settings for these, you might want to make those changes to your postgresql.conf file. If you do that, monitor closely for performance regressions, and be prepared to tweak the settings for the best performance of your overall load.

I agree that we need to do something to nudge the optimizer away from "risky" plans, even if they look like they will run faster on average; but I will be a little surprised if tuning your configuration so that the optimizer better models the actual costs of each alternative doesn't cause it to use an efficient plan.

Answered by kgrittn