 

Improving performance for simple left join in PostgreSQL

I am trying to do a left join between two tables in a PostgreSQL database and finding that it takes about 14 minutes to run. From existing SO posts, it seems like this type of join should be on the order of seconds, so I'd like to know how to improve the performance of this join. I'm running 64-bit PostgreSQL version 9.4.4 on a Windows 8 machine with 8 GB RAM, using pgAdmin III. The table structures are as follows:

Table A: "parcels_qtr":

parcel (text) | yr (int) | qtr (text) | lpid (pk, text) |

Has 15.5 million rows, each column is indexed, and "lpid" is the primary key. I also ran this table through a standard vacuum process.

Table B: "postalvac_qtr":

parcel (text) | yr (int) | qtr (text) | lpid (pk, text) | vacCountY (int) |

Has 618,000 records; all fields except "vacCountY" are indexed, and "lpid" is the primary key. This table has also gone through a standard vacuum process.

When running with data output, it takes about 14 minutes. When running with EXPLAIN (ANALYZE, BUFFERS), it takes a little over a minute. First question: is this difference in performance wholly attributable to printing the data, or is something else going on here?

And second question, can I get this run time down to a few seconds?

Here is my SQL code:

EXPLAIN (ANALYZE, BUFFERS)
select a.parcel,
   a.lpid,
   a.yr,
   a.qtr,
   b."vacCountY"
from parcels_qtr as a
left join postalvac_qtr as b
on a.lpid = b.lpid;

And here are the results of my explain statement: https://explain.depesz.com/s/uKkK

I'm pretty new to PostgreSQL, so patience and explanations would be greatly appreciated!

asked Aug 01 '16 by Parker

People also ask

How can I improve my left join performance?

First of all, indexes are required to speed up the query. If you do not have any, you probably should create some (depending on the query you perform). And if you do multiple LEFT JOINs, you could (probably) split them into separate queries, which should make the application work a lot faster.
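As a minimal illustration using the tables from this question (this mostly matters when the join is selective; here the join key "lpid" is already indexed via the primary keys, so the statement below is purely to show the pattern):

-- make sure the column you join ON is indexed on the joined table
CREATE INDEX ON postalvac_qtr (lpid);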

Which join is faster in PostgreSQL?

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient.

How do I make a PostgreSQL query run faster?

Some of the tricks used to speed up SELECTs in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary-key type conversion, CLUSTER, and pg_hint_plan.


1 Answer

You're asking the DB to do quite a bit of work. Just looking at the explain plan, it has to:

  1. Read in an entire table (postalvac_qtr)
  2. Build a hash based on lpid
  3. Read in an entire other, much larger, table (parcels_qtr)
  4. Hash each of the 15MM lpids, and match them to the existing hash table
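
In plan form, that corresponds roughly to the shape below (a sketch only; costs and row counts are elided, the linked plan has the real numbers):

Hash Left Join
  Hash Cond: (a.lpid = b.lpid)
  ->  Seq Scan on parcels_qtr a
  ->  Hash
        ->  Seq Scan on postalvac_qtr b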

How large are these tables? You can check this by issuing:

SELECT pg_size_pretty(pg_relation_size('parcels_qtr'));
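
That function reports only the main table data; if you want indexes and TOAST included as well, and want to check both tables, pg_total_relation_size gives the full picture:

SELECT pg_size_pretty(pg_total_relation_size('parcels_qtr'));
SELECT pg_size_pretty(pg_total_relation_size('postalvac_qtr'));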

I'm almost certain that this hash join is spilling out to disk, and the way it's structured ("give me all of the data from both of these tables"), there's no way it won't.

The indices don't help, and can't. As long as you're asking for the entirety of a table, using an index would only make things slower -- postgres has to traverse the entire table anyway, so it might as well issue a sequential scan.

As for why the query performs differently from the EXPLAIN ANALYZE run, I suspect you're correct: the combination of (1) sending 15M rows to your client and (2) trying to display them is going to cause a significant slowdown above and beyond the actual query.

So, what can you do about it?

First, what is this query trying to do? How often do you want to grab all of the data in those two tables, completely unfiltered? If it's very common, you may want to consider going back to the requirements stage and figuring out another way to address that need (e.g. would it be reasonable to grab all the data for a given year and quarter instead?). If it's uncommon (say, a daily export), then 1 to 14 minutes might be fine.
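
For example, if a per-quarter extract would satisfy the need, a filtered version of your query touches far fewer rows, and an index on (yr, qtr) could then actually be used. The filter values below are placeholders, since I don't know how "qtr" is encoded in your data:

-- hypothetical filter values; an index to support them could be created with, e.g.:
-- CREATE INDEX ON parcels_qtr (yr, qtr);
select a.parcel, a.lpid, a.yr, a.qtr, b."vacCountY"
from parcels_qtr as a
left join postalvac_qtr as b on a.lpid = b.lpid
where a.yr = 2015
  and a.qtr = 'Q1';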

Second, you should make sure that your tables aren't bloated. If you experience significant update or delete traffic on your tables, that can grow them over time. The autovacuum daemon is there to help deal with this, but occasionally issuing a vacuum full will also help.
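
A quick way to gauge bloat is the dead-tuple counters in pg_stat_user_tables; if n_dead_tup is large relative to n_live_tup, a VACUUM FULL (which rewrites the table and takes an exclusive lock, so run it in a maintenance window) can be worthwhile:

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN ('parcels_qtr', 'postalvac_qtr');

VACUUM (FULL, ANALYZE) parcels_qtr;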

Third, you can try tuning your DB config. In postgresql.conf, there are parameters for things like the expected amount of RAM that your server can use for disk cache, and the amount of RAM the server can use for sorting or joining (before it spills out to disk). By tinkering with these sorts of parameters, you might be able to improve the speed.
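
The parameters most relevant here are work_mem (memory each sort or hash can use before spilling to disk) and effective_cache_size (the planner's estimate of available cache). As a quick experiment, you can raise work_mem for a single session before touching postgresql.conf; the value below is illustrative, not a recommendation:

-- try a larger per-operation memory budget for this session only
SET work_mem = '256MB';
-- then re-run EXPLAIN (ANALYZE, BUFFERS) and check whether the Hash node still reports Batches > 1 (i.e., spilling)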

Fourth, you might want to revisit your schema. Do you want year and quarter as two separate columns, or would you be better off with a single column of the date type? Do you want a text key, or would you be better off with a bigint (either serial or derived from the text column), which will likely join more quickly? Are the parcel, yr, and qtr fields actually needed in both tables, or are they duplicate data in one table?
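
As a rough sketch of the surrogate-key idea (the column name lpid_num is made up here, and backfilling and indexing 15.5 million rows is itself a sizable one-time job):

-- hypothetical: add a numeric key on the big table and copy it to the smaller one
ALTER TABLE parcels_qtr   ADD COLUMN lpid_num bigserial;
ALTER TABLE postalvac_qtr ADD COLUMN lpid_num bigint;

UPDATE postalvac_qtr b
   SET lpid_num = a.lpid_num
  FROM parcels_qtr a
 WHERE a.lpid = b.lpid;

CREATE INDEX ON postalvac_qtr (lpid_num);
-- future joins can then use: on a.lpid_num = b.lpid_num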

Anyway, I hope this helps.

answered Oct 23 '22 by jmelesky