I am trying to do a left join between two tables in a PostgreSQL database and finding it takes about 14 minutes to run. From existing SO posts, it seems like this type of join should be on the order of seconds, so I'd like to know how to improve the performance of this join. I'm running 64-bit PostgreSQL version 9.4.4 on a Windows 8 machine with 8 GB of RAM, using pgAdmin III. The table structures are as follows:
Table A: "parcels_qtr":
parcel (text) | yr (int) | qtr (text) | lpid (pk, text) |
Has 15.5 million rows, each column is indexed, and "lpid" is the primary key. I also ran this table through a standard vacuum process.
Table B: "postalvac_qtr":
parcel (text) | yr (int) | qtr (text) | lpid (pk, text) | vacCountY (int) |
Has 618,000 records, all fields except "vacCountY" are indexed and "lpid" is the primary key. This also has gone through a standard vacuum process.
When running with data output, it takes about 14 minutes. When running with EXPLAIN (ANALYZE, BUFFERS), it takes a little over a minute. First question: is this difference in performance wholly attributable to printing the data, or is something else going on here? Second question: can I get this run time down to a few seconds?
Here is my SQL code:
EXPLAIN (ANALYZE, BUFFERS)
select a.parcel,
       a.lpid,
       a.yr,
       a.qtr,
       b."vacCountY"
from parcels_qtr as a
left join postalvac_qtr as b
  on a.lpid = b.lpid;
And here are the results of my explain statement: https://explain.depesz.com/s/uKkK
I'm pretty new to PostgreSQL, so patience and explanations would be greatly appreciated!
First of all, indexes are usually what speed up a query like this. If you do not have any, you should probably create some (depending on the queries you perform). And if you do multiple LEFT JOINs, you can often split them into separate queries, which can make the application work a lot faster.
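For reference, a join-key index would be created like this (a generic sketch; in this particular case both tables already have an index on lpid, since it is the primary key):

CREATE INDEX ON parcels_qtr (lpid);     -- redundant here: the primary key already provides one
CREATE INDEX ON postalvac_qtr (lpid);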
Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won't be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient.
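For illustration, here is a sketch using the question's tables (the literal lpid value is made up): a single-key lookup like this is exactly the case where the planner will prefer a nested loop driven by the primary-key indexes, because the outer side is tiny.

EXPLAIN
SELECT a.parcel, a.yr, a.qtr, b."vacCountY"
FROM parcels_qtr AS a
LEFT JOIN postalvac_qtr AS b ON a.lpid = b.lpid
WHERE a.lpid = 'some-lpid-value';  -- hypothetical key value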
Some of the tricks we used to speed up SELECTs in PostgreSQL: LEFT JOIN with redundant conditions, VALUES lists, extended statistics, primary-key type conversion, CLUSTER, and pg_hint_plan.
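As one sketch of the redundant-conditions idea applied to the question's tables (this assumes yr and qtr hold the same values in both tables for a given lpid, which the question implies but does not state):

SELECT a.parcel, a.lpid, a.yr, a.qtr, b."vacCountY"
FROM parcels_qtr AS a
LEFT JOIN postalvac_qtr AS b
       ON a.lpid = b.lpid
      AND a.yr = b.yr      -- logically redundant,
      AND a.qtr = b.qtr;   -- but extra equality conditions can sharpen the planner's estimates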
You're asking the DB to do quite a bit of work. Just looking at the explain plan, it's:

- reading an entire table (postalvac_qtr)
- building a hash on its lpid values
- reading another, much bigger table (parcels_qtr)
- hashing each of its rows' lpids and matching them to the existing hash table

How large are these tables? You can check this by issuing:
SELECT pg_size_pretty(pg_relation_size('parcels_qtr'));
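A related check, not part of the original suggestion, includes indexes and TOAST data in the total:

SELECT pg_size_pretty(pg_total_relation_size('parcels_qtr'))   AS parcels_total,
       pg_size_pretty(pg_total_relation_size('postalvac_qtr')) AS postalvac_total;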
I'm almost certain that this hash join is spilling out to disk, and the way it's structured ("give me all of the data from both of these tables"), there's no way it won't.
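One way to confirm a spill is to look at the Hash node in the EXPLAIN (ANALYZE, BUFFERS) output: it reports Buckets, Batches, and Memory Usage, and more than one batch means the hash table exceeded work_mem and was written out to disk in pieces. The current limit can be checked with:

SHOW work_mem;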
The indices don't help, and can't. As long as you're asking for the entirety of a table, using an index would only make things slower -- postgres has to traverse the entire table anyway, so it might as well issue a sequential scan.
As for why the query has different performance than the EXPLAIN ANALYZE, I suspect you're correct. A combination of 1) sending 15M rows to your client, and 2) trying to display them, is going to cause a significant slowdown above and beyond the actual query.
So, what can you do about it?
First, what is this query trying to do? How often do you want to grab all of the data in those two tables, completely unfiltered? If it's very common, you may want to consider going back to the requirements stage and figuring out another way to address that need (e.g. would it be reasonable to grab all the data for a given year and quarter instead, as sketched below?). If it's uncommon (say, a daily export), then 1–14 minutes might be fine.
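For instance, a year-and-quarter slice might look like this (a sketch; the yr and qtr values are made-up examples of a typical request, and assume qtr is stored as text like 'Q1'):

SELECT a.parcel, a.lpid, a.yr, a.qtr, b."vacCountY"
FROM parcels_qtr AS a
LEFT JOIN postalvac_qtr AS b ON a.lpid = b.lpid
WHERE a.yr = 2014        -- hypothetical values
  AND a.qtr = 'Q1';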
Second, you should make sure that your tables aren't bloated. If you experience significant update or delete traffic on your tables, that can grow them over time. The autovacuum daemon is there to help deal with this, but occasionally issuing a vacuum full will also help.
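If bloat is suspected, the manual route looks like this (a sketch; note that VACUUM FULL rewrites the whole table and holds an exclusive lock while it runs, so schedule it accordingly):

VACUUM (FULL, ANALYZE, VERBOSE) parcels_qtr;
VACUUM (FULL, ANALYZE, VERBOSE) postalvac_qtr;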
Third, you can try tuning your DB config. In postgresql.conf, there are parameters for things like the expected amount of RAM that your server can use for disk cache, and the amount of RAM the server can use for sorting or joining (before it spills out to disk). By tinkering with these sorts of parameters, you might be able to improve the speed.
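For example, work_mem can be raised for a single session to see whether the hash join stops spilling (the 256MB figure is just a value to experiment with on an 8 GB machine, not a recommendation):

SET work_mem = '256MB';  -- session-local; reverts when the connection closes
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.parcel, a.lpid, a.yr, a.qtr, b."vacCountY"
FROM parcels_qtr AS a
LEFT JOIN postalvac_qtr AS b ON a.lpid = b.lpid;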
Fourth, you might want to revisit your schema. Do you want year and quarter as two separate columns, or would you be better off with a single column of the date type? Do you want a text key, or would you be better off with a bigint (either serial or derived from the text column), which will likely join more quickly? Are the parcel, yr, and qtr fields actually needed in both tables, or are they duplicate data in one table?
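A sketch of the bigint-key idea (lpid_id is a hypothetical column name, and this assumes every lpid in postalvac_qtr also appears in parcels_qtr):

ALTER TABLE parcels_qtr ADD COLUMN lpid_id bigserial;    -- numeric surrogate key
ALTER TABLE postalvac_qtr ADD COLUMN lpid_id bigint;
UPDATE postalvac_qtr AS b
   SET lpid_id = a.lpid_id
  FROM parcels_qtr AS a
 WHERE a.lpid = b.lpid;
CREATE INDEX ON postalvac_qtr (lpid_id);                 -- then join on lpid_id instead of the text key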
Anyway, I hope this helps.