I have a table w/ 4.5 million rows. There is no primary key. The table has a column p_id
, type integer. There's an index, idx_mytable_p_id
on this column using the btree
method. I do:
SELECT * FROM mytable WHERE p_id = 123456;
I run an explain on this and see the following output:
Bitmap Heap Scan on mytable (cost=12.04..1632.35 rows=425 width=321) Recheck Cond: (p_id = 543094) -> Bitmap Index Scan on idx_mytable_p_id (cost=0.00..11.93 rows=425 width=0) Index Cond: (p_id = 543094)
Questions:
For the record there are 773 rows with the p_id
value of 123456. There are 38 columns on mytable
.
Thanks!
A bitmap heap scan takes a row location bitmap generated by a Bitmap Index Scan (either directly, or through a series of bitmap set operations via BitmapAnd and BitmapOr nodes) and looks up the relevant data.
Blocks are handed out one at a time, so that access to the table remains sequential. In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited.
If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan. This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap).
Seq Scan. The Seq Scan operation scans the entire relation (table) as stored on disk (like TABLE ACCESS FULL ). Index Scan. The Index Scan performs a B-tree traversal, walks through the leaf nodes to find all matching entries, and fetches the corresponding table data.
Why is that query doing a heap scan and then a bitmap index scan?
It's not, exactly. EXPLAIN output shows the structure of the execution nodes, with the ones on the "higher" level (not indented as far) pulling rows from the nodes below them. So when the Bitmap Heap Scan node goes to pull its first row the Bitmap Index Scan runs to determine the set of rows to be used, and passes information on the first row to the heap scan. The index scan passes the index to determine which rows need to be read, and the heap scan actually reads them. The idea is that by reading the heap from beginning to end rather than in index order it will do less random access -- all matching rows from a given page will be read when that page is loaded, and enough pages may be read in order to use cheaper sequential access rather than seeking back and forth all over the disk.
Why is it examining 425 rows?
It's not. You ran EXPLAIN, which just shows you estimates and the chosen plan, it doesn't really examine the rows at all. That makes the value of EXPLAIN rather limited compared to running EXPLAIN ANALYZE, which actually runs the query and shows you the estimates and the actual numbers.
Why is the width of the operation 321?
Apparently that's the size, in bytes, of the tuples in mytable
.
What is the cost of 12.04..1632.35 and 0.00..11.93 telling me?
The first number is the cost to return the first row from that node; the second number is the cost to return all of the rows for that node. Remember, these are estimates. The unit is an abstract cost unit. The absolute number means nothing; what matters in planning is which plan has the lowest cost. If you are using a cursor the first number matters; otherwise it is usually the second number. (I think it interpolates for a LIMIT clause.)
It is often necessary to adjust configurable cost factors, such as random_page_cost
and cpu_tuple_cost
, to accurately model the costs within your environment. Without such adjustments the comparative costs are likely to not match the corresponding run times, so a less-than-optimal plan might be chosen.
re 1) execution plans have to be read from the inner most node to the outermost node. So it's first doing an index scan (to find the rows) and the accessing the actual table to return the rows the index scan found
re 2) the number of rows shown in the plan is just an estimation based on the statistics and as such 425 vs. 773 sounds fairly reasonable. If you want to see real figures, use explain analyze
re 3) the first number in the cost figure is the "startup" cost to intialize the step of the planner, the second cost is the total cost of that step.
This is all documented in the manual: http://www.postgresql.org/docs/current/static/using-explain.html
You might want to go through these links in the PostgreSQL Wiki as well:
PostgreSQL EXPLAIN
Using Explain
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With