Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding postgres explain w/ bitmap heap/index scans

Tags:

postgresql

I have a table w/ 4.5 million rows. There is no primary key. The table has a column p_id, type integer. There's an index, idx_mytable_p_id on this column using the btree method. I do:

SELECT * FROM mytable WHERE p_id = 123456; 

I run an explain on this and see the following output:

Bitmap Heap Scan on mytable  (cost=12.04..1632.35 rows=425 width=321)   Recheck Cond: (p_id = 543094)   ->  Bitmap Index Scan on idx_mytable_p_id  (cost=0.00..11.93 rows=425 width=0)         Index Cond: (p_id = 543094) 

Questions:

  • Why is that query doing a heap scan and then a bitmap index scan?
  • Why is it examining 425 rows? Why is the width of the operation 321?
  • What is the cost of 12.04..1632.35 and 0.00..11.93 telling me?

For the record there are 773 rows with the p_id value of 123456. There are 38 columns on mytable.

Thanks!

like image 466
Wells Avatar asked Apr 13 '12 16:04

Wells


People also ask

What is bitmap heap scan in postgresql?

A bitmap heap scan takes a row location bitmap generated by a Bitmap Index Scan (either directly, or through a series of bitmap set operations via BitmapAnd and BitmapOr nodes) and looks up the relevant data.

What is parallel bitmap heap scan?

Blocks are handed out one at a time, so that access to the table remains sequential. In a parallel bitmap heap scan, one process is chosen as the leader. That process performs a scan of one or more indexes and builds a bitmap indicating which table blocks need to be visited.

Why is Postgres using seq scan instead of index?

If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan. This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap).

What is seq scan in Postgres?

Seq Scan. The Seq Scan operation scans the entire relation (table) as stored on disk (like TABLE ACCESS FULL ). Index Scan. The Index Scan performs a B-tree traversal, walks through the leaf nodes to find all matching entries, and fetches the corresponding table data.


2 Answers

Why is that query doing a heap scan and then a bitmap index scan?

It's not, exactly. EXPLAIN output shows the structure of the execution nodes, with the ones on the "higher" level (not indented as far) pulling rows from the nodes below them. So when the Bitmap Heap Scan node goes to pull its first row the Bitmap Index Scan runs to determine the set of rows to be used, and passes information on the first row to the heap scan. The index scan passes the index to determine which rows need to be read, and the heap scan actually reads them. The idea is that by reading the heap from beginning to end rather than in index order it will do less random access -- all matching rows from a given page will be read when that page is loaded, and enough pages may be read in order to use cheaper sequential access rather than seeking back and forth all over the disk.

Why is it examining 425 rows?

It's not. You ran EXPLAIN, which just shows you estimates and the chosen plan, it doesn't really examine the rows at all. That makes the value of EXPLAIN rather limited compared to running EXPLAIN ANALYZE, which actually runs the query and shows you the estimates and the actual numbers.

Why is the width of the operation 321?

Apparently that's the size, in bytes, of the tuples in mytable.

What is the cost of 12.04..1632.35 and 0.00..11.93 telling me?

The first number is the cost to return the first row from that node; the second number is the cost to return all of the rows for that node. Remember, these are estimates. The unit is an abstract cost unit. The absolute number means nothing; what matters in planning is which plan has the lowest cost. If you are using a cursor the first number matters; otherwise it is usually the second number. (I think it interpolates for a LIMIT clause.)

It is often necessary to adjust configurable cost factors, such as random_page_cost and cpu_tuple_cost, to accurately model the costs within your environment. Without such adjustments the comparative costs are likely to not match the corresponding run times, so a less-than-optimal plan might be chosen.

like image 54
kgrittn Avatar answered Sep 17 '22 14:09

kgrittn


re 1) execution plans have to be read from the inner most node to the outermost node. So it's first doing an index scan (to find the rows) and the accessing the actual table to return the rows the index scan found

re 2) the number of rows shown in the plan is just an estimation based on the statistics and as such 425 vs. 773 sounds fairly reasonable. If you want to see real figures, use explain analyze

re 3) the first number in the cost figure is the "startup" cost to intialize the step of the planner, the second cost is the total cost of that step.

This is all documented in the manual: http://www.postgresql.org/docs/current/static/using-explain.html

You might want to go through these links in the PostgreSQL Wiki as well:

PostgreSQL EXPLAIN
Using Explain

like image 20
a_horse_with_no_name Avatar answered Sep 16 '22 14:09

a_horse_with_no_name