I am trying to calculate the cost of the (most efficient) block nested loop join in terms of NDPR (number of disk page reads). Suppose you have a query of the form:
SELECT COUNT(*)
FROM county JOIN mcd
ON count.state_code = mcd.state_code
AND county.fips_code = mcd.fips_code
WHERE county.state_code = @NO
where @NO is substituted for a state code on each execution of the query.
I know that I can derive the NPDR using: NPDR(R x S) = |Pages(R)| + Pages(R) / B - 2 . |P ages(S)|
(where the smaller table is used as the outer in order to produce less page reads. Ergo: R = county, S = mcd).
I also know that Page size = 2048 bytes
Pointer = 8 byte
Num. rows in mcd table = 35298
Num. rows in county table = 3141
Free memory buffer pages B = 100
Pages(X) = (rowsize)(numrows) / pagesize
What I am trying to figure out is how the "WHERE county.state_code = @NO
" affects my cost?
Thanks for your time.
The formula for calculating the number of times the inner loop executes is the number of times the outer loop repeats multiplied by the number of times the inner loop repeats.
A Block Nested-Loop (BNL) join algorithm uses buffering of rows read in outer loops to reduce the number of times that tables in inner loops must be read.
Nested-Loop Join Algorithm A simple nested-loop join (NLJ) algorithm reads rows from the first table in a loop one at a time, passing each row to a nested loop that processes the next table in the join. This process is repeated as many times as there remain tables to be joined.
Cost Analysis of Nested-loop Join Algorithm In the best case, both relations r and s have sufficient memory to fit in the memory simultaneously. So, each block will read only once. Thus, Total number of block transfers in best case = br + bs.
First a couple of observations regarding the formula you wrote:
I'm not sure why it you write "B - 2" instead of "B - 1". From a theoretical perspective, you need a single buffer page to read in relation S (you can do it by reading one page at a time).
Make sure you use all the brackets. I would write the formula as: NPDR(R x S) = |Pages(R)| + |Pages(R)| / (B-2) * |Pages(S)|
The all numbers in the formula would need to be rounded up (but this is nitpicking).
The explanation for the generic BNLJ formula:
You read in as many tuples from the smaller relation (R) as you can keep in memory (B-1 or B-2 pages worth of tuples).
For each group of B-2 pages worth of tuples, you then have to read the whole S relation ( |Pages(S)|) to perform the join for that particular range of relation R.
At the end of the join, relation R is read exactly one time and relation S is read as many times as we filled the memory buffer, namely |Pages(R)| / (B-2)
times.
Now the answer:
In your example a selection criteria is applied to relation R (table Country in this case). This is the WHERE county.state_code = @NO
part of the query. Therefore, the generic formula does not apply directly.
When reading from relation R (i.e., table Country in your example), we can discard all the non-qualifying tuples that do not match the selection criteria. Assuming that there are 50 states in the USA and that all states have the same number of counties, only 2% of the tuples in table Country qualify on average and need to be stored in memory. This reduces the number of iteration of the inner loop of the join (i.e., the number of times we need to scan relation S / table mcs). The 2% number is obviously just the expected average and will change depending on the actual given state.
The formula for your problem therefore becomes: NPDR(R x S) = |Pages(County)| + |Pages(County)| / (B - 2) * |Counties in state @NO| / |Rows in table County| * |Pages(Mcd)|
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With