Determine the critical path in the data flow

Question

In the book Computer Systems: A Programmer's Perspective, the Exercise 5.5 shows a piece of code to compute the value of a polynomial

double poly(double a[], double x, int degree)
{
    long int i;
    double result = a[0];
    double xpwr = x;
    for (i = 1; i <= degree; i++) {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}

The exercise assumes that the clock cycles that are needed by double-precision floating-point addition and multiplication are 3 and 5 respectively. The reader is asked to explain why the measured CPE (Cycles Per Element) value is 5.

According to the exercise answer, in each iteration, we need to update the variables xpwr and result, and the operations we need is a floating-point addition (for result) and a floating-point multiplication (for xpwr), so the latter dominates the latency, causing the ultimate CPE to be 5.

But I think the data flow should be something like this:

xpwr               result
  |                  |
  +-----+ +--[load]  |
  |     | |          |
[mul]  [mul]         |
  |      |           |
  |      +---+ +-----+
  |          | |
  |         [add]
  |           |
  |           +------+
  |                  |
xpwr               result

So the longest path is from the previous value of xpwr to the new value of result, going through the execution units [mul] and [add]. Therefore the longest time should be 8 cycles.

I want to ask

What is exactly the meaning of a critical path? And how to determine it?
Which answer (mine and the book) is more reasonable?

Any explanation about the CPU, architecture, execution units, pipeline, floating-point unit will be appreciated.

Christos Christidis · Accepted Answer

I know I'm a bit late to the party, but the book is absolutely right. As you can verify for yourselves by timing the code, CPE is indeed 5, so the second answer is wrong.

But the first one is also wrong. It says that the MULs must be performed at the same time which is not at all possible in the Nehalem architecture (and I suspect, most modern processors). Remember that there's only one FP MUL unit and a different FP ADD unit (as shown in the book, ed. 2011 and later)

What happens instead is this:

(LOADs are assumed always present, just 1 cycle if in cache)

First we feed xpwr *= x in MUL. Immediately after that we feed xpwr*a[i] (remember the pipeline!)

... after 5 cycles we'll get the new value of xpwr, and after 6 cycles we'll have the result of xpwr*a[i]. At that point, a new computation of xpwr *= x will be at stage 1 of MUL. So we have only 4 more cycles in which to do the rest of the ops if we don't want to be restricted by them.

Of course, that is easy as we only need 3 cycles for the FP ADD to get the new result.

So, it becomes obvious that the limiting factor is the computation of xpwr. Which means that in looking for the critical path (whatever that is) we have to look specifically at the path from the old values to the new. In this case, the path for result consists only of one FP ADD! (that's what threw me off at first too)

Opundo · Answer

The critical path is indeed 8 cycles, but the question asks for the CPE, which is like the time-averaged time to output one more cycle of the loop.

Other than the first and the last cycle, the processor can do the addition from the previous iteration of the loop and the current multiplications at the same time, because the operands are not dependent on each other. The first iteration of the loop takes the full 8 cycles, but all iterations after, the loop only takes 5 cycles to run, making the actual CPE 5 cycles.

P.S. I do agree that the book's way of presenting critical path is confusing. Their definition of the critical path is not simply the path that takes the longest path, but the path also has to have operations that have operands that depends on previous operations and so have to be in order. This definition makes finding the critical path rather not intuitive.

Lin · Answer

A1: The critical path is the longest one in the data flow graph according to the book, which must be on a straight line and has an effect on a single register, rather than add up 'mul' and 'add', whose results are only intermediate operands for the next operations.

About this question, you'll get it all done if continuing to read the rest. Especially, comparing combine7's data flow graph and combine5's one can be helpful.

A2: If A1 is understood, Question2 is clear, the answer in book is reasonable.

Determine the critical path in the data flow

Tags:

cpu-architecture

cpu

computer-architecture

cpu-cycles

pjhades

3 Answers

Christos Christidis

Opundo

Lin

Recent Activity

Donate For Us

Determine the critical path in the data flow

Tags:

cpu-architecture

cpu

computer-architecture

cpu-cycles

pjhades

3 Answers

Christos Christidis

Opundo

Lin

Related questions

Recent Activity

Donate For Us