Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dot Product on 8086 and DSP microprocessor

My teacher tends to gave us a problem every year on the finals and seems that nobody gave him the expected result.Personally I don't have any clue how to solve it. Here is the problem

Let's consider an array of constants A[a0 a1 a2 a3 a4 a5 a6 a7], in which each element is a natural number on 16 bit and an array U of elements procured in real time U=[u0 u1 u2 u3 u4 u5 u6 u7] in which each element is left aligned and is represented on 12 bits.The dot product of the two vectors is Y=A*U^ where ^ is the transpose operator.

a) Write the sequence of instruction for calculating the dot product Y considering all the numeric values available at consecutive addresses.Considering for each instruction a cycle machine like execution time, evaluate the time of execution of the Y.The final result will be stored in the general registers.

b) Explain the components of the hardware block of a DSP microprocessor which permit the lowering execution time for the Y.

From the scale of correction I could find :

a)

  1. The memory management of the list of coefficients and circular buffers for sampling) 1p

  2. The administration of the addressing pointers 0.5p

  3. The multiply and addition operations ( size operands and dimension of result) 1p The cyclic execution for obtaining the result 0.5 p.

b)

  1. Different management oof memory, parallel hardware management of the pointers 1p
  2. Instruction of type Multiply and Add 0.5 p
  3. Multiple instruction for parallel execution 0.5 p
  4. Instruction cyclic of type zero overhead 1p
  5. Interruption request from a timer for the generation of sampling period 1p
  6. The procure of the current sample 1p
  7. The evaluation of the subroutine of interruption. 0.5p
  8. The relation between the sampling period and the execution time of the interruption routine. 0.5p

For the first task I have some ideas.He gave us an hint telling us that even if the U values are on 12 bits the 8086 processor will fetch 16 bits, and this seems to be the thing that all the others students doesn't seem to observe.For the second item I have no clue.

like image 529
Alberto12 Avatar asked Jun 24 '16 02:06

Alberto12


1 Answers

Some general guidelines:

  • Avoid having to use a segment override prefix to address the data. On a 8086 such a prefix incurs a 2 clock penalty.
  • Make sure the data is word aligned. The 8086 has a penalty of 4 clocks when addressing a word on an odd address.
  • Don't use shifting/rotating with a count in the CL register. A series of single shifts/rotates is much faster.
  • Even if the final result needs to be in memory, refrain from using that memory repeatedly in the calculations. Use temporary registers and transfer the result only at the end.

This is a version of the dot product calculation:

    xor  bx, bx                ;3
    xor  cx, cx                ;3
    mov  si, 14                ;4
Again:
    mov  ax, [U + si]          ;8 + EA (=9)
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    mul  word ptr [A + si]     ;124-139 + EA (=9)
    add  bx, ax                ;3
    adc  cx, dx                ;3
    sub  si, 2                 ;4
    jnb  Again                 ;16 if taken, 4 if not taken
    mov  ax, bx                ;2
    mov  dx, cx                ;2

Since for the U array "each element is left aligned and is represented on 12 bits" the series of shifts normalizes the value.
By iterating starting at the end of both arrays a cmp was avoided on the loop control.
Moving the result to DX:AX seems more natural. To be removed if not needed.

Because mul exhibits variable execution times there are 2 cases to consider:

  • Best case executing time is : 10 + (168 + 16) * 7 + (168 + 4) + 4 = 1474 clocks
  • Worst case executing time is : 10 + (183 + 16) * 7 + (183 + 4) + 4 = 1594 clocks

Partially unrolling will reveal a 5% speed increase at the expense of less compact code (from 36 bytes to 56 bytes).

    xor  bx, bx                ;3
    xor  cx, cx                ;3
    mov  si, 10                ;4
Again:
    mov  ax, [U + si + 2]      ;8 + EA (=9)
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    mul  word ptr [A + si + 2] ;124-139 + EA (=9)
    add  bx, ax                ;3
    adc  cx, dx                ;3
    mov  ax, [U + si]          ;8 + EA (=9)
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    shr  ax, 1                 ;2
    mul  word ptr [A + si]     ;124-139 + EA (=9)
    add  bx, ax                ;3
    adc  cx, dx                ;3
    sub  si, 4                 ;4
    jnb  Again                 ;16 if taken, 4 if not taken
    mov  ax, bx                ;2
    mov  dx, cx                ;2
  • Best case executing time is : 10 + (332 + 16) * 3 + (332 + 4) + 4 = 1394 clocks
  • Worst case executing time is : 10 + (362 + 16) * 3 + (362 + 4) + 4 = 1514 clocks
like image 68
Sep Roland Avatar answered Nov 12 '22 12:11

Sep Roland