My teacher tends to gave us a problem every year on the finals and seems that nobody gave him the expected result.Personally I don't have any clue how to solve it. Here is the problem
Let's consider an array of constants A[a0 a1 a2 a3 a4 a5 a6 a7]
, in which each element is a natural number on 16 bit and an array U of elements procured in real time U=[u0 u1 u2 u3 u4 u5 u6 u7]
in which each element is left aligned and is represented on 12 bits.The dot product of the two vectors is Y=A*U^
where ^
is the transpose operator.
a) Write the sequence of instruction for calculating the dot product Y considering all the numeric values available at consecutive addresses.Considering for each instruction a cycle machine like execution time, evaluate the time of execution of the Y.The final result will be stored in the general registers.
b) Explain the components of the hardware block of a DSP microprocessor which permit the lowering execution time for the Y.
From the scale of correction I could find :
a)
The memory management of the list of coefficients and circular buffers for sampling) 1p
The administration of the addressing pointers 0.5p
The multiply and addition operations ( size operands and dimension of result) 1p The cyclic execution for obtaining the result 0.5 p.
b)
zero overhead
1pFor the first task I have some ideas.He gave us an hint telling us that even if the U values are on 12 bits the 8086 processor will fetch 16 bits, and this seems to be the thing that all the others students doesn't seem to observe.For the second item I have no clue.
Some general guidelines:
CL
register. A series of single shifts/rotates is much faster.This is a version of the dot product calculation:
xor bx, bx ;3
xor cx, cx ;3
mov si, 14 ;4
Again:
mov ax, [U + si] ;8 + EA (=9)
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
mul word ptr [A + si] ;124-139 + EA (=9)
add bx, ax ;3
adc cx, dx ;3
sub si, 2 ;4
jnb Again ;16 if taken, 4 if not taken
mov ax, bx ;2
mov dx, cx ;2
Since for the U array "each element is left aligned and is represented on 12 bits" the series of shifts normalizes the value.
By iterating starting at the end of both arrays a cmp
was avoided on the loop control.
Moving the result to DX:AX seems more natural. To be removed if not needed.
Because mul
exhibits variable execution times there are 2 cases to consider:
Partially unrolling will reveal a 5% speed increase at the expense of less compact code (from 36 bytes to 56 bytes).
xor bx, bx ;3
xor cx, cx ;3
mov si, 10 ;4
Again:
mov ax, [U + si + 2] ;8 + EA (=9)
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
mul word ptr [A + si + 2] ;124-139 + EA (=9)
add bx, ax ;3
adc cx, dx ;3
mov ax, [U + si] ;8 + EA (=9)
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
shr ax, 1 ;2
mul word ptr [A + si] ;124-139 + EA (=9)
add bx, ax ;3
adc cx, dx ;3
sub si, 4 ;4
jnb Again ;16 if taken, 4 if not taken
mov ax, bx ;2
mov dx, cx ;2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With