I am struggling with how to implement arithmetic on fixed-point numbers of different precision. I have read the paper by R. Yates, but I'm still lost. In what follows, I use Yates's notation, in which A(n,m) designates a signed fixed-point format with n integer bits, m fraction bits, and n + m + 1 bits overall.
Short question: How exactly are A(a,b)*A(c,d) and A(a,b)+A(c,d) carried out when a != c and b != d?
Long question: In my FFT algorithm, I am generating a random signal with values between -10V and 10V. The signed input (in) is scaled to A(15,16), and the twiddle factors (tw) are scaled to A(2,29). Both are stored as ints. Something like this:
float temp = (((float)rand() / (float)RAND_MAX) * (MAX_SIG - MIN_SIG)) + MIN_SIG;
in_seq[i][j] = (int)(roundf(temp * (1 << numFracBits))); // numFracBits == 16 for A(15,16)
And similarly for the twiddle factors.
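Presumably something like this sketch, where angle, tw_re, and tw_im are illustrative names (my actual code differs), and the 29 fraction bits give the A(2,29) scaling:

tw_re[k] = (int)(roundf(cosf(angle) * (1 << 29)));  // twiddle real part in A(2,29)
tw_im[k] = (int)(roundf(-sinf(angle) * (1 << 29))); // twiddle imaginary part in A(2,29)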
Now I need to perform
res = a*tw
Questions:
a) How do I implement this?
b) Should the size of res be 64 bits?
c) Can I make res A(17,14), since I know the ranges of a and tw? If yes, should I be scaling a*tw by 2^14 to store the correct value in res?
Then I need to perform
a + res
Questions:
a) Can I add these two numbers of different Q formats directly?
b) If not, how do I do this operation?
In computing, a fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (the decimal point '.' in English decimal notation).

The benefit of fixed-point arithmetic is that it is as straightforward and efficient as integer arithmetic on a computer: we can reuse all the hardware built for integer arithmetic to perform real-number arithmetic in a fixed-point representation.

Because all arithmetic on fixed-point numbers is the same as on plain integers, we can simply reuse the integer type int in C to perform fixed-point arithmetic. The position of the binary point only matters when we print the value on screen or perform arithmetic on operands of different "types" (such as when adding an int to a fixed<32,6>).

Fixed-point representation thus allows us to use fractional numbers on low-cost integer hardware. To lower the cost of an implementation, many digital signal processors are designed to perform arithmetic operations only on integers. To represent fractional numbers on these processors, we use an implied binary point.
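As a minimal sketch of the implied binary point (assuming a hypothetical A(7,8) format stored in a 16-bit integer; the names are illustrative):

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define FRAC_BITS 8 /* assumed A(7,8): 7 integer bits, 8 fraction bits, 1 sign bit */

int main(void) {
    float x = 2.5f;
    int16_t fx = (int16_t)roundf(x * (1 << FRAC_BITS)); /* 2.5 * 256 = 640 */
    float back = (float)fx / (1 << FRAC_BITS);          /* 640 / 256 = 2.5 */
    printf("stored integer: %d, recovered value: %f\n", fx, back);
    return 0;
}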
Maybe it's easiest to make an example.
Suppose you want to add two numbers, one in the format A(3, 5), and the other in the format A(2, 10).

You can do it by converting both numbers to a "common" format - that is, they should have the same number of bits in the fractional part.

A conservative way of doing that is to choose the greater number of bits. That is, convert the first number to A(3, 10) by shifting it 5 bits left. Then, add the second number.
The result of an addition has the range of the greater format, plus 1 bit. In my example, if you add A(3, 10) and A(2, 10), the result has the format A(4, 10).
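A minimal sketch in C (the variable names and the sample values 1.5 and 0.25 are illustrative):

/* x holds 1.5 in A(3, 5): 1.5 * 2^5 = 48; y holds 0.25 in A(2, 10): 0.25 * 2^10 = 256 */
int x = 48;              /* A(3, 5) */
int y = 256;             /* A(2, 10) */
int x_aligned = x << 5;  /* A(3, 5) -> A(3, 10): same value, now 10 fraction bits */
int sum = x_aligned + y; /* A(4, 10): 1792 == 1.75 * 2^10 */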
I call this the "conservative" way because you cannot lose information - it guarantees that the result is representable in the fixed-point format, without losing precision. However, in practice, you will want to use smaller formats for your calculation results. To do that, consider this idea: convert the more precise number down instead - for example, from A(2, 10) to A(2, 5), by shifting the integer right by 5 bits. This will lose precision, and usually this precision loss is not problematic, because you are going to add a less-precise number to it anyway.
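Continuing the sketch above, the cheaper, precision-losing addition might look like:

int y_reduced = y >> 5;   /* A(2, 10) -> A(2, 5): drops the 5 least significant fraction bits */
int sum2 = x + y_reduced; /* A(4, 5): 48 + 8 == 56, i.e. 1.75 */

Now, on multiplication.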
It's possible to multiply two fixed-point numbers directly - they can be in any format. The format of the result is the "sum of the input formats" - all the parts added together - plus 1 extra bit in the integer part. In my example, multiplying A(3, 5) with A(2, 10) gives a number in the format A(6, 15). This is a conservative rule - the output format is able to store the result without loss of precision, but in applications, you almost always want to cut the precision of the output, because it's just too many bits.
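Continuing the same sketch, the raw integers simply multiply, and the fraction bits add:

int prod = x * y; /* A(6, 15): 48 * 256 == 12288 == 0.375 * 2^15, and 1.5 * 0.25 == 0.375 */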
In your case, where the number of bits for all numbers is 32, you probably want to lose precision in such a way that all intermediate results have 32 bits.

For example, multiplying A(17, 14) with A(2, 29) gives A(20, 43) - 64 bits required. You should probably keep only 32 bits of it and throw away the rest. What is the range of the result? If your twiddle factor is a number up to 4, the result is probably limited by 2^19 (the conservative number 20 above is only needed to accommodate the edge case of multiplying -1 << 31 by -1 << 31 - it's almost always worth rejecting this edge case).

So use A(19, 12) for your output format, i.e. remove 31 bits from the fractional part of your output.
So, instead of

res = a*tw;

you probably want

int64_t res_tmp = (int64_t)a * tw;      // A(20, 43)
if (res_tmp == ((int64_t)1 << 62))      // you might want to neglect this edge case
    --res_tmp;                          // A(19, 43)
int32_t res = (int32_t)(res_tmp >> 31); // A(19, 12)
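As for the addition a + res from your question, the same alignment rule applies. A sketch, assuming a is in A(15, 16) as stated in the question and res is in the A(19, 12) format chosen above:

int32_t a_aligned = a >> 4;    // A(15, 16) -> A(15, 12): drop 4 fraction bits
int32_t sum = a_aligned + res; // nominally A(20, 12), which needs 33 bits; whether it
                               // fits in 32 depends on the actual ranges of a and res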