A Machine Learning and Data Science Blog: Fixed point arithmetic

Introduction:

We know that a given DSP has a finite word length and each number has to be represented in that word length only. Due to the finite word length two numbers of different precision i.e. two numbers in different formats can’t be added directly. Hence given numbers for scaled down to same format.
Scaling is a process that is used to fit a given number into the specified format by adjusting the dynamic range.

During addition if the result is beyond the most positive value in that range or if the result is below the most negative value in that range then overflow occurs. Hence when the result is out of range, normalization or saturation is used. Normalization is a process that takes care of overflow situation by scaling the result. Saturation is a process by which the maximum positive value or the maximum negative value is taken based on the overflow condition.

Fixed point addition:

Let us consider operands with A-Q13 and B-Q11 format respectively.

Note: Decimal point in Q-format notation is just an imaginary point.

**For fixed point addition two operands should be in same Q-format.

Steps to perform fixed point addition

1. Convert float to fixed point.

2. Make two operands in same Q-format.

3. Finally, performing arithmetic addition.

Convert float to fixed point

Example:

float a = 12.36, b = 3.12;

1^st Method

1. Normalizing two float values with no. of bits required to represent integer part Q_I

If ( Q_Ia > Q_Ib)

{

a = a / (2 ^ Q_Ia);

b = b / (2 ^ Q_Ia);

}

else

{

a = a / (2 ^ Q_Ib);

b = b / (2 ^ Q_Ib);

}

Q_Ia = 4-bits, Q_Ib = 2-bits (Q_Ia > Q_Ib)

a_norm = 12.36 / (2 ^ 4); a = 0.7725;

b_norm = 3.12 / (2 ^ 4); b = 0.195;

2. Converting normalized values to fixed point (Assuming word length WL = 16-bit)

a_fixed = (a_norm X (2^15)); = (0.7725 * 2^15) = 25313;

b_fixed = (b_norm X (2^15)); = (0.195 * 2^15) = 6389;

Now both the operands are in the same Q-format i.e. Q15 format.

3. Finally, perform arithmetic addition between the two operands which are in same Q-format.

addition_fixed = (25313 + 6389)

= (31702)

2nd Method

1. Find the max of Q_I between two operands

Q_Ia = 4, Q_Ib = 2.

2. Assuming WL is 16-bit,

Q-format = (16 – (4 + 1 sign bit)) = 11bit

a_fixed = (12.36 * 2 ^ 11) = (short)25313.

b_fixed = (3.12 * 2 ^ 11) = (short) 6389.

3. Finally, perform arithmetic addition b/n two operands which are in same Q- format.

addition_fixed = (25313 + 6389)

= (31702)

Steps to perform fixed point subtraction:

Follow the Steps 1 & 2 as explained for fixed point addition

Finally perform subtraction on two fixed point operands which are in same format.

A Machine Learning and Data Science Blog

Fixed point arithmetic - Fixed point addition

1 comment:

Related Posts

Twitter Updates

Random Posts

Disclaimer

Recent Comments