**Introduction:**

Scaling adjusts the dynamic range of a number so that it fits into a specified format.

During addition, overflow occurs if the result exceeds the most positive value of the range or falls below the most negative value of the range. When the result is out of range, either normalization or saturation is used. Normalization handles the overflow by scaling the result. Saturation replaces the result with the most positive or the most negative representable value, depending on the direction of the overflow.
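The two overflow strategies can be sketched in C as follows. This is only an illustration for 16-bit words: the function names are made up for this example, and halving the sum is just one possible way of "scaling the result" (it is an assumption, not something the text above prescribes).

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add: clamp to the 16-bit range instead of wrapping around. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* a 32-bit sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Normalizing add (assumed variant): halve the sum so that it always fits.
 * The result then has one fractional bit fewer than the inputs,
 * e.g. Q15 operands give a Q14 result. */
int16_t norm_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    return (int16_t)(sum / 2);
}

/* Usage: 30000 + 10000 = 40000 does not fit into 16 bits. */
void overflow_example(void)
{
    assert(sat_add16(30000, 10000) == INT16_MAX);  /* clamped to 32767 */
    assert(norm_add16(30000, 10000) == 20000);     /* 40000 scaled by 1/2 */
}
```

Saturation keeps the Q-format of the operands but loses the true value, while the scaled result stays proportional to the true sum at the cost of one bit of precision.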

**Fixed point addition:**

Let us consider two 16-bit operands, A in Q13 format (13 fractional bits) and B in Q11 format (11 fractional bits).

A (Q13): s | x | x | . | x | x | x | x | x | x | x | x | x | x | x | x | x |

B (Q11): s | x | x | x | x | . | x | x | x | x | x | x | x | x | x | x | x |

**Note:** The decimal (radix) point in Q-format notation is imaginary. It is not stored in the word; it only determines how the bits are interpreted.

**For fixed point addition, the two operands should be in the same Q-format.**
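To illustrate that the point is only a matter of interpretation, the following sketch reads the same 16-bit word under two different Q-formats. The helper name `q_value` is hypothetical, and 25313 is the word that appears in the worked example later in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a raw 16-bit word as a Q(frac_bits) number.
 * The stored bits are identical; only the assumed scale factor changes. */
double q_value(int16_t raw, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    return (double)raw / (double)(1L << frac_bits);
}
```

With this helper, `q_value(25313, 11)` is about 12.36, while `q_value(25313, 15)` is about 0.7725, although the stored word is the same in both cases.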

**Steps to perform fixed point addition**

1. Convert float to fixed point.

2. Bring the two operands to the same Q-format.

3. Finally, perform the arithmetic addition.
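For operands that are already stored as fixed point, such as the Q13 and Q11 operands above, step 2 amounts to a rescaling. A minimal sketch, assuming the common format is the one with fewer fractional bits (Q11) and that dropped bits are truncated; `q_reduce` and `add_q13_q11` are names invented for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Convert x from Q(from_frac) to Q(to_frac), where to_frac <= from_frac.
 * The lowest (from_frac - to_frac) fractional bits are discarded. */
int16_t q_reduce(int16_t x, int from_frac, int to_frac)
{
    assert(to_frac >= 0 && from_frac >= to_frac && from_frac <= 15);
    int32_t scale = (int32_t)1 << (from_frac - to_frac);
    return (int16_t)(x / scale);             /* truncates toward zero */
}

/* Add a Q13 operand and a Q11 operand; the result is in Q11.
 * The caller is assumed to guarantee that the sum fits into 16 bits. */
int16_t add_q13_q11(int16_t x_q13, int16_t y_q11)
{
    int16_t x_q11 = q_reduce(x_q13, 13, 11); /* align the radix points */
    return (int16_t)(x_q11 + y_q11);
}
```

For example, 3.12 stored in Q13 is 25559 and 12.36 stored in Q11 is 25313; `add_q13_q11(25559, 25313)` first reduces 25559 to 6389 and then returns 31702, which is roughly 15.48 in Q11.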

__Convert float to fixed point__

Example:

float a = 12.36, b = 3.12;

__1st Method__

1. Normalize both float values by 2 raised to the larger Q_{I}, where Q_{I} is the number of bits required to represent the integer part (`^` denotes exponentiation here):

if (Q_{I}a > Q_{I}b)

{

a_norm = a / (2 ^ Q_{I}a);

b_norm = b / (2 ^ Q_{I}a);

}

else

{

a_norm = a / (2 ^ Q_{I}b);

b_norm = b / (2 ^ Q_{I}b);

}

Q_{I}a = 4 bits, Q_{I}b = 2 bits (Q_{I}a > Q_{I}b), so both values are divided by 2 ^ 4:

a_norm = 12.36 / (2 ^ 4) = 0.7725;

b_norm = 3.12 / (2 ^ 4) = 0.195;

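2. Convert the normalized values to fixed point (assuming word length WL = 16 bits, i.e. 1 sign bit and 15 fractional bits; the fractional remainder is truncated)

a_fixed = a_norm * (2 ^ 15) = 0.7725 * (2 ^ 15) = 25313;

b_fixed = b_norm * (2 ^ 15) = 0.195 * (2 ^ 15) = 6389;

Now both operands are in the same Q-format: Q15 with respect to the normalized values, which is equivalent to Q11 with respect to the original values because both were divided by 2 ^ 4.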

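3. Finally, perform the arithmetic addition of the two operands, which are now in the same Q-format.

addition_fixed = 25313 + 6389 = 31702

Interpreted with 11 fractional bits, 31702 / (2 ^ 11) is approximately 15.4795, close to the exact sum 12.36 + 3.12 = 15.48.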
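The first method can be sketched in C as below. This is an illustration under a few assumptions rather than a reference implementation: the function names are invented, Q_{I} is taken to be the smallest n with |x| < 2^n, and the conversion truncates toward zero as in the figures above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of integer bits needed for |x|: the smallest n with |x| < 2^n. */
int integer_bits(double x)
{
    double mag = (x < 0.0) ? -x : x;
    int n = 0;
    while (mag >= 1.0) {
        mag /= 2.0;
        n++;
    }
    return n;
}

/* Method 1: divide both operands by 2^max(QI) so that |value| < 1,
 * then convert the normalized values to Q15 in a 16-bit word. */
void method1_to_fixed(double a, double b, int16_t *a_fixed, int16_t *b_fixed)
{
    int qa = integer_bits(a);
    int qb = integer_bits(b);
    int qi = (qa > qb) ? qa : qb;
    assert(qi <= 15);                         /* must fit a 16-bit word */

    double scale = (double)(1L << qi);
    double a_norm = a / scale;
    double b_norm = b / scale;

    *a_fixed = (int16_t)(a_norm * 32768.0);   /* 2^15, truncates toward zero */
    *b_fixed = (int16_t)(b_norm * 32768.0);
}
```

Called with 12.36 and 3.12, the function yields 25313 and 6389, whose sum is 31702. Note that the normalization only guarantees that each operand fits; the sum can still overflow when the normalized values add up to 1 or more.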

__2nd Method__

1. Find the maximum Q_{I} of the two operands

Q_{I}a = 4, Q_{I}b = 2, so the maximum is 4.

2. Assuming WL is 16 bits,

Number of fractional bits = 16 - (4 + 1 sign bit) = 11, i.e. Q11 format

a_fixed = (short)(12.36 * (2 ^ 11)) = 25313.

b_fixed = (short)(3.12 * (2 ^ 11)) = 6389.

3. Finally, perform the arithmetic addition of the two operands, which are in the same Q-format.

addition_fixed = 25313 + 6389 = 31702
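A compact C sketch of the second method follows. The names `float_to_q` and `method2_add` are hypothetical, the maximum Q_{I} is passed in by the caller, and the sum is assumed to fit into 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value to Q(frac_bits) in a 16-bit word.
 * The fractional remainder is truncated toward zero. */
int16_t float_to_q(double x, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits <= 15);
    return (int16_t)(x * (double)(1L << frac_bits));
}

/* Method 2: derive the common Q-format from the larger integer bit
 * count, convert both operands directly, then add. */
int16_t method2_add(double a, double b, int qi_max)
{
    int frac_bits = 16 - (qi_max + 1);        /* WL - (integer bits + sign) */
    int16_t a_fixed = float_to_q(a, frac_bits);
    int16_t b_fixed = float_to_q(b, frac_bits);
    return (int16_t)(a_fixed + b_fixed);      /* assumes no overflow */
}
```

`method2_add(12.36, 3.12, 4)` converts both values to Q11 and returns 31702, the same word as the first method. The second method simply skips the intermediate normalized values.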

**Steps to perform fixed point subtraction**

Follow steps 1 and 2 as explained for fixed point addition.

Finally, perform the subtraction on the two fixed point operands, which are in the same Q-format.
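As a small sketch, subtraction of two operands that already share a Q-format might look like this in C. The saturation on overflow is an added assumption carried over from the introduction, and `sat_sub16` is an invented name.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract two 16-bit fixed-point values that share a Q-format,
 * clamping the result to the 16-bit range if it overflows. */
int16_t sat_sub16(int16_t a, int16_t b)
{
    int32_t diff = (int32_t)a - (int32_t)b;   /* a 32-bit difference cannot overflow */
    if (diff > INT16_MAX) return INT16_MAX;
    if (diff < INT16_MIN) return INT16_MIN;
    return (int16_t)diff;
}

/* Usage with the operands from the addition example, both in Q11. */
void subtraction_example(void)
{
    /* 25313 - 6389 = 18924, about 9.24 when read as Q11 */
    assert(sat_sub16(25313, 6389) == 18924);
}
```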
