Introduction:
Scaling is the process of fitting a given number into a specified fixed point format by adjusting its dynamic range.
During addition, overflow occurs if the result is above the most positive value or below the most negative value that the format can represent. When the result is out of range, either normalization or saturation is used. Normalization handles the overflow by scaling the result so that it fits again. Saturation clamps the result to the most positive or the most negative representable value, depending on the direction of the overflow.
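The two overflow strategies can be sketched in C as follows. This is only a minimal illustration, assuming 16-bit operands that are already in the same Q-format; the function names sat_add16 and norm_add16 are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturation: clamp the sum to the 16-bit range on overflow. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* widen so the sum cannot wrap */
    if (sum > INT16_MAX) return INT16_MAX;   /* positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;   /* negative overflow */
    return (int16_t)sum;
}

/* Normalization: on overflow, halve the sum and reduce the number of
 * fractional bits (*q) by one, so the scaled result fits again. */
int16_t norm_add16(int16_t a, int16_t b, int *q)
{
    assert(q != NULL);
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX || sum < INT16_MIN) {
        sum /= 2;    /* drop one fractional bit (truncates toward zero) */
        *q -= 1;     /* the result is now in Q(q-1) */
    }
    return (int16_t)sum;
}
```

Saturation keeps the Q-format but loses the true value, while normalization keeps the value (with one bit less precision) but changes the Q-format, so the caller has to track the new format.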
Fixed point addition:
Let us consider two operands, A in Q13 format and B in Q11 format.
A (Q13): s | x | x | . | x | x | x | x | x | x | x | x | x | x | x | x | x |
B (Q11): s | x | x | x | x | . | x | x | x | x | x | x | x | x | x | x | x |
Note: The decimal (binary) point in Q-format notation is only an imaginary point; it is not stored anywhere in the word.
For fixed point addition, the two operands must be in the same Q-format.
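For the A (Q13) and B (Q11) operands above, one way to bring them into the same Q-format is to reduce A to Q11 before adding. The sketch below assumes 16-bit words; the helper names to_lower_q and add_q13_q11 are hypothetical, and the input values used afterwards are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Re-express x, which has q_from fractional bits, with q_to fractional
 * bits (q_to <= q_from). The dropped bits are truncated toward zero. */
int16_t to_lower_q(int16_t x, int q_from, int q_to)
{
    assert(q_to >= 0 && q_from >= q_to && q_from <= 15);
    return (int16_t)(x / (1 << (q_from - q_to)));
}

/* Add a Q13 operand and a Q11 operand; the result is in Q11.
 * The caller is assumed to guarantee that the sum does not overflow. */
int16_t add_q13_q11(int16_t a_q13, int16_t b_q11)
{
    int16_t a_q11 = to_lower_q(a_q13, 13, 11);   /* align the binary points */
    return (int16_t)(a_q11 + b_q11);
}
```

For example, 1.5 in Q13 is 12288 and 3.25 in Q11 is 6656; add_q13_q11(12288, 6656) gives 9728, which is 4.75 in Q11.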
Steps to perform fixed point addition
1. Convert the float values to fixed point.
2. Bring the two operands into the same Q-format.
3. Finally, perform the arithmetic addition.
Convert float to fixed point
Example:
float a = 12.36, b = 3.12;
1st Method
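1. Normalize the two float values by the number of bits required to represent the integer part (QI), using the larger QI of the two operands
if (QIa > QIb)
{
a_norm = a / (1 << QIa); /* divide by 2^QIa */
b_norm = b / (1 << QIa);
}
else
{
a_norm = a / (1 << QIb); /* divide by 2^QIb */
b_norm = b / (1 << QIb);
}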
QIa = 4 bits, QIb = 2 bits (QIa > QIb), so both values are divided by 2^4:
a_norm = 12.36 / (2 ^ 4) = 0.7725;
b_norm = 3.12 / (2 ^ 4) = 0.195;
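2. Convert the normalized values to fixed point (assuming word length WL = 16 bits); the fractional part of each product is truncated
a_fixed = (a_norm * (2^15)) = (0.7725 * 2^15) = 25313;
b_fixed = (b_norm * (2^15)) = (0.195 * 2^15) = 6389;
Now both operands are in the same Q-format, i.e. Q15 for the normalized values. Because of the earlier division by 2^4, this is equivalent to Q11 with respect to the original values.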
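3. Finally, perform the arithmetic addition of the two operands, which are in the same Q-format.
addition_fixed = (25313 + 6389)
= (31702)
Converted back to float, this is 31702 * (2^4) / (2^15) = 15.4795 (approx.), close to the exact sum 15.48.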
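The 1st method can be sketched in C as below. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, qi_max stands for the larger of QIa and QIb, and conversion truncates toward zero as in the worked example.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: normalize by 2^qi_max, then scale the result to Q15.
 * qi_max is the larger integer-bit count of the two operands. */
int16_t float_to_q15_norm(float x, int qi_max)
{
    assert(qi_max >= 0 && qi_max < 15);
    float x_norm = x / (float)(1 << qi_max);   /* step 1: normalize */
    assert(x_norm > -1.0f && x_norm < 1.0f);   /* must fit in Q15 */
    return (int16_t)(x_norm * 32768.0f);       /* step 2: scale by 2^15, truncating */
}

/* Undo both scalings to read a fixed point result back as a float. */
float q15_norm_to_float(int16_t x, int qi_max)
{
    return (float)x * (float)(1 << qi_max) / 32768.0f;
}
```

With a = 12.36, b = 3.12 and qi_max = 4, the two conversions give 25313 and 6389, and their sum 31702 reads back as roughly 15.4795.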
2nd Method
1. Find the maximum QI of the two operands
QIa = 4, QIb = 2, so the maximum QI is 4.
2. Assuming WL is 16 bits,
Q-format = (16 - (4 + 1 sign bit)) = 11 fractional bits, i.e. Q11
a_fixed = (short)(12.36 * 2 ^ 11) = 25313.
b_fixed = (short)(3.12 * 2 ^ 11) = 6389.
3. Finally, perform the arithmetic addition of the two operands, which are in the same Q-format.
addition_fixed = (25313 + 6389)
= (31702)
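The 2nd method can be sketched in C in the same spirit. The helper names int_bits, frac_bits and float_to_fixed are assumptions made for this example, and the conversion again truncates toward zero.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed for the integer part of x (sign excluded). */
int int_bits(float x)
{
    unsigned n = (unsigned)(x < 0.0f ? -x : x);   /* integer part of |x| */
    int bits = 0;
    while (n != 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}

/* Fractional bits left after the integer bits and one sign bit. */
int frac_bits(int word_len, int qi_max)
{
    return word_len - (qi_max + 1);
}

/* Convert a float directly to a 16-bit value with q fractional bits. */
int16_t float_to_fixed(float x, int q)
{
    assert(q >= 0 && q <= 15);
    return (int16_t)(x * (float)(1 << q));        /* truncates toward zero */
}
```

For a = 12.36 and b = 3.12, int_bits returns 4 and 2, frac_bits(16, 4) returns 11, and the two conversions give 25313 and 6389, the same values as in the 1st method.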
Steps to perform fixed point subtraction:
Follow steps 1 and 2 as explained for fixed point addition.
Finally, perform the subtraction on the two fixed point operands, which are in the same Q-format.
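As a small sketch of the subtraction step, assuming both operands are already 16-bit values in the same Q-format, the difference can be computed with saturation to guard against overflow. The names sat_sub16 and fixed_to_float are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract two 16-bit fixed point values in the same Q-format,
 * saturating if the difference leaves the 16-bit range. */
int16_t sat_sub16(int16_t a, int16_t b)
{
    int32_t diff = (int32_t)a - (int32_t)b;   /* widen so it cannot wrap */
    if (diff > INT16_MAX) return INT16_MAX;
    if (diff < INT16_MIN) return INT16_MIN;
    return (int16_t)diff;
}

/* Read a value with q fractional bits back as a float. */
float fixed_to_float(int16_t x, int q)
{
    assert(q >= 0 && q <= 15);
    return (float)x / (float)(1 << q);
}
```

Using the Q11 values from the addition example, sat_sub16(25313, 6389) gives 18924, which reads back as roughly 9.2402 (the exact difference is 9.24).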