Introduction to Floating point arithmetic
Floating point addition:
How does floating point addition work? The steps involved are described below.
Assume both operands, ‘A’ and ‘B’, are in IEEE 754 floating point format and are to be added,
i.e. A = (-1)^S x 2^Ae x 1.Am
S = sign bit,
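Ae = Exponent of operand A, i.e. E – 127, where E is the stored (biased) exponent field (assuming single precision),
Am = Mantissa (the fraction bits that follow the hidden leading 1). Similarly for operand B (S, Be, Bm).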
Therefore A + B = (-1)^Sa x 1.Am x 2^Ae + (-1)^Sb x 1.Bm x 2^Be, where Sa and Sb are the sign bits of A and B.
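To make the three fields concrete, here is a minimal Python sketch that splits a single precision value into S, E and the mantissa bits. The function name decode_float32 is hypothetical and chosen only for illustration; the sketch relies on the standard struct module to obtain the 32-bit pattern.

```python
import struct

def decode_float32(x):
    """Return (S, E, M) for x rounded to IEEE 754 single precision.

    S is the sign bit, E the stored (biased) exponent field and
    M the 23 fraction bits that follow the hidden leading 1.
    """
    # Pack as a big-endian float32, then reread the same 4 bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exp, fraction

s, e, m = decode_float32(6.5)
# 6.5 = 1.625 x 2^2, so S = 0, E = 2 + 127 = 129, M = 0.625 x 2^23
print(s, e - 127, m)  # 0 2 5242880
```

The unbiased exponent Ae from the text is simply E – 127, as computed in the print call.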
Step-by-step procedure for floating point addition:
1. Align the binary points of A and B
a. Compare Ae and Be. Take the larger and compute the exponent
difference (Be – Ae if Be > Ae, otherwise Ae – Be).
2. If Ae > Be, right shift Bm (including its hidden leading 1) by that many positions to form Bm x 2^(Be – Ae).
If Be > Ae, right shift Am (including its hidden leading 1) by that many positions to form Am x 2^(Ae – Be).
3. Compute the sum of the aligned mantissas, taking the signs of the operands into account, i.e.
Bm x 2^(Be – Ae) + Am (or) Am x 2^(Ae – Be) + Bm
4. If normalization of the result is needed, perform the following steps:
a. If the result looks like (0.001001…), left shift the result and
decrease the exponent by one for each shift.
b. If the result looks like (101.01001…), right shift the result and
increase the exponent by one for each shift.
Repeat step a or b until the result has the form 1.xxx…, i.e. the only bit to the left of the binary point (the hidden bit in the IEEE 754 standard) is 1.
5. Check the result exponent.
a. If it is larger than the maximum allowed exponent, return exponent overflow.
b. If it is smaller than the minimum allowed exponent, return exponent underflow.
6. If the mantissa is equal to zero, set the exponent to zero.
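The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a complete IEEE 754 adder: the function name fp_add and its helpers are hypothetical, the inputs must be zero or normal single precision values (subnormals, infinities and NaN are rejected), and bits shifted out during alignment are truncated rather than rounded, so results can differ from hardware in the last bit.

```python
import struct

def _to_bits(x):
    # Bit pattern of x rounded to single precision.
    return struct.unpack(">I", struct.pack(">f", x))[0]

def _from_bits(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def fp_add(a, b):
    """Add two float32 values by following steps 1 to 6 (truncating, no rounding)."""
    abits, bbits = _to_bits(a), _to_bits(b)
    sa, ea, ma = abits >> 31, (abits >> 23) & 0xFF, abits & 0x7FFFFF
    sb, eb, mb = bbits >> 31, (bbits >> 23) & 0xFF, bbits & 0x7FFFFF

    # Assumption: only zero and normal numbers are supported in this sketch.
    for e, m in ((ea, ma), (eb, mb)):
        if e == 255 or (e == 0 and m != 0):
            raise ValueError("subnormal, infinity and NaN are not handled")
    if ea == 0:
        return _from_bits(bbits)
    if eb == 0:
        return _from_bits(abits)

    # Restore the hidden leading 1 so each mantissa is the integer 1.M x 2^23.
    ma |= 1 << 23
    mb |= 1 << 23

    # Steps 1 and 2: right shift the mantissa of the operand with the smaller exponent.
    if ea >= eb:
        mb >>= ea - eb
        exp = ea
    else:
        ma >>= eb - ea
        exp = eb

    # Step 3: signed sum of the aligned mantissas.
    total = (-ma if sa else ma) + (-mb if sb else mb)
    sign = 1 if total < 0 else 0
    mag = abs(total)

    # Step 6: a zero mantissa gives a zero exponent, i.e. the value 0.0.
    if mag == 0:
        return 0.0

    # Step 4: normalize until the hidden bit (bit 23) is the leading 1.
    while mag >= (1 << 24):   # looks like 10.xxx or 11.xxx
        mag >>= 1
        exp += 1
    while mag < (1 << 23):    # looks like 0.0xxx
        mag <<= 1
        exp -= 1

    # Step 5: check the biased exponent against the range for normal numbers.
    if exp >= 255:
        raise OverflowError("exponent overflow")
    if exp <= 0:
        raise ArithmeticError("exponent underflow")

    return _from_bits((sign << 31) | (exp << 23) | (mag & 0x7FFFFF))

print(fp_add(1.5, 2.25))   # 3.75
print(fp_add(5.0, -3.0))   # 2.0
```

In the second call the aligned mantissas nearly cancel, so the result needs one left shift (step 4a) before it is packed. Working on the biased exponent throughout avoids converting back and forth with the bias of 127.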