Floating point representation - II

Representation of floating point numbers:

In the IEEE single-precision representation of a real number, one bit is used to represent the sign; it is set to 0 for a positive number and 1 for a negative one. A representation of the exponent is stored in the next 8 bits, and the remaining twenty-three bits are occupied by a representation of the mantissa of the number.
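The 1 + 8 + 23 bit layout can be inspected directly. The following is a minimal sketch, assuming Python's standard `struct` module is used to reinterpret the bytes of a 32-bit float; the helper name `split_fields` is made up for illustration.

```python
import struct

def split_fields(x):
    """Return (sign, biased_exponent, mantissa) of x stored as an IEEE single."""
    # Pack x as a big-endian 32-bit float, then read the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (still biased by 127)
    mantissa = bits & 0x7FFFFF       # low 23 bits (hidden bit not included)
    return sign, exponent, mantissa

sign, exponent, mantissa = split_fields(-2.0)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# prints: 1 10000000 00000000000000000000000
```

Here -2.0 is -1.0 x 2^1, so the stored exponent is 1 + 127 = 128 and the mantissa field is all zeros.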

Here are some examples of how to represent real numbers in floating point format:

1. Representing 23/4 as a single precision floating point number.
=> 23/4 = 5.75
Converting the above real number to binary form
=> 101.11 (5 is 101 in binary; .75 = 2^-1 + 2^-2, i.e. .11)

Representing the above binary number in SP floating point format (32 bits), which has the form
[(-1)^S x 2^(E - 127) x 1.M]
=> 101.11 = 1.0111 x 2^2, which we now match against the equation above.

(The '1' before the binary point is called the hidden bit: it is implied by the format and is not stored.)

Sign S = 0; No. of bits used to represent sign = 1
Exponent (E - 127) = 2 i.e. E = 129 = 10000001 in binary; No. of bits used to represent exponent = 8
Mantissa M = 0111000...; No. of bits used to represent mantissa = 23

Finally, 5.75 in SP floating point representation is as shown below:
0|10000001|01110000000000000000000
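The hand conversion above can be sketched in code. This is only an illustration, assuming the value is nonzero, fits in the normal exponent range, and needs no more than 23 mantissa bits; the function name `encode_single` is hypothetical, and `struct` is used only to cross-check the result.

```python
import struct
from fractions import Fraction

def encode_single(value):
    """Encode an exactly representable, nonzero number as S|E|M by hand."""
    value = Fraction(value)
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    # Normalise to 1.M x 2^e with 1 <= 1.M < 2.
    e = 0
    while magnitude >= 2:
        magnitude /= 2
        e += 1
    while magnitude < 1:
        magnitude *= 2
        e -= 1
    biased = e + 127                       # stored exponent E
    mantissa = (magnitude - 1) * 2**23     # drop the hidden bit, scale to 23 bits
    assert mantissa.denominator == 1, "value needs more than 23 mantissa bits"
    return f"{sign}|{biased:08b}|{int(mantissa):023b}"

encoded = encode_single(Fraction(23, 4))
print(encoded)  # prints: 0|10000001|01110000000000000000000

# Cross-check against the bits produced when 5.75 is packed as a 32-bit float.
(bits,) = struct.unpack(">I", struct.pack(">f", 5.75))
print(format(bits, "032b") == encoded.replace("|", ""))  # prints: True
```

Using `Fraction` keeps the normalisation exact, so the sketch fails loudly instead of silently rounding when a value does not fit in 23 bits.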

Note:
What if the fractional part of a real number cannot be expressed as a finite sum of powers of two (unlike the example above, where .75 = 1/2 + 1/4)? For example, 7/5 is exactly 1.4, but .4 cannot be written as a finite sum of powers of two, so 7/5 has an infinite binary expansion: 1.011001100110011001100...
In a single precision representation, the expansion is rounded at the twenty-third digit after the binary point.
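The effect of this rounding on 7/5 can be observed with a short sketch, again assuming Python's `struct` module to obtain the single-precision bits; `Fraction` is used so that the difference between the stored value and 7/5 is computed exactly.

```python
import struct
from fractions import Fraction

# Round 1.4 to single precision and read back both the bits and the value.
packed = struct.pack(">f", 1.4)
(bits,) = struct.unpack(">I", packed)
(stored_value,) = struct.unpack(">f", packed)

mantissa_bits = format(bits & 0x7FFFFF, "023b")
error = Fraction(stored_value) - Fraction(7, 5)   # exact rounding error

print(mantissa_bits)  # prints: 01100110011001100110011
print(float(error))   # small negative number: the stored value is slightly below 1.4
```

The 23 stored digits are the first 23 digits of the infinite expansion; the next digit is 0, so the value is rounded down.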

2. Extracting the real number from an SP floating point representation
11000100001001100000000000000000

1|10001000|01001100000000000000000
S|-----E----|-------------M---------------|

Sign = 1 i.e. (-1)^1 = -1, a negative number
Exponent: E = 10001000 = 136 = 127 + e, so the actual exponent e = 9
Mantissa = 1.01001100000000000000000

i.e. Mantissa = one, plus no halves, plus one quarter (1/4), plus no eighths, plus no sixteenths, plus one thirty-second, plus one sixty-fourth; all remaining bits are zero

=> (1 + 1/4 + 1/32 + 1/64) = X
=> (64 + 16 + 2 + 1) = X x 64;
=> X = 83/64;

So the complete number = -(83/64) x 2^9 = -(83/64) x 512 = -664.00
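The decoding steps can be sketched as follows, using the fields from this example (sign 1, exponent 10001000, mantissa 01001100000000000000000). The function name `decode_single` is made up for illustration, and the sketch assumes a normal number (no zeros, denormals, infinities or NaNs).

```python
import struct
from fractions import Fraction

def decode_single(bit_string):
    """Apply (-1)^S x 2^(E - 127) x 1.M to a 32-bit pattern written as S|E|M."""
    bits = bit_string.replace("|", "")
    assert len(bits) == 32, "expected exactly 32 bits"
    sign = int(bits[0], 2)
    biased = int(bits[1:9], 2)                     # stored exponent E
    mantissa = int(bits[9:], 2)                    # 23 stored bits
    significand = 1 + Fraction(mantissa, 2**23)    # restore the hidden bit
    return (-1) ** sign * significand * Fraction(2) ** (biased - 127)

pattern = "1|10001000|01001100000000000000000"
print(decode_single(pattern))  # prints: -664

# Cross-check: let struct interpret the same 32 bits as a float.
raw = int(pattern.replace("|", ""), 2).to_bytes(4, "big")
print(struct.unpack(">f", raw)[0])  # prints: -664.0
```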