Representation of floating point numbers:
In the IEEE single-precision representation of a real number, one bit is used to represent the sign; it is set to 0 for a positive number and 1 for a negative one. A representation of the exponent is stored in the next 8 bits, and the remaining twenty-three bits are occupied by a representation of the mantissa of the number.
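As a rough illustration of this 1 + 8 + 23 bit layout, the sketch below pulls the three fields out of a Python float after storing it in single precision. The function name `split_fields` is made up for this example; `struct.pack` and `int.from_bytes` are standard library calls.

```python
import struct

def split_fields(x):
    """Return (sign, exponent_field, mantissa_field) of x stored as IEEE single precision."""
    # Pack as a big-endian 32-bit float, then reinterpret the 4 bytes as an integer.
    word = int.from_bytes(struct.pack(">f", x), "big")
    sign = word >> 31                # 1 bit
    exponent = (word >> 23) & 0xFF   # next 8 bits (stored with a bias of 127)
    mantissa = word & 0x7FFFFF       # remaining 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = split_fields(-2.5)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
```

Here -2.5 is -1.01 x 2^1 in binary, so the sign bit is 1, the exponent field is 128 and only the second mantissa bit is set.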
How to represent real numbers in floating point format:
Examples:
1. Representing 23/4 as a single-precision floating point number.
=> 23/4 = 5.75
Converting the above real number to binary form
=> 101.11 (5 in binary is 101; .75 = 2^-1 + 2^-2, which is .11 in binary)
Representing the above binary number in SP floating point format (32 bits), which has the form
[(-1)^S x 2^(E - 127) x 1.M]
=> 1.0111 x 2^2; relating this to the equation given above:
(The '1' before the binary point is called the hidden bit: it is implied by the representation and is not stored.)
Sign S = 0; No. of bits used to represent sign = 1
Exponent (E – 127) = 2 i.e. E = 129; No. of bits used to represent exponent = 8
Mantissa M = 0111000…. ; No. of bits used to represent Mantissa = 23
Finally, 5.75 in SP floating point representation is as shown below:
0|10000001|01110000000000000000000
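The steps of this example (normalise to 1.M x 2^e, add 127 to the exponent, write out 23 mantissa bits) can be sketched in Python as follows. This is only a minimal sketch: the name `encode_sp` is hypothetical, it handles normal nonzero values only, and it truncates any bits beyond the 23rd instead of rounding, so it is exact only for values such as 5.75 that fit.

```python
from fractions import Fraction

def encode_sp(value):
    """Encode a nonzero rational as 'S|EEEEEEEE|MMM...' (normal range only, extra bits truncated)."""
    value = Fraction(value)
    if value == 0:
        raise ValueError("zero has a special encoding not covered by this sketch")
    sign = 1 if value < 0 else 0
    m = abs(value)
    e = 0
    # Normalise to 1.xxx * 2^e by repeated halving or doubling.
    while m >= 2:
        m /= 2
        e += 1
    while m < 1:
        m *= 2
        e -= 1
    biased = e + 127                 # E = e + 127
    if not 1 <= biased <= 254:
        raise ValueError("outside the normal single-precision range")
    frac = m - 1                     # drop the hidden bit
    bits = []
    for _ in range(23):
        # Doubling shifts the next binary digit in front of the point.
        frac *= 2
        bit = int(frac >= 1)
        bits.append(str(bit))
        frac -= bit
    return f"{sign}|{biased:08b}|{''.join(bits)}"

print(encode_sp(Fraction(23, 4)))   # 0|10000001|01110000000000000000000
```

Using `Fraction` keeps every step exact, so the printed fields can be compared digit by digit with the hand calculation.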
Note: What if the fractional part of a real number cannot be expressed as a sum of powers of two (as .75 = 1/2 + 1/4 could in the above example)? For example, 7/5 is exactly 1.4, but .4 cannot be expressed as a finite sum of powers of two, so 7/5 has an infinite binary expansion: 1.011001100110011001100...
In a single precision representation, the expansion is rounded off at the twenty-third digit after the binary point.
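To see the effect of this rounding on 7/5, the following sketch stores 1.4 in single precision and measures the exact difference from 7/5. It relies only on the standard `struct` and `fractions` modules; the helper name `to_single` is an assumption made for this example.

```python
import struct
from fractions import Fraction

def to_single(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

stored = to_single(1.4)
bits = int.from_bytes(struct.pack(">f", 1.4), "big")

# Fraction(stored) is the exact value held in the 32-bit word.
error = Fraction(7, 5) - Fraction(stored)

print(format(bits, "032b"))   # sign, exponent and the 23 rounded mantissa bits
print(stored)                 # close to, but not exactly, 1.4
print(error)                  # exact rounding error as a fraction
```

The error is positive here: the first dropped digits are 0011..., so the expansion is rounded down and the stored value is slightly smaller than 7/5.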
2. Extracting the real number from an SP floating point representation
11000100001001100000000000000000
1|10001000|01001100000000000000000
S|-----E----|-------------M---------------|
Sign = 1, i.e. (-1)^1 = -1, a negative number
Exponent field (10001000) = 136 = 127 + e, i.e. exponent e = 9;
Mantissa = 1.01001100000000000000000
i.e. Mantissa = (one, plus no halves, plus one quarter (1/4), plus no eighths, plus no sixteenths, plus one thirty-second, plus one sixty-fourth, ... all remaining bits zero)
=> (1 + 1/4 + 1/32 + 1/64) = X
=> (64 + 16 + 2 + 1) = X x 64;
=> X = 83/64;
So the complete number = -(83/64) x 2^9 = -(83/64) x 512 = -664.00;
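The same decoding can be checked mechanically. The sketch below assumes the fields used in this example (sign 1, exponent field 10001000, mantissa 01001100000000000000000) and applies (-1)^S x 2^(E - 127) x 1.M with exact fractions. The name `decode_sp` is hypothetical, and zero, subnormals, infinities and NaN are deliberately left out.

```python
import struct
from fractions import Fraction

def decode_sp(word):
    """Decode a 32-character bit string as a normal IEEE single-precision number."""
    if len(word) != 32 or set(word) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = int(word[0])
    exponent = int(word[1:9], 2)
    mantissa = int(word[9:], 2)
    if exponent in (0, 255):
        raise ValueError("zero, subnormals, infinities and NaN are not handled here")
    significand = 1 + Fraction(mantissa, 2**23)   # restore the hidden bit
    return (-1) ** sign * significand * Fraction(2) ** (exponent - 127)

# Sign, exponent field and mantissa from the worked example.
word = "1" + "10001000" + "01001100000000000000000"
value = decode_sp(word)

# Cross-check by letting struct reinterpret the same 32 bits as a float.
check = struct.unpack(">f", int(word, 2).to_bytes(4, "big"))[0]
print(value, check)   # -664 -664.0
```

Both routes agree with the hand calculation, since 83/64 x 512 = 664.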