A Machine Learning and Data Science Blog: July 2009

Introduction to Fixed Point representation

Tutorials on Fixed Point representation & arithmetic:

Index

What is a fixed point number?
Why fixed point representation?
Fixed point representation, Q-formats.
Finally, the most important ones

How to select a Q-format for a given floating point code.
How to convert floating point to fixed point numbers
Example source code, explaining floating to fixed point conversion
Notes on overflow, underflow and saturation conditions.

What is Fixed-Point?
In simple words, fixed-point means representing real numbers using the integer data type and integer hardware. Fixed-point is an alternative to floating-point, and is a well known method in the Embedded and Digital Signal Processing field (Audio, Video, Image codecs etc).

Why Fixed-point
Floating point offers a great range of values and more precision, but it is also significantly more expensive when compared to integer or fixed point math.

With fixed point representation Embedded systems and Digital Signal processing designers can achieve greater speed and reduce the hardware design cost.

Note: The designer needs to decide how much quantization/truncation error can be tolerated in the fixed point implementation.
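To make the idea concrete, here is a minimal sketch of a floating to fixed point conversion. It assumes a Q15 format (15 fractional bits in a 16-bit signed word) and saturation on overflow; the function names and the choice of Q15 are illustrative assumptions, not something taken from the posts above.

```python
def float_to_fixed(x, frac_bits=15, word_bits=16):
    """Convert a real number to a signed fixed point integer (Q15 by default)."""
    scaled = round(x * (1 << frac_bits))      # quantize to the nearest step
    lo = -(1 << (word_bits - 1))              # most negative representable value
    hi = (1 << (word_bits - 1)) - 1           # most positive representable value
    return max(lo, min(hi, scaled))           # saturate instead of wrapping around


def fixed_to_float(q, frac_bits=15):
    """Convert a fixed point integer back to a real number."""
    return q / (1 << frac_bits)


q = float_to_fixed(0.1)
print(q, fixed_to_float(q), fixed_to_float(q) - 0.1)  # last value is the quantization error
```

With rounding, the quantization error stays within half a step (2^-16 for Q15), which is the tolerance the designer has to accept or reject.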

Floating point arithmetic - II

Floating point subtraction:
How does it work? Here are the steps involved in floating point subtraction.
Assume both the operands are in IEEE 754 floating point format, and we are performing floating point subtraction between ‘A’ and ‘B’.
Single Precision, floating point representation. A = (-1)^ S x 2^Ae x Am
S = sign bit,
Ae = Exponent of operand A i.e. E – 127 (assuming Single precision),
Am = Mantissa. Similarly for operand B (S, Be, Bm).

Therefore A - B = (Am x 2^Ae - Bm x 2^Be).

Step by step procedure explaining floating point subtraction:
1. Align the binary points of A and B
a. Compare Ae and Be. Take the larger one and compute the exponent difference (Be – Ae if Be > Ae, otherwise Ae – Be).
2. If Ae > Be, right shift Bm by that many positions to form Bm x 2^(Be – Ae). (or) If Be > Ae, right shift Am by that many positions to form Am x 2^(Ae – Be).
3. Compute the difference of the aligned mantissas, i.e.
Am - Bm x 2^(Be – Ae) (or) Am x 2^(Ae – Be) - Bm
4. If normalization of the result is needed, the steps to perform are
a. If the result looks like (0.001001…) then left shift the result and reduce the exponent.
b. If the result looks like (101.01001…) then right shift the result and increase the exponent.
Continue the above steps (a or b) until the MSB (the hidden bit in the IEEE 754 standard) is 1.
5. Check the result exponent.
a. If it is larger than the maximum exponent allowed, return exponent overflow.
b. If it is smaller than the minimum exponent allowed, return exponent underflow.
6. If the mantissa is equal to zero, set the exponent to zero.
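The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are made up for this sketch, bits shifted out during alignment are simply truncated (real hardware uses guard bits and rounding), and a result too small to normalize is reported as underflow instead of being returned as a denormalized number.

```python
import math
import struct


def unpack_single(x):
    """Return (sign, unbiased exponent, 24-bit integer mantissa) of x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign, e, m = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    if e == 0:                                # zero or denormalized: 0.M x 2^-126
        return sign, -126, m
    return sign, e - 127, m | 0x800000        # normalized: restore the hidden 1


def fp_sub(a, b):
    """Compute a - b by following the align / subtract / normalize steps."""
    a_s, ae, am = unpack_single(a)
    b_s, be, bm = unpack_single(b)
    # Steps 1-2: align binary points by right shifting the smaller operand
    if ae >= be:
        bm >>= ae - be
        e = ae
    else:
        am >>= be - ae
        e = be
    # Step 3: difference of the aligned (signed) mantissas
    diff = (-am if a_s else am) - (-bm if b_s else bm)
    # Step 6: a zero mantissa gives a zero result
    if diff == 0:
        return 0.0
    sign, mag = (1, -diff) if diff < 0 else (0, diff)
    # Step 4: normalize until the hidden bit (bit 23) is the MSB
    while mag >= 1 << 24:                     # looks like 10.xxx: shift right
        mag >>= 1
        e += 1
    while mag < 1 << 23:                      # looks like 0.0xxx: shift left
        mag <<= 1
        e -= 1
    # Step 5: check the result exponent
    if e > 127:
        raise OverflowError("exponent overflow")
    if e < -126:
        raise ArithmeticError("exponent underflow")
    value = math.ldexp(mag, e - 23)           # mag x 2^(e - 23)
    return -value if sign else value


print(fp_sub(5.75, 1.5))   # operands chosen so that no bits are lost in alignment
```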

Floating point multiplication:
How does it work? Here are the steps involved in floating point multiplication.
Assume both the operands are in IEEE 754 floating point format, and we are performing floating point multiplication between ‘A’ and ‘B’.

Single Precision, floating point representation. A = (-1)^ (AS) x 2^Ae x Am
Step by step procedure explaining floating point multiplication
1. If any of the operands is equal to zero, return the result as zero.
2. Compute the sign: AS XOR BS
3. Multiply the mantissas: Am x Bm, and round the product to the allowed number of mantissa bits.
4. Compute the exponent of the result.
a. Result exponent = biased exponent A + biased exponent B – bias;
5. Normalize the result: shift the mantissa and increment the result exponent if needed.
6. Check the result exponent:
a. If it is larger than the maximum exponent allowed then return overflow.
b. If it is smaller than the minimum exponent allowed then return underflow.
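A minimal sketch of these multiplication steps, working directly on the single precision bit fields. The function names are hypothetical, both inputs are assumed to be normalized (or zero), and the product is truncated to 23 mantissa bits where a real implementation would round.

```python
import struct

BIAS = 127


def fields(x):
    """Return (sign, biased exponent, 23-bit mantissa field) of x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fp_mul(a, b):
    # Step 1: zero operand gives zero (the sign of zero is ignored in this sketch)
    if a == 0.0 or b == 0.0:
        return 0.0
    a_s, a_e, a_m = fields(a)
    b_s, b_e, b_m = fields(b)
    # Step 2: sign of the result
    sign = a_s ^ b_s
    # Step 3: multiply the 24-bit mantissas (hidden bit restored), 48-bit product
    prod = (a_m | 0x800000) * (b_m | 0x800000)
    # Step 4: biased exponents are added and one bias is removed
    exp = a_e + b_e - BIAS
    # Step 5: the product is in [1, 4); renormalize to [1, 2)
    if prod & (1 << 47):
        prod >>= 24
        exp += 1
    else:
        prod >>= 23
    # Step 6: check the result exponent
    if exp >= 255:
        raise OverflowError("exponent overflow")
    if exp <= 0:
        raise ArithmeticError("exponent underflow")
    bits = (sign << 31) | (exp << 23) | (prod & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(fp_mul(-1.5, 1.5))
```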

Denormalized floating point numbers

What are denormalized floating point numbers? How are they represented?

If you have noticed, from the previous discussion on floating point representation, there are a few serious concerns in the IEEE 754 representation itself.

(-1)^S x 2^(E-127) x 1.M

IEEE 754 Single precision floating point representation

How do we represent 0.0 in the IEEE 754 floating point representation? It is not possible to represent zero as the product of a power of two and a mantissa greater than or equal to one.

So how do we represent 0.0 then?

Here is the explanation: In the IEEE representation, an all-zero exponent field ‘E’ is reserved for zero itself (E = 0 and M = 0) and for numbers very close to zero, i.e. smaller than 2^-126 in the SP floating point representation. 2^-126 is the least positive real number that can be represented in the normalized part of the system, as discussed in earlier posts,

i.e. 0|00000001|00000000000000000000000
S|-----E-----|---------------M---------------|

Numbers of this kind (closer to zero than 2^-126) are represented in a slightly different way.

The exponent is always kept equal to -126, and the mantissa is greater than or equal to zero and less than one (i.e. 0.M instead of 1.M).

Here is an example of how to represent a number very close to zero:

Consider => 5 x 2^-129

The mantissa used to represent the above number is derived as follows:

=> [5 / (2^3)] x 2^-126

=> [0.625 x 2^-126]

=> [(1/2 + 1/8) x 2^-126] = (0.101) x 2^-126

So the representation of 5 x 2^-129 is as shown below

0|00000000|10100000000000000000000|
S|--E-8bit---|0.M-----------23bit-----------|

Numbers whose mantissa is less than one (0.M) are said to be denormalized numbers.

Denormalized numbers are stored less accurately than normalized numbers.
So, the least positive real number that can be represented is 2^-149 as shown below.

For Single precision (denormalized)

(-1)^S x 2^-126 x 0.M

Substitute S = 0, E = 0 (the exponent is fixed at -126), and the smallest non-zero M (23 bit), i.e. 0.M = 2^-23

So the least positive real number = 2^-(126 + 23) = 2^-149

i.e. 0|00000000|00000000000000000000001
S|---E-8bit--|0.M----------23bit-------------|
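As a quick check of the denormalized rule, the sketch below decodes a 32-bit pattern by hand (0.M x 2^-126 when E is all zeros, 1.M x 2^(E-127) otherwise) and compares it with what Python's `struct` module gives for the same bits. The function names are illustrative, and the reserved all-ones exponent (infinity/NaN) is deliberately not handled.

```python
import struct


def bits_to_float(bits):
    """Let the machine interpret a 32-bit pattern as a single precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def decode(bits):
    """Decode a single precision pattern by hand, including denormalized numbers."""
    sign = bits >> 31
    e = (bits >> 23) & 0xFF
    m = bits & 0x7FFFFF
    if e == 0:
        value = (m / 2**23) * 2.0**-126            # denormalized: 0.M x 2^-126
    else:
        value = (1 + m / 2**23) * 2.0**(e - 127)   # normalized: 1.M x 2^(E-127)
    return -value if sign else value


# E = 00000000, M = 101000...0  ->  0.101 x 2^-126 = 5 x 2^-129
print(decode(0x00500000) == 5 * 2.0**-129)
# E = 00000000, M = 000...001   ->  least positive number, 2^-149
print(decode(0x00000001) == 2.0**-149)
```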

Topics to come: Introduction to Floating point arithmetic (addition, multiplication, division)


Floating point representation - II

Representation of floating point numbers:

In the IEEE single-precision representation of a real number, one bit is used to represent the sign, and it is set to 0 for a positive number and 1 for a negative one. A representation of the exponent is stored in the next 8 bits, and the remaining twenty-three bits are occupied by a representation of the mantissa of the number.

Here are some examples of how to represent real numbers in floating point format:
1. Representing 23/4 in single precision floating point number.
=> 23/4 = 5.75
Converting above real number to binary form
=> 101.11 (5 in binary 101, .(2^-1 + 2^-2) = .75)

Representing above binary to SP floating point format (32bit)
[(-1)^S x 2^(E – 127) x 1.M]
=> 1.0111 x 2^2, relating this to the equation given above

(The numeric ‘1’ before the binary point is called the hidden bit, as it is implied by the representation and not stored.)

Sign S = 0; No. of bits used to represent sign = 1
Exponent (E – 127) = 2 i.e. E = 129; No. of bits used to represent exponent = 8
Mantissa M = 0111000…. ; No. of bits used to represent Mantissa = 23

Finally, 5.75 in SP floating point representation is as shown below: 0|10000001|01110000000000000000000

Note: What if the fraction part of a real number cannot be expressed as a sum of powers of two (as .75 = 1/2 + 1/4 can in the above example)? For example, 7/5 is exactly 1.4, but .4 cannot be expressed as a finite sum of powers of two, so 7/5 has an infinite binary expansion 1.011001100110011001100…
In a single precision representation, the expansion is rounded off at the twenty-third digit after the binary point.
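The two cases above (5.75, which is exact, and 1.4, which has to be rounded) can be inspected with a short sketch that uses Python's standard `struct` module to obtain the single precision bit pattern. The helper name is just for illustration.

```python
import struct


def single_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x stored as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = format(bits, "032b")          # 32 characters, MSB first
    return s[0], s[1:9], s[9:]


print("|".join(single_fields(5.75)))  # exact: 101.11 = 1.0111 x 2^2
print("|".join(single_fields(1.4)))   # rounded after 23 mantissa bits
```

For 1.4 the stored mantissa is the repeating 0110 pattern cut at 23 bits, so the stored value is only close to 7/5, not equal to it.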

2. Extracting the real number from an SP floating point number representation
11000100001001100000000000000000

1|10001000|01001100000000000000000
S|-----E----|-------------M---------------|

Sign = 1 i.e. (-1)^1 = -1, a negative number
Exponent (10001000) = 127 + e, 136 = 127 + e i.e. exponent e = 9;
Mantissa = 1.01001100000000000000000

i.e. Mantissa = (one, plus no halves, plus one quarter (1/4), plus no eighths, plus no sixteenths, plus one thirty-second, plus one sixty-fourth, … all remaining bits zero)

=> (1 + 1/4 + 1/32 + 1/64) = X
=> (64 + 16 + 2 + 1) = X x 64;
=> X = 83/64;

So the complete number = -(83/64) x 2^9 = -664.00;
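The same extraction can be sketched in Python with exact fractions, so that the intermediate value 83/64 is visible. The function name is hypothetical, and the sketch assumes a normalized number (exponent field neither all zeros nor all ones); it uses the bit pattern with fields 1|10001000|01001100000000000000000.

```python
from fractions import Fraction


def decode_fields(bit_string):
    """Split a 32-character bit string into sign, exponent and mantissa, and evaluate it."""
    sign = int(bit_string[0])
    biased = int(bit_string[1:9], 2)                 # 8-bit exponent field E
    frac = int(bit_string[9:], 2)                    # 23-bit mantissa field M
    significand = 1 + Fraction(frac, 2**23)          # 1.M with the hidden bit
    value = (-1) ** sign * significand * Fraction(2) ** (biased - 127)
    return sign, biased - 127, significand, value


sign, exponent, significand, value = decode_fields("11000100001001100000000000000000")
print(sign, exponent, significand, value)
```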
Introduction to Floating point representation IEEE 754

FLOATING POINT REPRESENTATION

An Introduction to IEEE 754 floating point number representation:
IEEE 754 Floating point is the most common representation used to store real numbers in a computer.
Index
1. What are floating point numbers
2. How to represent a floating point number.
a. Single and Double precisions
3. Data ranges

What are floating point numbers:

The term ‘floating point’ refers to the decimal point (or) binary point in a real number.
The point can be placed anywhere within the group of digits, i.e. it can "float"; numbers written this way are called floating point numbers.

Ex: 28.36, 0.00124

Floating point number representation IEEE 754 standard:

A floating point number contains three components
1. Sign bit - Represents the sign of the floating point number.
2. Exponent - Represents the power to which the base is raised (explained in a later part) (Ex: the 6 in 1.086 * 10^6)
3. Mantissa - Represents the precision bits of the number. (Ex: the 1.086 in 1.086 * 10^6)

Single precision floating point representation (32-bit):

Fig 1: SP floating point representation

S = 1 bit; E = 8 bit - Min = -126, Max = 127, Bias = 127; M = 23 bits.

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
S  ------ E-8bit------ -----------------------------M-23bit-------------------

Double precision floating point representation (64-bit):

Fig 2: DP floating point representation

S = 1 bit; E = 11 bit - Min = -1022, Max = 1023, Bias = 1023; M = 52 bits.
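The data ranges follow directly from the field widths. As a rough sketch (the helper names are assumptions made for this example), the largest normalized value is (2 - 2^-M) x 2^Emax and the smallest positive normalized value is 2^Emin, which can be cross-checked against the single precision bit pattern 0x7F7FFFFF and against Python's own double precision limits in `sys.float_info`.

```python
import struct
import sys


def largest_normal(exp_max, frac_bits):
    """Largest normalized value: mantissa 1.111...1 with the maximum exponent."""
    return (2.0 - 2.0**-frac_bits) * 2.0**exp_max


def smallest_normal(exp_min):
    """Smallest positive normalized value: mantissa 1.000...0 with the minimum exponent."""
    return 2.0**exp_min


# Single precision: 8-bit exponent (-126..127), 23-bit mantissa
single_max = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))[0]
print(largest_normal(127, 23) == single_max, smallest_normal(-126))

# Double precision: 11-bit exponent (-1022..1023), 52-bit mantissa
print(largest_normal(1023, 52) == sys.float_info.max,
      smallest_normal(-1022) == sys.float_info.min)
```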

Normalized floating point representation:

Components of a floating point representation

In general, to maximize the number of representable values, FP numbers are typically stored in ‘normalized’ form. This puts the radix point right after the first non-zero digit. Fig 1 & 2 are normalized representations of floating point (FP) numbers.

As a further optimization, the number is represented in base 2: since the only possible non-zero leading digit is 1, it does not need to be stored and is implied instead.

Denormalized floating point representation:

If the exponent is all zeros and the fraction part is non-zero, then the value is a ‘denormalized’ number, which does not have an assumed leading bit of 1 before the binary point.