A Machine Learning and Data Science Blog: Format for representing fixed point numbers

Fixed point numbers and number notation:

In Fixed-point numbers the imaginary binary point plays a significant role while interpreting numbers. When dealing with numbers the digits to the left of imaginary binary point represents the integer and the digits to the right of binary point represents fractions. Here we study a format for representing fixed point numbers which we will be using here on.

Q notation (QF):

In this notation the total numbers of fractional bits represent the number format. One bit is assumed for Sign (MSB always). The number format representation doesn’t convey any information regarding the word length.
The notation QF means a number with F bits dedicated for the fractional part. If W represents the word length of the processor, then QF number means, F Fractional Bits, W - (F+1) Integer Bits and
1 Sign Bit.
If F is the number of bits required to represent fraction part QF then
Number of bits to represent Integer Bits = W - (F+1)
By default '1' bit is used for sign representation.

For example,

Q15 means 15 Fractional bits, one Sign Bit and zero bits for integer part. Ex:0.435.
Q14 means 14 Fractional bits, one Sign Bit and 1 bit for integer part. Ex: 1.435
Q1 means 1 Fractional bit, one Sign Bit and 14 bits for integer part. Ex: 16383.435

How to select a Q-point format, to represent a float value in fixed point format

For example consider a float value 12.435

Steps involved in selecting Q-format for the above given value:

1. Calculate the number of bits needed to represent integer part (QI) of the given float value (4 bits).
2. Calculate the word length needed to represent the float value in fixed point
a. 1 bit for sign representation
b. 4 bits for integer representation.
c. QF = 10, from equation discussed previous post ceiling (log2 (1/ ε)) for given example , ε = 0.0001
d. WL = QI + QF + S, therefore WL = (4 + 10 + 1) = 15 (bits) to guaranty both range and resolution.
3. So Q-format need to represent above float numbers is Q10 format.

Simple Procedure to identify Q-format for given float point value

Assuming that the word length used by the programmer to represent a float value in fixed point format is WL = 16. (This assumption should be made by programmer depending up on the maximum range of input float value).

Note: Float to fixed conversion comes at cost of precision loss.

Consider above example:

Float value = 12.435

(12)10 -> (1100)2 -> 4 bits are required to represent integer part.
Sign -> 1 bit for sing representation

As we have seen before, number of fractional bits taken by float value is the Q-point format with which the float value is represented in fixed point.

Q-format for the above float value is = (16 – (4 + 1)) i.e. QF = 11

So the above float value can be represented in Q11 format.

A Machine Learning and Data Science Blog

Format for representing fixed point numbers

1 comment:

Related Posts

Twitter Updates

Random Posts

Disclaimer

Recent Comments