Introduction to IEEE 754 floating point representation

Post by Tony » Thu Sep 24, 2009 11:40 pm

IEEE 754 floating point is the most common representation used to store real numbers in a computer.
Index
1. What are floating point numbers
2. How to represent a floating point number
a. Single and double precision
3. Data ranges

What are floating point numbers:
The term ‘floating point’ refers to the decimal point (or binary point) in a real number, which can be placed anywhere relative to the significant digits.
A number written with a decimal point placed among a group of digits is called a floating point number.

Ex: 28.36, 0.00124

Floating point number representation in the IEEE 754 standard:
A floating point number contains three components
1. Sign bit - Represents the sign of the floating point number.
2. Exponent - Represents the power to which the base is raised (explained in a later part) (Ex: the 6 in 1.086 * 10^6)
3. Mantissa - Represents the precision bits of the number. (Ex: the 1.086 in 1.086 * 10^6)

Single precision floating point representation (32-bit):
Fig 1: SP floating point representation
S = 1 bit; E = 8 bits - Min = -126, Max = 127, Bias = 127; M = 23 bits.

S | ------ E-8bit------||-----------------------------M-23bit-------------------|
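To see this layout on a real value, here is a minimal sketch in Python using the standard `struct` module. The helper name `sp_fields` is made up for illustration; it packs a number as a 32-bit float and splits the bits into the three fields.

```python
import struct

def sp_fields(x):
    """Return (sign, exponent field, mantissa field) of x stored as a 32-bit float."""
    # Pack as big-endian single precision, then reread the same 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF       # 23 bits, hidden bit not included
    return sign, exponent, mantissa

s, e, m = sp_fields(1.0)
print(f"{s}|{e:08b}|{m:023b}")  # 0|01111111|00000000000000000000000
```

For 1.0 the exponent field is 127, i.e. the bias itself, because the true exponent is 0.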

Double precision floating point representation (64-bit):
Fig 2: DP floating point representation
S = 1 bit; E = 11 bits - Min = -1022, Max = 1023, Bias = 1023; M = 52 bits.
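The double precision fields can be pulled apart the same way. This is only a sketch, assuming the standard IEEE 754 split of 1 sign bit, 11 exponent bits and 52 mantissa bits; `dp_fields` is a hypothetical helper name.

```python
import struct

def dp_fields(x):
    """Return (sign, exponent field, mantissa field) of x stored as a 64-bit double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, stored with bias 1023
    mantissa = bits & ((1 << 52) - 1)   # 52 bits, hidden bit not included
    return sign, exponent, mantissa

print(dp_fields(1.0))  # (0, 1023, 0)
```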

Normalized floating point representation:
Components of a floating point representation
In general, to maximize the number of representable values, FP numbers are stored in ‘normalized’ form. This puts the radix point after the first non-zero digit. Fig 1 & 2 are normalized representations of floating point (FP) numbers.
In base 2 the only possible non-zero digit is 1, so this leading digit does not need to be stored; it is implied (the hidden bit), which leaves one more bit for the mantissa.

Denormalized floating point representation:
If the exponent is all zeros and the fraction part is non-zero, then the value is a ‘denormalized’ number, which does not have an assumed leading bit of 1 before the binary point.
Ranges of floating point numbers:
Fig: ranges of floating point numbers (fp3.JPG)
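The range limits can also be computed directly from the field widths. The sketch below is my own derivation from the standard bias and mantissa sizes, not a copy of the attached table; the dictionary labels and the helper `sp_from_bits` are made up for illustration.

```python
import math
import struct
import sys

def sp_from_bits(bits):
    """Interpret a 32-bit integer as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

ranges = {
    # Single precision: 8-bit exponent, 23-bit mantissa
    "SP least denormalized": math.ldexp(1.0, -149),
    "SP least normalized": math.ldexp(1.0, -126),
    "SP largest": math.ldexp(2.0 - math.ldexp(1.0, -23), 127),
    # Double precision: 11-bit exponent, 52-bit mantissa
    "DP least denormalized": math.ldexp(1.0, -1074),
    "DP least normalized": math.ldexp(1.0, -1022),
    "DP largest": math.ldexp(2.0 - math.ldexp(1.0, -52), 1023),
}

for name, value in ranges.items():
    print(f"{name:24s} {value:.6e}")

# Cross-check against the platform's own double precision limits.
print(ranges["DP largest"] == sys.float_info.max)  # True
```

The same limits apply to negative numbers with the sign bit set.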

Representation of floating point numbers:

In the IEEE single precision representation of a real number, one bit is used to represent the sign; it is set to 0 for a positive number and 1 for a negative one. A representation of the exponent is stored in the next 8 bits, and the remaining twenty-three bits are occupied by a representation of the mantissa of the number.

How to represent real numbers in floating point format:

Examples:
1. Representing 23/4 as a single precision floating point number.
=> 23/4 = 5.75
Converting the above real number to binary form
=> 101.11 (5 in binary is 101, and .75 = 2^-1 + 2^-2 is .11)

Representing the above binary number in SP floating point format (32-bit)
[(-1)^S x 2^(E – 127) x 1.M]
=> 101.11 = 1.0111 x 2^2, relating this to the equation given above

(The numeric ‘1’ before the binary point is called the hidden bit, as it is implied by the representation and not stored).

Sign S = 0; No. of bits used to represent sign = 1
Exponent (E – 127) = 2 i.e. E = 129; No. of bits used to represent exponent = 8
Mantissa M = 0111000…. ; No. of bits used to represent mantissa = 23

Finally, 5.75 in SP floating point representation is as shown below: 0|10000001|01110000000000000000000
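The result for 5.75 can be double-checked by letting Python do the packing. A small sketch using only the standard `struct` module; the variable names are mine.

```python
import struct

# Pack 5.75 as a big-endian 32-bit float and reread the bytes as an integer.
(bits,) = struct.unpack(">I", struct.pack(">f", 5.75))
pattern = f"{bits:032b}"

# Split into sign | exponent | mantissa.
layout = f"{pattern[0]}|{pattern[1:9]}|{pattern[9:]}"
print(layout)  # 0|10000001|01110000000000000000000
```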

Note: What if the fraction part of a real number cannot be expressed as a sum of powers of two (the way .75 = 1/2 + 1/4 could in the above example)? Ex: 7/5 is exactly 1.4, but .4 cannot be expressed as a finite sum of powers of two, so 7/5 has an infinite binary expansion 1.011001100110011001100...
In a single precision representation, the expansion is rounded off at the twenty-third digit after the binary point.
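A quick way to see the rounding is to look at the 23 stored bits for 1.4 and compare the stored value with the exact fraction 7/5. This is a sketch; note that the literal `1.4` in Python is itself a double precision approximation, which is then rounded again to single precision by `struct.pack`.

```python
import struct
from fractions import Fraction

packed = struct.pack(">f", 1.4)
(bits,) = struct.unpack(">I", packed)
fraction_bits = f"{bits & 0x7FFFFF:023b}"
print(fraction_bits)  # 01100110011001100110011

# The value actually stored, converted back to an exact fraction.
stored = struct.unpack(">f", packed)[0]
error = Fraction(stored) - Fraction(7, 5)
print(float(error))  # small negative number: the stored value is slightly below 1.4
```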

2. Extracting the real number from an SP floating point representation
11000100001001100000000000000000

1|10001000|01001100000000000000000
S|-----E----|-------------M---------------|

Sign = 1 i.e. (-1)^1 = -1, a negative number
Exponent (10001000) = 127 + e, 136 = 127 + e i.e. exponent e = 9;
Mantissa = 1.01001100000000000000000

i.e. Mantissa = (one, plus no halves, plus one quarter (1/4), plus no eighths, plus no sixteenths, plus one thirty-second, plus one sixty-fourth, and all remaining bits are zero)

=> (1 + 1/4 + 1/32 + 1/64) = X
=> (64 + 16 + 2 + 1) = X x 64;
=> X = 83/64;

So the complete number = -(83/64) x 2^9 = -664.00;
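The same decoding steps can be written out in Python. This is a sketch that starts from the three fields used in the working (sign 1, exponent 10001000, mantissa 1.010011 followed by zeros) and cross-checks the result against `struct`; the variable names are mine.

```python
import struct

sign_bit = "1"
exponent_bits = "10001000"
mantissa_bits = "0100110" + "0" * 16   # 23 bits in total

s = int(sign_bit, 2)
e = int(exponent_bits, 2) - 127             # remove the bias: 136 - 127
m = 1 + int(mantissa_bits, 2) / 2**23       # add back the hidden bit
value = (-1) ** s * m * 2**e
print(e, m, value)  # 9 1.296875 -664.0

# Cross-check: let struct interpret the same 32 bits.
all_bits = int(sign_bit + exponent_bits + mantissa_bits, 2)
check = struct.unpack(">f", struct.pack(">I", all_bits))[0]
print(check == value)  # True
```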


What are denormalized floating point numbers? How are they represented?
As you may have noticed from the previous discussion on floating point representation, there are a few serious concerns in the IEEE 754 representation itself.

(-1)^S x 2^(E-127) x 1.M

IEEE 754 single precision floating point representation

How to represent 0.0 in this representation? With the formula above it is not possible to represent zero, as the value is always the product of a power of two and a mantissa greater than or equal to one.

So how do we represent 0.0 then?
Here is the explanation: in the IEEE representation an all-zero exponent field ‘E’ is reserved for zero itself (all-zero mantissa) and for numbers very close to zero, i.e. smaller than 2^-126 in SP floating point representation. 2^-126 is the least positive real number that can be represented in the normalized part of the system discussed earlier.

i.e. 0|00000001|00000000000000000000000
S|-----E-----|---------------M---------------|
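As a small check of these two bit patterns, the sketch below decodes the all-zero word and the word with only the lowest exponent bit set. `sp_from_bits` is a hypothetical helper built on the standard `struct` module.

```python
import math
import struct

def sp_from_bits(bits):
    """Interpret a 32-bit integer as a single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

zero = sp_from_bits(0x00000000)      # exponent field 0, mantissa 0
least_normal = sp_from_bits(1 << 23) # exponent field 1, mantissa 0

print(zero)                                   # 0.0
print(least_normal == math.ldexp(1.0, -126))  # True
```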

Numbers of this kind (closer to zero than 2^-126) are represented in a slightly different way:

the exponent is always kept equal to -126, and the mantissa is a number greater than or equal to zero and less than one (i.e. 0.M instead of 1.M).

Here is an example of how to represent a number very close to zero:

Consider => 5 x 2^-129

The mantissa used to represent the above number is obtained as follows

=> [5 / (2^3)] x 2^-126

=> [0.625 x 2^-126]

=> [(1/2 + 1/8) x 2^-126] = (0.101) x 2^-126

So the representation of 5 x 2^-129 is as shown below

0|00000000|10100000000000000000000|
S|--E-8bit---|0.M-----------23bit-----------|
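A quick check of this example: the sketch below packs 5 x 2^-129 as a single precision value and splits the fields. For a denormalized number the exponent field comes out as all zeros and the mantissa field holds 0.101 directly, with no hidden bit. Variable names are mine.

```python
import math
import struct

x = math.ldexp(5.0, -129)   # 5 x 2^-129, exact as a Python double
(bits,) = struct.unpack(">I", struct.pack(">f", x))

sign = bits >> 31
exponent_field = (bits >> 23) & 0xFF
mantissa_field = bits & 0x7FFFFF
print(f"{sign}|{exponent_field:08b}|{mantissa_field:023b}")
# 0|00000000|10100000000000000000000

# Rebuild the value with the denormalized rule: 0.M x 2^-126
rebuilt = math.ldexp(mantissa_field / 2**23, -126)
print(rebuilt == x)  # True
```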

Numbers with a mantissa less than one (0.M) are said to be denormalized numbers.

Denormalized numbers are stored less accurately than normalized numbers, since the leading zeros of 0.M use up mantissa bits.
So, the least positive real number that can be represented is 2^-149, as shown below.

For single precision

(-1)^S x 2^(-126) x 0.M

Substitute S = 0, E = 0 (the exponent is then fixed at -126), and the smallest non-zero M (23 bits) i.e. 0.M = 2^-23

So the least positive real number = 2^-(126 + 23) = 2^-149

i.e. 0|00000000|00000000000000000000001
S|---E-8bit--|0.M----------23bit-------------|
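To confirm the last bit pattern, this sketch decodes the word with only the lowest mantissa bit set and then shows that a clearly smaller value cannot be stored in single precision at all. It uses only the standard `struct` and `math` modules; the variable names are mine.

```python
import math
import struct

# Only the last mantissa bit is set: 0|00000000|00000000000000000000001
least = struct.unpack(">f", struct.pack(">I", 0x00000001))[0]
print(least == math.ldexp(1.0, -149))  # True

# A quarter of that value is below anything single precision can hold,
# so packing it rounds to zero.
too_small = struct.pack(">f", least / 4)
print(too_small == b"\x00\x00\x00\x00")  # True
```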