IEEE 754 Floating point is the most common representation used to store real numbers in a computer.
Index
1. What are floating point numbers
2. How to represent a floating point number
a. Single and double precision
3. Data ranges
What are floating point numbers:
The term ‘floating point’ refers to the decimal point (or binary point) in a real number. A number written with such a point placed within a group of digits is called a floating point number.
Ex: 28.36, 0.00124
Floating point number representation IEEE 754 standard:
A floating point number contains three components:
1. Sign bit - represents the sign of the floating point number.
2. Exponent - represents the magnitude (power) of the number (Ex: in 1.086 * 10^6 the exponent is 6).
3. Mantissa - represents the precision bits of the number (Ex: in 1.086 * 10^6 the mantissa is 1.086).
Single precision floating point representation (32-bit):
Fig 1: SP floating point representation
S = 1 bit; E = 8 bits - Min = -126, Max = 127, Bias = 127; M = 23 bits.
S |---------E (8 bits)---------||---------------------------M (23 bits)---------------------------|
Double precision floating point representation (64-bit):
Fig 2: DP floating point representation
S = 1 bit; E = 11 bits - Min = -1022, Max = 1023, Bias = 1023; M = 52 bits.
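As a rough illustration of how these fields sit inside a 32-bit word, here is a minimal C sketch (assuming an IEEE 754 single precision float; the helper name show_fields is just for illustration) that extracts the sign, exponent and mantissa with plain shifts and masks:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print the three fields of a single precision float.
   memcpy is used to reinterpret the float as raw bits. */
static void show_fields(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;           /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 bits, fraction only */

    printf("%g -> S=%u E=%u (unbiased %d) M=0x%06X\n",
           f, (unsigned)sign, (unsigned)exponent,
           (int)exponent - 127, (unsigned)mantissa);
}

int main(void)
{
    show_fields(5.75f);
    show_fields(-0.15625f);
    return 0;
}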
Normalized floating point representation:
Components of a floating point representation
In general, to maximize the number of representable values, FP numbers are typically stored in ‘normalized’ form, which places the radix point immediately after the first non-zero digit. Fig 1 & 2 show the normalized representation of floating point (FP) numbers.
To store the number more compactly, it is represented in base 2; since the only possible non-zero digit is 1, the leading 1 before the binary point is not stored explicitly - it is implied.
Denormalized floating point representation:
If the exponent field is all zeros and the fraction part is non-zero, the value is a ‘denormalized’ number, which does not have an assumed leading 1 before the binary point.
Ranges of floating point numbers:
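As a rough feel for these ranges, the C header <float.h> already exposes the smallest and largest positive normalized values for both precisions; a minimal sketch that just prints them:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Smallest and largest positive normalized values */
    printf("float : min = %e, max = %e\n", FLT_MIN, FLT_MAX); /* ~1.18e-38 .. ~3.40e+38 */
    printf("double: min = %e, max = %e\n", DBL_MIN, DBL_MAX); /* ~2.23e-308 .. ~1.80e+308 */
    return 0;
}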
Representation of floating point numbers:
In the IEEE single-precision representation of a real number, one bit is used to represent the sign; it is set to 0 for a positive number and 1 for a negative one. A representation of the exponent is stored in the next 8 bits, and the remaining twenty-three bits hold the mantissa of the number.
How to represent real numbers in floating point format:
Here are some examples:
1. Representing 23/4 as a single precision floating point number.
=> 23/4 = 5.75
Converting the above real number to binary form:
=> 101.11 (5 in binary is 101; .75 = 2^-1 + 2^-2 = .11)
Representing the above binary number in SP floating point format (32-bit):
[(-1)^S x 2^(E - 127) x 1.M]
=> 1.0111 x 2^2; relating this to the equation above:
(The ‘1’ before the binary point is called the hidden bit, as it is implied by the representation and not stored.)
Sign S = 0; No. of bits used to represent the sign = 1
Exponent (E – 127) = 2 i.e. E = 129; No. of bits used to represent exponent = 8
Mantissa M = 0111000…. ; No. of bits used to represent Mantissa = 23
Finally, 5.75 in SP floating point representation is as shown below:
0|10000001|01110000000000000000000
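One way to check this result is to copy the bits of 5.75f into an integer and print them in hexadecimal; the pattern above corresponds to 0x40B80000. A minimal sketch, assuming an IEEE 754 single precision float:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 5.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* reinterpret the float as raw bits */
    printf("5.75f = 0x%08X\n", (unsigned)bits);
    /* expected: 0x40B80000 = 0|10000001|01110000000000000000000 */
    return 0;
}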
Note: What if the fraction part of a real number cannot be expressed as a sum of powers of two (as in the above example, .75 = 1/2 + 1/4)? For example, 7/5 is exactly 1.4, but .4 cannot be expressed as a finite sum of powers of two; 7/5 has the infinite binary expansion 1.011001100110011001100...
In a single precision representation, the expansion is rounded off at the twenty-third digit after the binary point.
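This rounding is easy to observe in practice: printing 1.4f with more digits than single precision can hold shows the nearest representable value rather than 1.4 itself. A quick check in C:

#include <stdio.h>

int main(void)
{
    /* 1.4 has an infinite binary expansion, so the stored single
       precision value is only the nearest 24-bit approximation. */
    printf("%.20f\n", 1.4f);   /* prints 1.39999997615814208984, not 1.40000000000000000000 */
    return 0;
}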
2. Extracting the real number from an SP floating point representation
11000100001001100000000000000000
1|10001000|01001100000000000000000
S|-----E-----|--------------M--------------|
Sign = 1, i.e. (-1)^1 = -1, a negative number
Exponent field (10001000) = 136 = 127 + e, i.e. the exponent e = 9;
Mantissa = 1.01001100000000000000000
i.e. Mantissa = (one, plus no halves, plus one quarter, plus no eighths, plus no sixteenths, plus one thirty-second, plus one sixty-fourth, ... all zeros)
=> (1 + 1/4 + 1/32 + 1/64) = X
=> (64 + 16 + 2 + 1) = X x 64;
=> X = 83/64;
So the complete number = -(83/64) x 2^9 = -664.00;
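Going in the other direction, the bit pattern above is 0xC4260000 in hexadecimal; loading it into a float and printing it should reproduce -664. A minimal sketch:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t bits = 0xC4260000;   /* 1|10001000|01001100000000000000000 */
    float f;
    memcpy(&f, &bits, sizeof f);  /* reinterpret the raw bits as a float */
    printf("0x%08X -> %f\n", (unsigned)bits, f);   /* expected: -664.000000 */
    return 0;
}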
What are denormalized floating point numbers? How are they represented?
As you may have noticed from the discussion of the normalized floating point representation above, there are a few serious gaps in that form of the IEEE 754 representation.
(-1)^S x 2^(E-127) x 1.M
IEEE 754 single precision floating point representation
How do we represent 0.0 in IEEE 754 floating point representation? With the normalized form it is not possible, since the product of a power of two and a mantissa greater than or equal to one can never be zero.
So how do we represent 0.0 then?
Here is the explanation: in the IEEE representation, an all-zero exponent field ‘E’ is reserved for zero and for numbers very close to zero (smaller than 2^-126, the least positive normalized SP floating point number discussed earlier, shown below).
i.e. 0|00000001|00000000000000000000000
S|-----E-----|---------------M---------------|
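For instance, 0.0 itself is simply the all-zero bit pattern (exponent and mantissa both zero), and FLT_MIN from <float.h> is exactly the 2^-126 pattern shown above. A small sketch to confirm this (the helper name float_bits is just for illustration):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <float.h>

/* Return the raw bit pattern of a single precision float */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main(void)
{
    printf("0.0f    = 0x%08X\n", (unsigned)float_bits(0.0f));    /* 0x00000000 */
    printf("FLT_MIN = 0x%08X\n", (unsigned)float_bits(FLT_MIN)); /* 0x00800000 = 2^-126 */
    return 0;
}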
Numbers closer to zero than this are represented in a slightly different way: the exponent is kept fixed at -126, and the mantissa is a value greater than or equal to zero and less than one (i.e. 0.M instead of 1.M).
Here is an example of how to represent a number very close to zero:
Consider => 5 x 2^-129
The mantissa used to represent the above number is derived as follows:
=> [5 / (2^3)] x 2^-126
=> [0.625 x 2^-126]
=> [(1/2 + 1/8) x 2^-126] = (0.101) x 2^-126
So the representation of 5 x 2^-129 is as shown below
0|00000000|10100000000000000000000|
S|--E-8bit--|0.M-----------23bit-----------|
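This can be verified by building the value with ldexpf from <math.h> (which computes 5.0 scaled by 2^-129) and dumping its bits; on an IEEE 754 machine the result should be 0x00500000, matching the layout above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void)
{
    float tiny = ldexpf(5.0f, -129);   /* 5 * 2^-129, a denormalized value */
    uint32_t bits;
    memcpy(&bits, &tiny, sizeof bits);
    printf("5 * 2^-129 = %e = 0x%08X\n", tiny, (unsigned)bits);
    /* expected: 0x00500000 = 0|00000000|10100000000000000000000 */
    return 0;
}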
Numbers whose mantissa is less than one (the 0.M form) are said to be denormalized numbers.
Denormalized numbers are stored less accurately than normalized numbers.
So, the least positive real number that can be represented is 2^-149, as shown below.
For single precision:
(-1)^S x 2^(-126) x 0.M
Substitute S = 0 and the smallest possible non-zero mantissa, i.e. only the last of the 23 bits set, so 0.M = 2^-23.
So the least positive real number = 2^-126 x 2^-23 = 2^-149
i.e. 0|00000000|00000000000000000000001
S|---E-8bit--|0.M----------23bit-------------|
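Again, this can be confirmed by loading that single set bit into a float: the value printed is about 1.4e-45, which is 2^-149 (since C11 the same constant is available in <float.h> as FLT_TRUE_MIN). A minimal sketch:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t bits = 0x00000001;   /* 0|00000000|00000000000000000000001 */
    float smallest;
    memcpy(&smallest, &bits, sizeof smallest);
    printf("2^-149 = %e\n", smallest);   /* about 1.401298e-45 */
    return 0;
}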