Floating-point Operations Tutorial

Post by Tony » Thu Sep 24, 2009 11:12 pm

This article discusses the arithmetic operations on floating point numbers: addition, subtraction, multiplication and division.

Floating point addition:
How does floating point addition work? Here are the steps involved.
Assume both operands are in IEEE 754 floating point format. Consider floating point addition between ‘A’ and ‘B’.

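i.e. A = (-1)^S x 2^Ae x Am
S = sign bit,
Ae = exponent of operand A, i.e. E – 127, where E is the stored (biased) exponent (assuming single precision),
Am = mantissa of operand A, including the hidden leading 1 (i.e. 1.f, where f is the stored fraction). Similarly for operand B (S, Be, Bm).

Therefore, taking both operands as positive, A + B = (Am x 2^Ae + Bm x 2^Be).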
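
To make the representation above concrete, here is a small Python sketch that splits a single precision value into S, Ae and Am. The function name decode_float32 is made up for this example, and it only handles normal numbers (not zero, subnormals, infinity or NaN).

```python
import struct

def decode_float32(x):
    """Return (S, Ae, Am) for x rounded to single precision.

    Am is returned as a 24-bit integer: the hidden 1 followed by the
    23 stored fraction bits, i.e. the mantissa 1.f scaled by 2^23.
    Assumes x is a normal number.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # S
    biased_exp = (bits >> 23) & 0xFF  # E, the stored exponent
    fraction = bits & 0x7FFFFF        # f, the stored fraction bits
    mantissa = (1 << 23) | fraction   # put the hidden bit back in front
    return sign, biased_exp - 127, mantissa

# 6.5 = 1.625 x 2^2, and 1.625 x 2^23 = 13631488
print(decode_float32(6.5))   # (0, 2, 13631488)
```

Dividing the returned mantissa by 2^23 gives back the 1.f value, so 13631488 / 2^23 = 1.625.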

Step by step procedure explaining floating point addition:
  1. Aligning binary points of A and B
    1. Compare Ae and Be. Take the larger as the result exponent and compute the
    exponent difference |Ae – Be|.
  2. If Ae > Be, right shift Bm by that many positions to form Bm x 2^(Be – Ae).
    If Be > Ae, right shift Am by that many positions to form Am x 2^(Ae – Be).
  3. Compute the sum of the aligned mantissas, i.e.
    Bm x 2^(Be – Ae) + Am (or) Am x 2^(Ae – Be) + Bm
  4. If the result needs normalization, perform these steps:
    1. If the result looks like (0.001001…), left shift the result and decrease
      the exponent by one for each position shifted.
    2. If the result looks like (10.01001…), right shift the result and increase
      the exponent by one for each position shifted.
      Continue the applicable step (1 or 2) until the MSB (hidden bit in IEEE 754 standard) is 1.
  5. Check result exponent.
    1. If it is larger than the maximum allowed exponent, return exponent overflow.
    2. If it is smaller than the minimum allowed exponent, return exponent underflow.
  6. If the mantissa is equal to zero, set the exponent to zero.
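
The addition steps above can be sketched in Python as follows. This is only an illustration under some simplifying assumptions: both operands are positive normal single precision numbers, and the bits shifted out during alignment are simply dropped (truncated) instead of being rounded as real hardware would do. The names unpack, pack and fp_add_positive are made up for this example.

```python
import struct

def unpack(x):
    """Return (sign, biased exponent, fraction) of x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def pack(sign, biased_exp, fraction):
    """Build a float from single precision fields."""
    bits = (sign << 31) | (biased_exp << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_add_positive(a, b):
    """Add two positive normal values; shifted-out bits are truncated."""
    _, ae, af = unpack(a)
    _, be, bf = unpack(b)
    am = (1 << 23) | af   # 24-bit mantissa with hidden bit
    bm = (1 << 23) | bf
    # Steps 1 and 2: keep the larger exponent, right shift the other mantissa.
    if ae >= be:
        bm >>= ae - be
        exp = ae
    else:
        am >>= be - ae
        exp = be
    # Step 3: add the aligned mantissas.
    total = am + bm
    # Step 4: the sum of two positive mantissas is below 4.0, so one right
    # shift is enough when the result looks like 1x.xxx (bit 24 set).
    if total >> 24:
        total >>= 1
        exp += 1
    # Step 5: biased exponent 255 is reserved, so treat it as overflow.
    if exp >= 255:
        raise OverflowError("exponent overflow")
    return pack(0, exp, total & 0x7FFFFF)

print(fp_add_positive(1.5, 2.25))   # 3.75
```

With positive operands the sum is never smaller than the larger operand, so the left shift case and the zero mantissa case of steps 4 and 6 do not occur in this sketch.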

Floating point subtraction:
How does it work? Here are the steps involved in floating point subtraction.
Assume both operands are in IEEE 754 floating point format. Consider floating point subtraction between ‘A’ and ‘B’.
Single precision floating point representation: A = (-1)^S x 2^Ae x Am
S = sign bit,
Ae = exponent of operand A, i.e. E – 127, where E is the stored (biased) exponent (assuming single precision),
Am = mantissa of operand A, including the hidden leading 1. Similarly for operand B (S, Be, Bm).

Therefore, taking both operands as positive, A - B = (Am x 2^Ae - Bm x 2^Be).

Step by step procedure explaining floating point subtraction:
  1. Aligning binary points of A and B
    1. Compare Ae and Be. Take the larger as the result exponent and compute the exponent difference |Ae – Be|.
  2. If Ae > Be, right shift Bm by that many positions to form Bm x 2^(Be – Ae). (or) If Be > Ae, right shift Am by that many positions to form Am x 2^(Ae – Be).
  3. Compute the difference of the aligned mantissas, i.e.
    Am - Bm x 2^(Be – Ae) (or) Am x 2^(Ae – Be) - Bm
    If the difference is negative, keep its magnitude and set the sign bit of the result.
  4. If the result needs normalization, perform these steps:
    1. If the result looks like (0.001001…), left shift the result and decrease the exponent by one for each position shifted.
    2. If the result looks like (10.01001…), right shift the result and increase the exponent by one for each position shifted.
      Continue the applicable step (1 or 2) until the MSB (hidden bit in IEEE 754 standard) is 1.
  5. Check result exponent.
    1. If it is larger than the maximum allowed exponent, return exponent overflow.
    2. If it is smaller than the minimum allowed exponent, return exponent underflow.
  6. If the mantissa is equal to zero, set the exponent to zero.
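
The subtraction steps can be sketched the same way. The point of this example is the left shift normalization, which is needed when the leading bits cancel. As assumptions, both operands are positive normal single precision numbers, bits shifted out during alignment are truncated rather than rounded, and a result below the normal range is reported as underflow instead of producing a subnormal. The names unpack, pack and fp_sub_positive are made up for this example.

```python
import struct

def unpack(x):
    """Return (sign, biased exponent, fraction) of x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def pack(sign, biased_exp, fraction):
    """Build a float from single precision fields."""
    bits = (sign << 31) | (biased_exp << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_sub_positive(a, b):
    """Compute a - b for positive normal a and b (truncating alignment)."""
    _, ae, af = unpack(a)
    _, be, bf = unpack(b)
    am = (1 << 23) | af   # 24-bit mantissa with hidden bit
    bm = (1 << 23) | bf
    # Steps 1 and 2: keep the larger exponent, right shift the other mantissa.
    if ae >= be:
        bm >>= ae - be
        exp = ae
    else:
        am >>= be - ae
        exp = be
    # Step 3: subtract; a negative difference becomes sign bit + magnitude.
    diff = am - bm
    sign = 0
    if diff < 0:
        sign, diff = 1, -diff
    # Step 6: zero mantissa means the result is zero (exponent field 0).
    if diff == 0:
        return pack(0, 0, 0)
    # Step 4: left shift until the hidden bit (bit 23) is 1.
    while not (diff >> 23):
        diff <<= 1
        exp -= 1
    # Step 5: biased exponent 0 or below is outside the normal range.
    if exp <= 0:
        raise ArithmeticError("exponent underflow")
    return pack(sign, exp, diff & 0x7FFFFF)

print(fp_sub_positive(1.5, 1.25))   # 0.25
print(fp_sub_positive(1.0, 2.5))    # -1.5
```

In the first call the mantissa difference is 0.01 in binary, so the loop shifts left twice and the exponent drops by two.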

Floating point multiplication:
How does it work? Here are the steps involved in floating point multiplication.
Assume both operands are in IEEE 754 floating point format. Consider floating point multiplication between ‘A’ and ‘B’.

Single precision floating point representation: A = (-1)^(AS) x 2^Ae x Am
Step by step procedure explaining floating point multiplication:
  1. If either operand is equal to zero, return zero as the result.
  2. Compute the sign: AS XOR BS
  3. Multiply the mantissas: Am x Bm, and round the product to the allowed number of mantissa bits.
  4. Compute the exponent of the result.
    1. Result exponent = biased exponent A + biased exponent B – bias;
  5. Normalize the result: shift the mantissa and increment the result exponent if needed.
  6. Check the result exponent:
    1. If larger than the maximum exponent allowed, then return overflow.
    2. If smaller than the minimum exponent allowed, then return underflow.
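
The multiplication steps can be sketched in Python like this. Again this is only an illustration: it assumes normal single precision operands (or zero), it normalizes the 48-bit product first and then drops the low bits by truncation instead of rounding, and it reports results outside the normal exponent range as overflow or underflow. The names unpack, pack and fp_mul are made up for this example.

```python
import struct

BIAS = 127

def unpack(x):
    """Return (sign, biased exponent, fraction) of x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def pack(sign, biased_exp, fraction):
    """Build a float from single precision fields."""
    bits = (sign << 31) | (biased_exp << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_mul(a, b):
    """Multiply two normal (or zero) values; low product bits are truncated."""
    # Step 1: zero operand gives zero.
    if a == 0.0 or b == 0.0:
        return 0.0
    a_sign, ae, af = unpack(a)
    b_sign, be, bf = unpack(b)
    # Step 2: sign of the result.
    sign = a_sign ^ b_sign
    # Step 3: multiply the 24-bit mantissas; the 48-bit product is scaled by
    # 2^46 and represents a value in [1, 4).
    product = ((1 << 23) | af) * ((1 << 23) | bf)
    # Step 4: add the biased exponents and remove one bias.
    exp = ae + be - BIAS
    # Step 5: if the product is 2.0 or more (bit 47 set), shift right once.
    if product >> 47:
        product >>= 1
        exp += 1
    mantissa = product >> 23   # keep the top 24 bits, truncate the rest
    # Step 6: check the exponent against the normal range 1..254.
    if exp >= 255:
        raise OverflowError("exponent overflow")
    if exp <= 0:
        raise ArithmeticError("exponent underflow")
    return pack(sign, exp, mantissa & 0x7FFFFF)

print(fp_mul(1.5, 2.5))    # 3.75
print(fp_mul(-3.0, 3.0))   # -9.0
```

In the second call the mantissa product is 1.5 x 1.5 = 2.25, so step 5 shifts right once and the exponent is incremented.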