Floating point

Single-precision binary
Double-precision binary representation
Subnormal numbers
References

Floating point is an approximation of real-number arithmetic using a finite number of bits. IEEE 754 is the floating point standard that’s used almost universally.
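A quick way to see the approximation at work is to compare a floating point sum with the decimal value it is meant to represent. The sketch below assumes Python's `float` is an IEEE 754 double-precision value, which holds on common platforms; the variable names are chosen only for illustration.

```python
from fractions import Fraction

# Assumption: Python's float is an IEEE 754 double-precision number.
total = 0.1 + 0.2

# The sum is not exactly the double nearest to 0.3 ...
exactly_equal = (total == 0.3)

# ... but it is very close to it.
close_enough = abs(total - 0.3) < 1e-15

# Fraction(x) recovers the exact rational value stored in a float,
# which shows that the literal 0.1 is itself only approximated.
stored_is_one_tenth = (Fraction(0.1) == Fraction(1, 10))
```

Here `exactly_equal` and `stored_is_one_tenth` are both `False`, while `close_enough` is `True`.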

Floating point represents numbers similarly to scientific notation. A number is represented as:

$(- 1)^{sign} \times significand \times 2^{exponent}$ [1, P. 7].

The significand (or mantissa) contains the significant digits [1, P. 5].

The exponent signifies the integer power that 2 is raised to (or 10 in the case of decimal floating point) [1, P. 4].

The sign denotes the sign of the significand.
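As a worked example (the value $-6.5$ is picked here purely for illustration), the three parts can be read off by writing the number in this form:

$- 6.5 = (- 1)^{1} \times 1.625 \times 2^{2}$

so the sign is 1, the significand is 1.625, and the exponent is 2.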

There are multiple floating point formats used to store the <sign, significand, exponent> triple. The common formats are single-precision and double-precision binary.

Each format also contains representations for +Infinity, -Infinity, -0, +0, qNaN (quiet NaN), and sNaN (signaling NaN) [1, P. 8].

Single-precision binary

Single-precision binary representation uses 32 bits to encode a floating point number.

Figure: single precision floating point

The most significant bit is the sign bit [1, P. 9].

The 8-bit biased exponent field stores the exponent in biased notation. The bits represent an unsigned integer, and the exponent is calculated by subtracting the bias ( $127$ ) from that stored value: $exponent = biased\_exponent - 127$ [1, Pp. 9, 13].

The 23-bit trailing significand field represents the fractional part of the significand (i.e. the part that comes after the binary point). The significand has an implicit integer part of 1, unless the biased exponent field bits are all 0s or all 1s.

For normal numbers, the trailing significand field is read as an unsigned integer, and $significand = 1 + 2^{- 23} \times trailing\_significand\_value$ .
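The field layout and the two formulas above can be sketched in Python using the standard `struct` module. This is a minimal illustration rather than a reference implementation: the function names `float32_fields` and `decode_normal` are hypothetical, and the sample value $-6.5$ is arbitrary.

```python
import struct

def float32_fields(x):
    """Round x to single precision and split its 32 bits into three fields."""
    # Pack as a big-endian 32-bit float, then reread the same bytes
    # as an unsigned 32-bit integer to get at the raw bits.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # most significant bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits
    trailing_significand = bits & 0x7FFFFF # low 23 bits
    return sign, biased_exponent, trailing_significand

def decode_normal(sign, biased_exponent, trailing_significand):
    """Rebuild the value of a normal single-precision number from its fields."""
    # Normal numbers exclude the all-0s and all-1s exponent patterns.
    assert 0 < biased_exponent < 255
    exponent = biased_exponent - 127
    significand = 1 + trailing_significand * 2.0 ** -23
    return (-1) ** sign * significand * 2.0 ** exponent
```

For example, `float32_fields(-6.5)` returns `(1, 129, 0x500000)`: the exponent is $129 - 127 = 2$ and the significand is $1 + 2^{- 23} \times 5242880 = 1.625$, so `decode_normal(1, 129, 0x500000)` gives back `-6.5`.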

Infinity is represented by all biased exponent field bits set to 1 and all trailing significand field bits set to 0. The sign bit determines whether it is +Infinity or -Infinity [1, P. 9].

NaN is represented by all biased exponent field bits set to 1 and a nonzero trailing significand field [1, P. 9].

Subnormal numbers are represented by all biased exponent field bits set to 0 and a nonzero trailing significand field (see Subnormal numbers) [1, P. 9].

Zero is represented by all 0s in both the biased exponent field and the trailing significand field, with the sign bit indicating whether it is +0 or -0 [1, P. 9].
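The encoding rules for single precision can be summarized as a small classifier over raw 32-bit patterns. The sketch below is illustrative only; the function name `classify_float32` and the returned labels are hypothetical, and it does not distinguish quiet from signaling NaNs.

```python
def classify_float32(bits):
    """Classify a 32-bit pattern by its exponent and trailing significand fields."""
    biased_exponent = (bits >> 23) & 0xFF
    trailing_significand = bits & 0x7FFFFF

    if biased_exponent == 0xFF:
        # All exponent bits set: infinity or NaN.
        return "infinity" if trailing_significand == 0 else "NaN"
    if biased_exponent == 0:
        # All exponent bits clear: zero or subnormal.
        return "zero" if trailing_significand == 0 else "subnormal"
    return "normal"
```

The sign bit is deliberately ignored, so `0x7F800000` (+Infinity) and `0xFF800000` (-Infinity) both classify as `"infinity"`, and `0x00000000` (+0) and `0x80000000` (-0) both classify as `"zero"`.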

Double-precision binary representation

Double-precision binary representation uses 64 bits to encode a floating point datum [1, P. 13].

Figure: double precision floating point

Double precision uses an 11-bit biased exponent field with a bias of 1023 [1, P. 13].

The trailing significand field is 52 bits.
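The same decoding approach carries over to double precision with the wider fields. The sketch below assumes that normal double-precision numbers follow the single-precision rule with the constants swapped, i.e. a bias of 1023 and $significand = 1 + 2^{- 52} \times trailing\_significand\_value$; the function names are hypothetical.

```python
import struct

def float64_fields(x):
    """Split a double-precision value into sign, biased exponent, trailing significand."""
    # Reread the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                               # most significant bit
    biased_exponent = (bits >> 52) & 0x7FF          # next 11 bits
    trailing_significand = bits & ((1 << 52) - 1)   # low 52 bits
    return sign, biased_exponent, trailing_significand

def decode_normal64(sign, biased_exponent, trailing_significand):
    """Rebuild the value of a normal double-precision number from its fields."""
    # Normal numbers exclude the all-0s and all-1s exponent patterns.
    assert 0 < biased_exponent < 0x7FF
    exponent = biased_exponent - 1023
    significand = 1 + trailing_significand * 2.0 ** -52
    return (-1) ** sign * significand * 2.0 ** exponent
```

For example, `float64_fields(1.0)` returns `(0, 1023, 0)` and `float64_fields(-2.0)` returns `(1, 1024, 0)`.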

Subnormal numbers

Subnormal numbers are nonzero numbers whose magnitude is smaller than the smallest normal number, that is, smaller than what can be expressed using the format’s minimum exponent. Subnormal numbers follow different encoding rules [1, P. 8].

For example, in single-precision binary encoding, a subnormal number is defined as any number with all bits set to 0 in the biased exponent field and a nonzero value in the trailing significand field. Subnormal single-precision numbers are calculated as $(- 1)^{sign} \times 2^{- 126} \times 2^{- 23} \times trailing\_significand\_value$ [1, P. 9].
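The subnormal formula can be checked against the bit-level interpretation with a short Python sketch. The function names `decode_subnormal32` and `bits_to_float32` are hypothetical and used only for illustration.

```python
import struct

def decode_subnormal32(sign, trailing_significand):
    """Value of a subnormal single-precision number from its fields."""
    # No implicit leading 1, and the exponent is fixed at -126.
    return (-1) ** sign * 2.0 ** -126 * 2.0 ** -23 * trailing_significand

def bits_to_float32(bits):
    """Interpret a raw 32-bit pattern as a single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

# Smallest positive subnormal: exponent bits all 0, trailing significand 1.
smallest = decode_subnormal32(0, 1)
```

Here `smallest` equals $2^{- 149}$, which matches `bits_to_float32(0x00000001)`.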

References

[1] IEEE, “IEEE-754, Standard for Floating-Point Arithmetic,” IEEE Std 754-2008, pp. 1–58, Jan. 2008.