Floating point problems in computers

Amalakanthan R
4 min read · May 23, 2021

What are Floating point errors?

Computers aren’t always as precise as we believe. They are very good at doing what they are told and can complete tasks quickly, but in many situations a minor error can have major ramifications. Floating point errors are a well-known problem. The accuracy with which a number can be represented as a floating point value is limited: the number actually stored in memory is always rounded to the nearest representable value. Although that accuracy is extremely high and sufficient for most applications, even a tiny error can accumulate and cause problems in some circumstances. In critical applications, such conditions must be identified and avoided through careful analysis.
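For instance, here is a minimal Python sketch (illustrative only) of the effect:

```python
# 0.1 and 0.2 have no exact binary representation, so their rounded sum
# is not exactly equal to the stored representation of 0.3.
a = 0.1 + 0.2
print(a)         # 0.30000000000000004
print(a == 0.3)  # False

# Comparing with a tolerance is the usual workaround.
import math
print(math.isclose(a, 0.3))  # True
```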

Errors In Floating Point Calculations

A binary integer can exactly represent any decimal integer (1, 10, 3462, 948503, etc.). The only drawback is that in programming an integer type normally has lower and upper limits. A 32-bit integer type, for example, can represent:

• 4,294,967,296 values in total

• Signed type: from -2^31 to 2^31 - 1 (-2,147,483,648 to 2,147,483,647)

• Unsigned type: from 0 to 2^32 - 1 (0 to 4,294,967,295)

The constraints are straightforward, and the integer type can represent any whole number within those parameters.
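Those limits are easy to recompute; a quick Python sketch confirming the figures above:

```python
# Recomputing the 32-bit integer limits quoted above.
total_values = 2 ** 32
signed_min, signed_max = -(2 ** 31), 2 ** 31 - 1
unsigned_min, unsigned_max = 0, 2 ** 32 - 1

print(f"{total_values:,}")                      # 4,294,967,296
print(f"{signed_min:,} to {signed_max:,}")      # -2,147,483,648 to 2,147,483,647
print(f"{unsigned_min:,} to {unsigned_max:,}")  # 0 to 4,294,967,295
```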

Floating point numbers, however, have additional limitations in the fractional part of a number (everything after the decimal point). Like the integer format, the floating point format used in computers is limited to a certain size (number of bits). As a consequence, the precision with which it can represent a number is limited.

If a calculation’s result is rounded and then used in further calculations, the rounding error will skew subsequent results. Because the size of floating point numbers is restricted, they can represent only a limited set of values; everything in between must be rounded to the nearest representable value. This can result in (often minor) errors in a stored number. The most vulnerable systems are those that perform a large number of calculations or that run for months or years without being restarted.
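A small Python sketch (illustrative only) of how repeated rounded additions drift away from the exact result:

```python
# Each addition of 0.1 is rounded to the nearest representable double,
# and after many additions the tiny rounding errors have accumulated.
total = 0.0
for _ in range(10_000):
    total += 0.1

print(total)           # close to, but not exactly, 1000.0
print(total - 1000.0)  # the accumulated drift (a small nonzero value)
```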

The problem of scale is another concern that arises with floating point numbers. The scale of the number is determined by the exponent, which means the format can be used for very large or very small numbers. But when two numbers with very different scales are used in a calculation (for example, a very large number and a very small number), the smaller number can be lost entirely because its digits do not fit into the larger number’s scale.
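To illustrate the scale problem, here is a short Python sketch: a double carries roughly 15 to 17 significant decimal digits, so a value of 1.0 disappears entirely next to 10^16.

```python
big = 1.0e16
small = 1.0

# The sum is rounded to the nearest representable double, and at this
# magnitude the spacing between doubles is 2.0, so adding 1.0 changes nothing.
print(big + small == big)   # True: the small number is absorbed
print((big + small) - big)  # 0.0 instead of 1.0
```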

Rounding In Decimal Numbers and Fractions

Examples from our well-known decimal system can be used to explain the issue of binary floating point rounding errors. The fraction 1/3 seems very straightforward, but its decimal representation is a little more complicated: 0.333333333… with an infinite number of repeating 3s. So even in the decimal system there are cases with too many digits to write out in full. We often reduce (round) numbers to a size that is comfortable for us and meets our requirements; 1/3, for example, can be written as 0.333.

What happens if we try to add (1/3) and (1/3)? Adding the rounded values 0.333 + 0.333 gives 0.666. Adding the fractions exactly, (1/3) + (1/3), gives 0.666666… with an infinite number of 6s, which we would most likely round to 0.667.

This example demonstrates how limiting ourselves to a certain number of digits makes us lose accuracy quickly. After just one addition we have already lost a part of the result that may or may not be significant (depending on our situation). Now consider a computer that can only represent three fractional digits. The example above illustrates how rounded intermediate results can propagate and lead to incorrect final results.
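The three-digit thought experiment can be reproduced with Python’s decimal module (note that its precision setting counts significant digits, which for these values matches three fractional digits):

```python
from decimal import Decimal, getcontext

getcontext().prec = 3  # pretend our machine keeps only three digits

third = Decimal(1) / Decimal(3)
print(third)                    # 0.333  (already rounded)
print(third + third)            # 0.666  (built from rounded intermediates)
print(Decimal(2) / Decimal(3))  # 0.667  (rounding the exact result instead)
```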

Binary floating point formats can, of course, represent far more than three fractional digits. But even though the error is much smaller when only the 100th or 1000th fractional digit is cut off, it can still have significant consequences if the results are processed in long chains of calculations or reused repeatedly, propagating the error.

Numbers in Computers

A machine must perform exactly the kind of rounding shown in the preceding example. Because a binary floating point format can only represent a limited set of numbers, it is often forced to settle for the nearest value it can represent. The precision of real floating point formats is, of course, far higher than three digits.
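You can actually inspect the value a machine stores instead of 0.1: converting a Python float to Decimal reveals the exact binary double chosen as the nearest representable number.

```python
from decimal import Decimal

# The double nearest to 0.1 is not 0.1 itself.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```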

Floating Point In Binary

The IEEE 754 standard (published by the Institute of Electrical and Electronics Engineers, a large standards organization) defines the two most common floating point storage formats:

32-bit short real (also called single precision)

1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa

64-bit long real (also called double precision)

1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa
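Both formats can be seen from Python via the struct module; Python’s own float is a 64-bit double, and packing it into 32 bits shows the extra rounding the smaller format introduces (a minimal sketch):

```python
import struct

x = 0.1

double_bytes = struct.pack('<d', x)  # IEEE 754 double precision, 8 bytes
single_bytes = struct.pack('<f', x)  # IEEE 754 single precision, 4 bytes
print(len(double_bytes), len(single_bytes))  # 8 4

# Round-tripping through single precision shows the coarser rounding
# caused by the 23-bit mantissa.
print(struct.unpack('<f', single_bytes)[0])  # 0.10000000149011612
```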

Here’s an example of a floating point number in scientific notation: +34.890625 × 10^4. In this case the sign is plus, the exponent is 4, and the mantissa is 34.890625. The base is ten because we are using the decimal system.

Here’s some more detail on the bit areas:

• Sign

Just a single bit. The sign bit shows whether the number is negative: a 1 indicates a negative number, while a 0 indicates a positive number.

• Mantissa

The 23 mantissa bits (in single precision floating point) can represent 2^23 = 8,388,608 different values. The mantissa is also called the significand.

• Exponent

The 8 exponent bits (in single precision floating point) can represent 256 different values. To allow both negative and positive exponents, the stored exponent is biased by 127: roughly the first half of the range (0–127) is used for negative exponents, while the second half (128–255) is used for positive exponents. A positive exponent of 5 is stored as 5 + 127 = 132; a negative exponent of -8 is stored as -8 + 127 = 119.
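As a final sketch, the three bit fields of a single-precision number can be extracted with a few shifts and masks (the helper name float32_fields is just illustrative):

```python
import struct

def float32_fields(x: float):
    """Split a number's single-precision encoding into sign, exponent, mantissa."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign     = bits >> 31           # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(-6.25)  # -6.25 = -1.5625 * 2**2
print(sign)           # 1   (negative)
print(exponent)       # 129 (actual exponent: 129 - 127 = 2)
print(bin(mantissa))  # 0b10010000000000000000000 (fraction bits of 1.5625)
```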
