Floating point data type issues and solutions

Amalakanthan R
5 min read · May 31, 2021

Real numbers are defined using floating point numbers in VHDL (the VHSIC Hardware Description Language), and the predefined floating point type in VHDL is named real. It defines a floating point number in a range of at least -1.0e38 to +1.0e38. Most commercial synthesis tools do not handle real values directly, because floating point arithmetic does not map simply onto hardware. This is a critical issue for many FPGA designs. In practice, integer or fixed point numbers must be used instead because they can be directly and easily synthesized into hardware.

Take a straightforward calculation whose answer you already know: the right answer is 0.4. However, that is not quite the response that our system provides.
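A minimal Java sketch of this kind of calculation follows. The operands 0.7 and 0.3, the class name RoundingDemo, and the helper subtract are assumptions chosen for illustration; they are not taken from the original listing.

```java
// Illustrative only: operands and names are assumptions, not the article's original example.
class RoundingDemo {
    // Plain binary double subtraction.
    public static double subtract(double a, double b) {
        return a - b;
    }

    public static void main(String[] args) {
        double result = subtract(0.7, 0.3);
        // Mathematically the answer is 0.4, but neither 0.7 nor 0.3 is stored exactly.
        System.out.println(result);        // prints 0.39999999999999997
        System.out.println(result == 0.4); // prints false
    }
}
```

The difference from 0.4 is tiny, but an equality check against the expected value still fails.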

This is not the exact answer you calculated, is it? There is a reason for this: when a computer must use floating-point numbers, it encounters a problem. The term for this is “Floating-Point Rounding Error.” Let’s start with the basics of floating-point numbers before going on to the Floating-Point Rounding Error.

IEEE 754 Standard

The IEEE 754 standard is the most widely used floating point standard. According to this standard, floating point numbers are expressed with 32 bits (single precision) or 64 bits (double precision).

In this section, we’ll look only at the format of 32-bit floating point values.

32-bit floating point numbers are represented as follows according to the IEEE standard:

Sign: A single bit that represents the number’s sign (positive or negative).

Exponent: Both positive and negative exponents are represented in this 8-bit field. To obtain the stored exponent, a bias (127 for single precision) is added to the actual exponent.

Mantissa: The significant digits of the number, as in scientific notation, make up this 23-bit field.
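The three fields can be pulled out of a float with a few bit operations. This is a small sketch using the standard Float.floatToIntBits method; the class name FloatFields and the method fields are hypothetical names used for illustration.

```java
// Illustrative sketch: splits a 32-bit float into its IEEE 754 fields.
class FloatFields {
    // Returns {sign, storedExponent, mantissa} for a single-precision value.
    public static int[] fields(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = bits >>> 31;              // 1 bit
        int exponent = (bits >>> 23) & 0xFF; // 8 bits, stored with a bias of 127
        int mantissa = bits & 0x7FFFFF;      // 23 bits (the leading 1 is implicit)
        return new int[] {sign, exponent, mantissa};
    }

    public static void main(String[] args) {
        // -6.5 = -1.625 * 2^2, so the stored exponent is 2 + 127 = 129
        // and the fraction bits are 101 followed by zeros (0x500000).
        int[] f = fields(-6.5f);
        System.out.println("sign=" + f[0]
                + " exponent=" + f[1]
                + " mantissa=0x" + Integer.toHexString(f[2]));
        // prints sign=1 exponent=129 mantissa=0x500000
    }
}
```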

Why do Floating point Rounding Errors occur in computers?

In order to fit an infinite number of real numbers into a finite number of bits, an approximate representation is required. Although there are infinitely many integers, most programs can store the results of integer computations in 32 bits. Most real-number operations, on the other hand, will produce quantities that cannot be precisely represented, whatever fixed number of bits is chosen. As a result, the output of a floating-point calculation is frequently rounded to fit back into its finite form. This rounding error is a common occurrence in floating-point calculations. It is usually measured as relative error or in units in the last place (ulps).
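To make the approximation visible, the sketch below prints the exact value a double actually stores. It relies on the BigDecimal(double) constructor from java.math, which converts the binary value without further rounding; the class name ExactValue and the helper exact are hypothetical.

```java
import java.math.BigDecimal;

// Illustrative sketch: shows the exact binary value behind a decimal literal.
class ExactValue {
    // BigDecimal(double) preserves the stored binary value digit for digit.
    public static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    public static void main(String[] args) {
        // 0.5 is a power of two, so it is stored exactly.
        System.out.println(exact(0.5)); // prints 0.5
        // 0.1 has no finite binary expansion, so the nearest double is stored instead.
        System.out.println(exact(0.1));
        // prints 0.1000000000000000055511151231257827021181583404541015625
    }
}
```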

Does it matter if the basic arithmetic operations involve a little more rounding error than necessary, since most floating-point calculations involve rounding error anyway? This question is a recurring theme. Guard digits are a method of reducing the error when subtracting two nearby numbers. IBM deemed guard digits to be so vital that in 1968 it added one to the double precision format in the System/360 architecture (single precision already had one) and retrofitted all existing machines in the field.

The IEEE standard requires more than just the use of a guard digit. It specifies an algorithm for addition, subtraction, multiplication, division, and square root, and requires that implementations produce the same result as that algorithm. As a result, assuming both computers meet the IEEE standard, the results of basic operations will be identical in every bit when a program is moved from one machine to another. This substantially simplifies the porting of programs.

Solutions for Floating point error

Rational

Represent the number as a whole part plus a fraction with a numerator and a denominator. The number 15.589 is written as w: 15, n: 589, and d: 1000.

Adding 0.25 (which is w: 0, n: 1, d: 4) to it involves calculating the LCM of the two denominators and then adding the scaled numerators. This works well in many instances, but when working with a large number of rational numbers whose denominators are relatively prime to each other, it can result in very large numbers.
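The addition described above can be sketched as follows. The class Rational, its factory method of, and its field names are hypothetical; BigInteger is used so that growing denominators do not overflow. The sketch stores the value as a single fraction and splits out the whole part only when printing, and it assumes non-negative inputs.

```java
import java.math.BigInteger;

// Illustrative sketch of exact rational addition; all names are hypothetical.
final class Rational {
    final BigInteger num; // numerator (carries the sign)
    final BigInteger den; // denominator, always positive

    Rational(BigInteger num, BigInteger den) {
        if (den.signum() == 0) throw new ArithmeticException("zero denominator");
        if (den.signum() < 0) { num = num.negate(); den = den.negate(); }
        BigInteger g = num.gcd(den); // at least 1 because den is non-zero
        this.num = num.divide(g);    // store in lowest terms
        this.den = den.divide(g);
    }

    // Builds whole + n/d, e.g. of(15, 589, 1000) for 15.589 (non-negative parts assumed).
    static Rational of(long whole, long n, long d) {
        BigInteger bd = BigInteger.valueOf(d);
        return new Rational(BigInteger.valueOf(whole).multiply(bd).add(BigInteger.valueOf(n)), bd);
    }

    // Scales both numerators to the LCM of the denominators, then adds them.
    Rational add(Rational other) {
        BigInteger lcm = den.multiply(other.den).divide(den.gcd(other.den));
        BigInteger sum = num.multiply(lcm.divide(den))
                .add(other.num.multiply(lcm.divide(other.den)));
        return new Rational(sum, lcm);
    }

    @Override
    public String toString() {
        BigInteger[] qr = num.divideAndRemainder(den);
        return "w: " + qr[0] + ", n: " + qr[1] + ", d: " + den;
    }

    public static void main(String[] args) {
        Rational sum = Rational.of(15, 589, 1000).add(Rational.of(0, 1, 4));
        System.out.println(sum); // prints w: 15, n: 839, d: 1000
    }
}
```

No rounding happens at any step, which is the appeal of this approach; the cost is that numerators and denominators can grow without bound.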

Fixed Point

You have both the whole and decimal parts, and all figures are rounded to a fixed precision. You may, for example, have a fixed point with three decimal places. For the decimal component, 15.589 + 0.250 is computed as (589 + 250) % 1000, with any carry added to the whole part. This integrates well with current databases. As previously said, there is some rounding, but you know where it is and can set the precision to be more exact than is required (if you are only measuring to 3 decimal places, use 4).
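One common way to implement this is to store every value as an integer count of thousandths, so the carry from the decimal part to the whole part happens automatically. The sketch below takes that approach; the class FixedPoint3 and its methods are hypothetical names, not an existing library.

```java
// Illustrative fixed-point sketch with three decimal places; names are hypothetical.
class FixedPoint3 {
    static final long SCALE = 1000; // 10^3, i.e. three decimal places

    // Builds the scaled value for whole.thousandths, e.g. of(15, 589) for 15.589.
    public static long of(long whole, long thousandths) {
        return whole * SCALE + thousandths;
    }

    // Integer addition is exact; addExact throws on overflow instead of wrapping.
    public static long add(long a, long b) {
        return Math.addExact(a, b);
    }

    public static String format(long value) {
        String sign = value < 0 ? "-" : "";
        long abs = Math.abs(value);
        // Adding SCALE and dropping the leading digit zero-pads the fraction to 3 digits.
        String frac = Long.toString(abs % SCALE + SCALE).substring(1);
        return sign + (abs / SCALE) + "." + frac;
    }

    public static void main(String[] args) {
        long sum = add(of(15, 589), of(0, 250));
        System.out.println(format(sum));                       // prints 15.839
        System.out.println(format(add(of(0, 900), of(0, 250)))); // prints 1.150
    }
}
```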

Big Decimal

The BigDecimal class in Java can be used to perform computations on decimal numbers and obtain exact results.

The BigDecimal class supports arithmetic, comparison, rounding, and hashing operations on arbitrary-precision decimal values. It keeps full precision when dealing with both very large and very small numbers.

The BigDecimal class allows us to perform computations on decimal values while avoiding the Floating-Point Rounding Error. It also has a number of methods for performing arithmetic operations.
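A short sketch of BigDecimal in use follows. The operands 0.7 and 0.3 and the class name BigDecimalDemo are assumptions for illustration. The values are built from strings on purpose: constructing a BigDecimal from a double would carry the binary rounding error along with it.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative sketch; operands and names are assumptions.
class BigDecimalDemo {
    // Parses both operands as exact decimals, then subtracts.
    public static BigDecimal subtract(String a, String b) {
        return new BigDecimal(a).subtract(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // Binary double arithmetic: the result is close to, but not exactly, 0.4.
        System.out.println(0.7 - 0.3 == 0.4); // prints false

        // Decimal arithmetic: the result is exactly 0.4.
        System.out.println(subtract("0.7", "0.3")); // prints 0.4

        // Division that does not terminate needs an explicit scale and rounding mode.
        BigDecimal third = BigDecimal.ONE.divide(new BigDecimal("3"), 10, RoundingMode.HALF_UP);
        System.out.println(third); // prints 0.3333333333
    }
}
```

When comparing BigDecimal values, compareTo is usually the safer choice, since equals also compares scale (0.4 and 0.40 are not equal under equals).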
