Tuesday, June 26, 2007

How Floating Point Works

A floating-point number is neither a Real number nor an Interval --- it is an exact rational number. A floating-point number is not uncertain or imprecise, it exactly and precisely has the value of a rational number. However, not every rational number is represented. Floating-point numbers are drawn from a carefully selected set of rational numbers. The particulars vary among the different floating-point formats. The popular IEEE 754 standard uses these: Single Precision The significand is an integer in the range [8388608, 16777215) which is multiplied by a power of 2 in the range 2-149 = 1/713623846352979940529142984724747568191373312 2104 = 20282409603651670423947251286016 Double Precision The significand is an integer in the range [4503599627370496, 9007199254740991) which is multiplied by a power of 2 in the range 2-1074 = 1/202402253307310618352495346718917307049556649764142118356901358027430339567995346891960383701437124495187077864316811911389808737385793476867013399940738509921517424276566361364466907742093216341239767678472745068562007483424692698618103355649159556340810056512358769552333414615230502532186327508646006263307707741093494784 2971 = 19958403095347198116563727130368385660674512604354575415025472424372118918689640657849579654926357010893424468441924952439724379883935936607391717982848314203200056729510856765175377214443629871826533567445439239933308104551208703888888552684480441575071209068757560416423584952303440099278848 Since the power of 2 can be negative, it is possible to represent fractions. For instance, the fraction 3/8 is represented as

            -25     12582912
12582912 * 2   = ------------- = 3/8
                    33554432

in single precision floating point. In double-precision, 3/8 is represented as

                    -54    6755399441055744
6755399441055744 * 2   = ------------------- = 3/8
                          18014398509481984
. Not every fraction can be represented. For example, the fractions 1/10 and 2/3 cannot be represented. Only those fractions whose denominator is a power of 2 can be represented. It is possible to represent integers. For instance, the integer 123456 is represented as

            -7    15802368
15802368 * 2   = --------- = 123456
                       128
. Not every integer can be represented. For example the integer 16777217 cannot be represented in single precision. -------- When floating-point numbers appear in programs as literal values, or when they are read by a program from some data source, they are usually expressed as a decimal number. For example, someone might enter "38.21" into a form. Some decimal numbers, like 0.125, can be represented as a floating-point number, but most fractional decimal numbers have no representation. If a literal number cannot be represented, then the system usually silently substitutes a nearby representable number. For example, if the expression 0.1 appears as a literal in the code,the actual floating point value used may be

            -27     13421773 
13421773 * 2   = ------------- > .1
                   134217728 
. as another example, 38.21

            -18     10016522
10016522 * 2   = ----------- < 38.21
                      262144
. Most implementations substitute the nearest representable number,but this has not always been the case. As it turns out, much of the error in floating-point code comes from the silent substitution performed on input. -------- Floating-point numbers are usually printed as decimal numbers. All binary floating-point numbers have an exact decimal equivalent, but it usually has too many digits to be practical. For example, the single-precision floating-point number

            -25     13421773
13421773 * 2    = ----------- = 0.4000000059604644775390625
                    33554432
. Furthermore, the extra digits may be misconstrued as precision that doesn't exist. The next largest float above that:

             -25     6710887
13421774 * 2    = ----------- = 0.400000035762786865234375
                    16777216
. doesn't share any digits beyond the first 7. There is no standard way to print the decimal expansion of a floating-point number and this can lead to some bizarre behavior. However, recent floating point systems print decimal numbers with only as many digits as necessary to ensure that the number will be exactly reconstructed if it is read in. This is intended to work correctly even if the implementation must substitute a nearby representable number on input. This allows one to `round-trip' floating-point numbers through their decimal representation without error. (NOTE: Not all floating point systems use this method of parsimonious printing. One example is Python.) There is a curious side effect of this method of printing. A literal floating-point number will often print as itself (in decimal) even if it was substituted in binary. For example, if the expression 0.1 appears as a literal in the code, and is substituted by

            -27      13421773 
13421773 * 2    = -------------
                    134217728 
. this substitute number will be printed as "0.1" in decimal even though it is slightly larger than 0.1 As mentioned above, Python does not use parsimonious printing, so it is often the case in Python that a literal floating-point number will not print as itself. For example, typing .4 at the Python prompt will return "0.40000000000000002" The round-trip property is nice to have, and it is nice that floating-point fractions don't seem to change, but it is important to know this fact: The printed decimal representation of a floating-point number may be slightly larger or smaller than the internal binary representation. -------- Arithmetic Simple arithmetic (add, subtract, multiply, or divide) on floating-point numbers works like this: 1. Perform the operation exactly. Treat the inputs as exact rationals and calculate the exact rational result. 2a. If the exact rational result can be represented, use it. 2b. If the exact rational result cannot be represented, use a nearby exact rational that can be represented. (round) This is important, and you may find this surprising: Floating-point operations will produce exact results without rounding if the result can be represented. There are a number of interesting cases where we can be assured that exact arithmetic will be used. One is with addition, multiplication,and subtraction of integers within the range of the significand. For single-precision floating point, this is the range [-16777215,16777215]. For double-precision, [-9007199254740991,9007199254740991]. Even integer division (modulo and remainder) will produce exact results if the inputs are integers in the correct range. Multiplication or division by powers of two will also produce exact results in any range (provided they don't overflow or underflow). In many cases floating-point arithmetic will be exact and the only source of error will be the substitution when converting the input from decimal to binary and how the result is printed. However, in most cases other than integer arithmetic, the result of a computation will not be representable in the floating point format. In these case, rule 2b says that the floating-point system may substitute a nearby representable number. This number, like all floating-point numbers, is an exact rational, but it is wrong. It will differ from the correct answer by a tiny amount. For example, in single precision adding .1 to 0.375:

             -27     13421773 
13421773 * 2    = ------------- ~= 0.1
                    134217728 

             -25     12582912
12582912 * 2   = ------------- = 0.375
                     33554432
. gives us a result that is slightly smaller than the true answer of 19/40

            -25     15938355       19
15938355 * 2   = ------------- < ---
                    33554432       40

-------- Further topics 1. Denormalized numbers and gradual underflow. These give you a `soft landing' when you use numbers really close to 0. 2. Infinities 3. Negative 0.0 4. NaN's 5. Traps and flags -------- Floating point myths Myth: Floating-point numbers are an approximation. Fact: Floating-point numbers are exact, rational numbers. Myth: Floating-point math is always only an approximation, and all results are off by a small amount. Fact: Many floating-point operations produce exact, correct answers. Myth: Floating-point math is mysterious. Fact: It is the same old rational arithmetic you've used for ages. -------- Bibliography @article{ goldberg91what, author = "David Goldberg", title = "What Every Computer Scientist Should Know About Floating-Point Arithmetic", journal = "ACM Computing Surveys", volume = "23", number = "1", pages = "5--48", year = "1991", url = "citeseer.ist.psu.edu/goldberg91what.html" } All the papers at http://www.cs.berkeley.edu/~wkahan/ Joe Darcy's talk at http://blogs.sun.com/darcy/resource/Wecpskafpa-ACCU.pdf
Post a Comment