Friday, August 27, 2021

A Floating-point Problem

Here's a 2x2 matrix:

[64919121   -159018721]
[41869520.5 -102558961]
We can multiply it by a 2 element vector like this:
(defun mxv (a b
            c d

            x
            y

            receiver)
  (funcall receiver
           (+ (* a x) (* b y))
           (+ (* c x) (* d y))))

* (mxv 64919121     -159018721
       41869520.5d0 -102558961
 
       3
       1

       #'list)

(35738642 2.30496005d7)
Given a matrix and a result, we want to find the 2 element vector that produces that result. To do this, we compute the inverse of the matrix:
(defun m-inverse (a b
                  c d

                  receiver)
  (let ((det (- (* a d) (* b c))))
    (funcall receiver
             (/ d det) (/ (- b) det)
             (/ (- c) det) (/ a det))))
and multiply the inverse matrix by the result:
(defun solve (a b
              c d

              x
              y

              receiver)
  (m-inverse a b
             c d
             (lambda (ia ib
                      ic id)
               (mxv ia ib
                    ic id

                    x
                    y
                    receiver))))
So we can try this on our matrix
* (solve 64919121     -159018721
         41869520.5d0 -102558961

         1
         0
         #'list)

(1.02558961d8 4.18695205d7)
and we get the wrong answer.

What's the right answer?

* (solve 64919121         -159018721
         (+ 41869520 1/2) -102558961

         1
         0
         #'list)

(205117922 83739041)
If we use double precision floating point, we get the wrong answer by a considerable margin.

I'm used to floating point calculations being off a little in the least significant digits, and I've seen how the errors can accumulate in an iterative calculation, but here we've lost all the significant digits in a straightforward non-iterative calculation. Here's what happened: The determinant of our matrix is computed by subtracting the product of the two diagonals. One diagonal is (* 64919121 -102558961) = -6658037598793281, where the other diagonal is (* (+ 41869520 1/2) -159018721) = -6658037598793280.5 This second diagonal product cannot be represented in double precision floating point, so it is rounded down to -6658037598793280. This is where the error is introduced. An error of .5 in a quantity of -6658037598793281 is small indeed, but we amplify this error when we subtract out the other diagonal. We still have an absolute error of .5, but now it occurs within a quantity of 1, which makes it relatively huge. This is called “catastrophic cancellation” because the subtraction “cancelled” all the significant digits (the “catastrophe” is presumably the amplification of the error).

I don't care for the term “catastrophic cancellation” because it places the blame on the operation of subtraction. But the subtraction did nothing wrong. The difference betweeen -6658037598793280 and -6658037598793281 is 1 and that is the result we got. It was the rounding in the prior step that introduced an incorrect value into the calculation. The subtraction just exposed this and made it obvious.

One could be cynical and reject floating point operations as being too unreliable. When we used exact rationals, we got the exactly correct result. But rational numbers are much slower than floating point and they have a tendancy to occupy larger and larger amounts of memory as the computation continues. Floating point is fast and efficient, but you have to be careful when you use it.

2 comments:

Jason said...

Take an eigenvalue decomposition of the original 2x2 matrix. You will see that the smallest eigenvalue is well below the ~16 digits of relative precision of IEEE double floating point. Effectively singular in that representation so the result is as expected.

Keshava Prasad Halemane said...

If you check back whether your answer obtained through rational arithmetic does indeed solve the problem, it doesn't - multiplying the coefficient matrix with the solution vector gives a product which is [0 0] and not [1 0] as required in the problem given by the rhs vector.