Saturday, April 25, 2009

You knew I'd say something.

In a recent blog post, Guido van Rossum gave his rationale for not wanting tail recursion in Python. From the nature of his arguments, it seems he has several misconceptions about tail recursion. I have to admit spending a bit of time thinking up snarky remarks, but I found that some friends of mine — programmers that I respect a great deal — shared some of these misconceptions. I already know I'm not going to change van Rossum's mind, and I'm preaching to the choir for most of the people who read this, but maybe I can clear up these misconceptions for one or two people.

Will Clinger's paper Proper tail recursion and space efficiency in the Proceedings of the 1998 ACM Conference on Programming Language Design and Implementation, June 1998, pages 174-185, is the best reference I've found on the topic.

First of all, let's be clear about what ‘proper tail recursion’ is. Clinger defines it this way:
The essence of proper tail recursion is that a procedure can return by performing a tail call to any procedure, incuding itself.
It is important to recognize that we're not just talking about loops written in tail-recursive style, but any procedure that (tail) calls another in order to compute the return value.

It is a common misconception that tail recursion is primarily a syntactic sleight-of-hand that allows a programmer to use a function call rather than a traditional iteration construct. A Scheme programmer might write this:
(define (factorial x)
  (define (fact-iter n answer)
    (if (> n x)
        (fact-iter (+ n 1) (* answer n))))
  (fact-iter 1 1))

Where a more traditional formulation is this:
int factorial (int x)
    int answer = 1;
    for (int n = 1; n <= x; n++)
      answer *= n;
    return answer;

The Scheme programmer relies on the compiler to recognize that the recursive call to fact-iter is intended to be a loop. He expects that the code will compile to something very close to the traditional formulation. The MIT-Scheme compiler generates this code:
;; Note: some instructions have been omitted for clarity
;; (tagging, interrupt polling, etc.)
        ;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (assign (register #x12) (offset (register 4) (machine-constant 2)))
        (mov w (r 1) (@ro b 4 8))
        ;; (fixnum-pred-2-args greater-than-fixnum? (register #x11) (register #x13))
        (cmp w (r 0) (r 1))
        (jg (@pcr label-6))

        ;; (assign (register #x15) (offset (register 4) (machine-constant 1)))
        (mov w (r 1) (@ro b 4 4))

        ;; (assign (register #x19) (fixnum-2-args multiply-fixnum (register #x16) (register #x11) #f))
        (imul w (r 1) (r 0))

        ;; (assign (offset (register 4) (machine-constant 1)) (register #x14))
        (mov w (@ro b 4 4) (r 1))
        ;; (assign (register #x1d) (fixnum-1-arg one-plus-fixnum (register #x11) #f))
        (add w (r 0) (& #x5a))

        ;; (assign (offset (register 4) (machine-constant 0)) (register #x1a))
        (mov w (@r 4) (r 0))
        ;; (invocation:jump 2 #f fact-iter-3)
        (jmp (@pcr fact-iter-3))

As you can see, the compiler dutifully turned it into a loop.

Not everyone is aware that this works across procedure boundaries.
(define (even? x)
  (if (= x 0)
      (odd? (- x 1))))

(define (odd? x)
  (even? (- x 1)))
Yeah, I blew it here by forgetting to test for
a base case.  But the conditional wasn't the interesting part anyway.

;; Note: some instructions have been omitted for clarity
        ;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (eq-test (register #x10) (constant 0))
        (cmp w (r 0) (&u #x68000000))
        (je (@pcr label-4))
        ;; (lap-opt fixnum-add-const-in-place)
        ;; (assign (register #x14) (object->fixnum (register #x10)))
        ;; (assign (register #x15) (fixnum-1-arg minus-one-plus-fixnum (register #x14) #f))
        ;; (assign (register #x12) (fixnum->object (register #x15)))
        (add w (r 0) (& #x3ffffff))
        (and w (r 0) (& #x6bffffff))
        ;; (assign (offset (register 4) (machine-constant 0)) (register #x12))
        (mov w (@r 4) (r 0))
        ;; (invocation:uuo-link 2 #f odd?)
        (jmp odd?-1)
        ;; (assign (offset (register 6) (machine-constant 2)) (constant #t))
        (mov w (@ro b 6 8) (&u #x20000000))
        ;; (pop-return)

        ;; (assign (register #x11) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (lap-opt fixnum-add-const-in-place)
        ;; (assign (register #x12) (object->fixnum (register #x11)))
        ;; (assign (register #x13) (fixnum-1-arg minus-one-plus-fixnum (register #x12) #f))
        ;; (assign (register #x10) (fixnum->object (register #x13)))
        (add w (r 0) (& #x3ffffff))
        (and w (r 0) (& #x6bffffff))
        ;; (assign (offset (register 4) (machine-constant 0)) (register #x10))
        (mov w (@r 4) (r 0))
        ;; (invocation:uuo-link 2 #f even?)
        (jmp even?-2)
And people have pointed out that this may be surprising and baffling to someone who expects to see the history of the computation laid out on the stack.

Scheme programmers make a big deal out of this. They point out that with proper tail recursion, Scheme doesn't need looping constructs like while or for or do because the user can write them himself. But Scheme hackers are a fringe group of the archetypical fringe group: Lisp hackers. They enjoy removing features from the language that they consider ‘unnecessary’. Then they write papers that somehow ‘prove’ that this is better. Of course these academic papers contain an awful lot of math and not very many actual programs.

In fact, Scheme programmers will try to sneak tail recusion into other programming languages. There's hardly a point to doing so because other languages have loops already. But now you get programs written by wet-behind-the-ears neophytes that use tail recursion rather than something more obvious. Furthemore, you cannot debug this code because the tail recursion has erased the backtrace.

And what does this buy you? In theory, when you exit a procedure with a tail-recursive call, you can replace the call/return instruction pair with a jump. In theory, that's not much of an improvement. In practice, however, it isn't quite that simple. The stack usually has to be adjusted one way or the other and this completely wipes out any benefit to be seen from avoiding the push and pop of the return address.

To summarize, this point of view about tail recursion is based on these ideas:
  • Its purpose is to avoid writing looping constructs, which are somehow considered ‘bad’ by ‘purists’. These weirdos ‘think’ in loops, then transform the code to be ‘recursive with tail calls’ when they write it, and then expect the compiler transform it back. This is academic mental gymnastics with no practical purpose.
  • People who like tail recursion are purposefully burdening their code in the name of ‘mathematical’ or ‘theoretical’ purity, but they expect the language implementor to bear the weight.
  • Any code that depends on tail recursion can easily be rewritten to use a looping construct instead. Such a rewrite is a trivial local transformation.
  • Inter-procedural tail recursion may have some performance benefit, but such benefit is very small at best, and the tradeoff is the loss of valuable debugging information.
van Rossum's blog post almost screams each of these points.

If these were in fact the key issues to be argued, then there would be little more to say beyond de gustibus non est disputandum. But these arguments are a sideshow. The true argument about tail recursion is that it reduces the space complexity class of a program and allows a program access to a more powerful computation model. This argument is subtle.

Let me digress for a moment. Remember the Bubble Sort algorithm? If I were to write some code that used Bubble Sort to maintain the order of elements in a set, I would be lucky if I only got laughed at. The Hacker's Jargon file calls Bubble Sort `the generic bad algorithm'.

What makes Bubble Sort so bad? It is extremely inefficient. On the average, Bubble Sort performs O(n^2) operations to sort a set of n elements. The Big-O notation is the asymptotic time complexity of the algorithm. O(n^2) means that when the value of n is large enough, the number of operations grows proportional to the square of the number of elements. So if we were to Bubble Sort a set of one thousand objects, we might expect the number of operations to be in the millions. If our set grew to ten thousand objects, the amount of time to sort it would grow by a hundredfold.

Let's compare this to mergesort. Mergesort has an asymptotic time complexity of O(n log n). A mergesort of one thousand objects would take something like 1000 * 4 operations. This is much less than the millions required for a Bubble Sort of the same set. If we grew the set to ten thousand items, the amount of time to sort it would only grow by a factor of a little more than ten. Certainly far less than the hundredfold growth we see in Bubble Sort.

Bubble Sort is bad because it wastes a valuable resource: time. And when you let the problem grow larger, it wastes a disproportionately larger amount of time compared to other, more efficient algorithms.

When you determine a strategy to solve a problem, you take a close look at the complexity classes of your algorithms. You'd like a solution within a reasonable amount of time, and you'd like to avoid using an unreasonable amount of space (whether memory or disk). Picking an inefficient algorithm for some minor part of your program can have a major impact on the resource usage of your program. Even the small change from O(n^2) to O(n log n) makes a huge difference.

Asymptotic time complexity is not the only measure of an algorithms efficiency. Space (memory usage) can be as much of a limiting factor as time, so asymptotic space complexity should be considered. Consider the diff program that compares two files. The naive implementation of diff using dynamic programming requires a rectangular tableau with a size that is the product of the length of the two files being compared. Diffing a pair of fairly large files in this way will exhaust memory. A tradeoff can be made where multiple passes are needed to complete the diff, but the space complexity, rather than being O(n*m) is O(n + m). It takes longer, but now truly enormous files can be diffed without running out of memory.

So what does this have to do with tail recursion?

A program describes a process. An implementation takes a program and performs the process the program describes. The process that emerges has a space and time complexity that is characteristic of both the implementation and the program.

Programmers usually expect to be able to ignore the implementation's contribution to the time and space complexity of a problem. A programmer would be surprised if he had coded up a mergesort with the expected O(n log n) asymptotic time complexity but found that the actual asymptotic time complexity on his computer was O(n^2). He would be understandably upset if an obviously linear text search required O(n^2) asymptotic space complexity in the size of the corpus.

To quote Clinger:
Proper tail recursion yields a well-defined asymptotic space complexity that can be relied upon by portable programs.
Among these [space consumption] properties of an implementation, proper tail recursion is especially important because it is easy to construct programs that should run in small constant space, but are likely to require at least linear space in implementations that are not properly tail recursive.

That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in O(1) space is actually taking O(n) space or even worse.

How could this be?

Consider this very simple program:
(define (foo x)
  (baz (bar x)))
In a non tail recursive implementation, the call to baz would return back to foo, which would then return to foo's caller. In a tail recursive implementation, foo would directly jump to baz and baz would return directly to foo's caller. The difference is that foo releases its resources before transferring to baz in the tail recursive case, and after baz's computation in the non tail recursive case. There are two resources of interest. The obvious one is the stack frame itself. The less obvious one is the reference to the value of X.

This is a very important. Code like the above is typical where one is performing a computation that involves part of a data structure. The call to bar traverses X to find the relevant sub-structure to operate upon. Once this substructure is located, it is no longer necessary to retain the enclosing structure X. The tail recursive implementation immediately discards the reference to X and allows the garbage collector to reclaim the resources. The non tail recursive implementation unnecessarily retains X during the computation of baz. This is not simply an inconvenience. If X holds substantial resources, we may run out during baz. If baz somehow depends on a recursive call to foo, we'll find our space complexity is not a function of the relevant information sitting in the bar portion of X, but a function of X itself, the container of our information. If we write a program that processes text sentence by sentence, we'd be surprised to find that the implementation uses space proportional to the entire document!

If one objects to Bubble Sort on the grounds of its poor asymptotic time complexity compared to better algorithms, then one clearly should object to a lack of tail recursion. It has even worse asymptotic space complexity, and it acts upon the entire program. (At least Bubble Sort doesn't slow down your program if you don't sort anything.)

Part II on its way....

(By the way, if you have questions or need some clarification on what I've said so far, please comment or email me.)