In a recent
blog post,
Guido van Rossum gave his rationale for not wanting
tail recursion in
Python. From the nature of his arguments, it seems
he has several misconceptions about tail recursion. I have to admit
spending a bit of time thinking up snarky remarks, but I found that
some friends of mine — programmers that I respect a great deal
— shared some of these misconceptions. I already know I'm not
going to change van Rossum's mind, and I'm preaching to the choir
for most of the people who read this, but maybe I can clear up these
misconceptions for one or two people.
Will Clinger's paper ‘Proper Tail Recursion and Space Efficiency’
(Proceedings of the 1998 ACM Conference on Programming Language
Design and Implementation, June 1998, pages 174-185) is the best
reference I've found on the topic.
First of all, let's be clear about what ‘proper tail
recursion’ is. Clinger defines it this way:
The essence of proper tail recursion is that a procedure can return by
performing a tail call to any procedure, including itself.
It is important to recognize that we're not just talking about loops
written in tail-recursive style, but
any procedure that (tail) calls
another in order to compute the return value.
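To make the distinction concrete, here is a minimal sketch (bar
stands in for any procedure at all):
;; A tail call: the call to BAR is the last thing done, so its value
;; is handed straight back to our caller.
(define (tail-call-example x)
  (bar (+ x 1)))

;; Not a tail call: there is still an addition to do after BAR
;; returns, so the frame must survive the call.
(define (non-tail-call-example x)
  (+ 1 (bar x)))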
It is a common misconception that tail recursion is primarily a
syntactic sleight-of-hand that allows a programmer to use a function
call rather than a traditional iteration construct. A Scheme
programmer might write this:
(define (factorial x)
  (define (fact-iter n answer)
    (if (> n x)
        answer
        (fact-iter (+ n 1) (* answer n))))
  (fact-iter 1 1))
Where a more traditional formulation is this:
int factorial (int x)
{
    int answer = 1;
    for (int n = 1; n <= x; n++)
        answer *= n;
    return answer;
}
The Scheme programmer relies on the compiler to recognize that the
recursive call to fact-iter is intended to be a loop. He expects that
the code will compile to something very close to the traditional
formulation. The MIT-Scheme compiler generates this code:
;; Note: some instructions have been omitted for clarity
;; (tagging, interrupt polling, etc.)
fact-iter-3:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (assign (register #x12) (offset (register 4) (machine-constant 2)))
(mov w (r 1) (@ro b 4 8))
;; (fixnum-pred-2-args greater-than-fixnum? (register #x11) (register #x13))
(cmp w (r 0) (r 1))
(jg (@pcr label-6))
;; (assign (register #x15) (offset (register 4) (machine-constant 1)))
(mov w (r 1) (@ro b 4 4))
;; (assign (register #x19) (fixnum-2-args multiply-fixnum (register #x16) (register #x11) #f))
(imul w (r 1) (r 0))
;; (assign (offset (register 4) (machine-constant 1)) (register #x14))
(mov w (@ro b 4 4) (r 1))
;; (assign (register #x1d) (fixnum-1-arg one-plus-fixnum (register #x11) #f))
(add w (r 0) (& #x5a))
;; (assign (offset (register 4) (machine-constant 0)) (register #x1a))
(mov w (@r 4) (r 0))
;; (invocation:jump 2 #f fact-iter-3)
(jmp (@pcr fact-iter-3))
As you can see, the compiler dutifully turned it into a loop.
Not everyone is aware that this works across procedure boundaries.
(define (even? x)
  (if (= x 0)
      #t
      (odd? (- x 1))))

(define (odd? x)
  (even? (- x 1)))
Yeah, I blew it here by forgetting to test for
a base case. But the conditional wasn't the interesting part anyway.
;; Note: some instructions have been omitted for clarity
;;
even?-2:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (eq-test (register #x10) (constant 0))
(cmp w (r 0) (&u #x68000000))
(je (@pcr label-4))
;; (lap-opt fixnum-add-const-in-place)
;; (assign (register #x14) (object->fixnum (register #x10)))
;; (assign (register #x15) (fixnum-1-arg minus-one-plus-fixnum (register #x14) #f))
;; (assign (register #x12) (fixnum->object (register #x15)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x12))
(mov w (@r 4) (r 0))
;; (invocation:uuo-link 2 #f odd?)
(jmp odd?-1)
label-4:
;; (assign (offset (register 6) (machine-constant 2)) (constant #t))
(mov w (@ro b 6 8) (&u #x20000000))
;; (pop-return)
(return)
odd?-1:
;; (assign (register #x11) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (lap-opt fixnum-add-const-in-place)
;; (assign (register #x12) (object->fixnum (register #x11)))
;; (assign (register #x13) (fixnum-1-arg minus-one-plus-fixnum (register #x12) #f))
;; (assign (register #x10) (fixnum->object (register #x13)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x10))
(mov w (@r 4) (r 0))
;; (invocation:uuo-link 2 #f even?)
(jmp even?-2)
And people have pointed out that this may be surprising and baffling
to someone who expects to see the history of the computation laid out
on the stack.
Scheme programmers make a big deal out of this. They point out that
with proper tail recursion, Scheme doesn't need looping constructs
like while or for or do because the user can write them himself. But
Scheme hackers are a fringe group of the archetypical fringe group:
Lisp hackers. They enjoy removing features from the language that
they consider ‘unnecessary’. Then they write papers that somehow
‘prove’ that this is better. Of course these academic papers contain
an awful lot of math and not very many actual programs.
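For what it's worth, here is the sort of thing they mean; a minimal
sketch (the names are mine) of a while loop built out of nothing but
a tail call:
(define (loop-while test body)
  ;; Run BODY as long as the thunk TEST returns true.  The self-call
  ;; is in tail position, so a proper implementation turns it into a
  ;; jump rather than a growing stack of frames.
  (if (test)
      (begin
        (body)
        (loop-while test body))
      'done))

;; Example: count down from 10.
(define (countdown n)
  (loop-while (lambda () (> n 0))
              (lambda ()
                (display n)
                (newline)
                (set! n (- n 1)))))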
In fact, Scheme programmers will try to sneak tail recursion into
other programming languages. There's hardly a point to doing so
because other languages have loops already. But now you get programs
written by wet-behind-the-ears neophytes that use tail recursion
rather than something more obvious. Furthermore, you cannot debug
this code because the tail recursion has erased the backtrace.
And what does this buy you? In theory, when you exit a procedure
with a tail-recursive call, you can replace the call/return
instruction pair with a jump. In theory, that's not much of an
improvement. In practice, however, it isn't quite that simple. The
stack usually has to be adjusted one way or the other, and this
completely wipes out any benefit to be seen from avoiding the push
and pop of the return address.
To summarize, this point of view about tail recursion is based on
these ideas:
- Its purpose is to avoid writing looping constructs, which are
somehow considered ‘bad’ by ‘purists’. These weirdos ‘think’ in
loops, then transform the code to be ‘recursive with tail calls’
when they write it, and then expect the compiler to transform it
back. This is academic mental gymnastics with no practical purpose.
- People who like tail recursion are purposefully burdening their
code in the name of ‘mathematical’ or ‘theoretical’ purity, but
they expect the language implementor to bear the weight.
- Any code that depends on tail recursion can easily be rewritten
to use a looping construct instead. Such a rewrite is a trivial
local transformation.
- Inter-procedural tail recursion may have some performance
benefit, but such benefit is very small at best, and the
tradeoff is the loss of valuable debugging information.
van Rossum's blog post almost screams each of these points.
If these were in fact the key issues to be argued, then there would
be little more to say beyond de gustibus non est disputandum (there
is no disputing taste). But these arguments are a sideshow. The true
argument about tail recursion is that it reduces the space complexity
class of a program and allows a program access to a more powerful
computation model. This argument is subtle.
Let me digress for a moment. Remember the Bubble Sort algorithm? If
I were to write some code that used Bubble Sort to maintain the order
of elements in a set, I would be lucky if I only got laughed at. The
Jargon File calls Bubble Sort ‘the generic bad algorithm’.
What makes Bubble Sort so bad? It is extremely inefficient. On the
average, Bubble Sort performs O(n^2) operations to sort a set of n
elements. The Big-O notation gives the asymptotic time complexity of
the algorithm: O(n^2) means that when the value of n is large enough,
the number of operations grows in proportion to the square of the
number of elements. So if we were to Bubble Sort a set of one
thousand objects, we might expect the number of operations to be in
the millions. If our set grew to ten thousand objects, the amount of
time to sort it would grow a hundredfold.
Let's compare this to mergesort. Mergesort has an asymptotic time
complexity of O(n log n). A mergesort of one thousand objects would
take something like 1000 * 10 operations (taking the log base 2).
This is much less than the millions required for a Bubble Sort of the
same set. If we grew the set to ten thousand items, the amount of
time to sort it would only grow by a factor of a little more than
ten, certainly far less than the hundredfold growth we see in Bubble
Sort.
Bubble Sort is bad because it wastes a valuable resource: time. And when you let the
problem grow larger, it wastes a disproportionately larger amount of
time compared to other, more efficient algorithms.
When you determine a strategy to solve a problem, you take a close
look at the complexity classes of your algorithms. You'd like a
solution within a reasonable amount of time, and you'd like to avoid
using an unreasonable amount of space (whether memory or disk).
Picking an inefficient algorithm for some minor part of
your program can have a major impact on the resource usage of your
program. Even the small change from O(n^2) to O(n log n) makes a
huge difference.
Asymptotic time complexity is not the only measure of an algorithm's
efficiency. Space (memory usage) can be as much of a limiting factor
as time, so asymptotic space complexity should be taken into account
as well. Consider the diff program that compares two files. The naive
implementation of diff using dynamic programming requires a
rectangular tableau whose size is the product of the lengths of the
two files being compared. Diffing a pair of fairly large files in
this way will exhaust memory. A tradeoff can be made where multiple
passes are needed to complete the diff, but the space complexity,
rather than being O(n*m), is O(n + m). It takes longer, but now truly
enormous files can be diffed without running out of memory.
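To see what such a tradeoff looks like, here is a minimal sketch of
the core idea (the names are mine; it computes only the length of the
longest common subsequence of two vectors, which is the heart of
diff, and it keeps just two rows of the tableau alive at any time):
(define (lcs-length a b)
  ;; Length of the longest common subsequence of vectors A and B.
  ;; Only two rows of the dynamic-programming tableau are live at a
  ;; time, so the space is proportional to M rather than N*M.
  (let ((n (vector-length a))
        (m (vector-length b)))
    (let row-loop ((i 1) (prev (make-vector (+ m 1) 0)))
      (if (> i n)
          (vector-ref prev m)
          (let ((curr (make-vector (+ m 1) 0)))
            (let col-loop ((j 1))
              (if (> j m)
                  (row-loop (+ i 1) curr) ; tail call; the old row becomes garbage
                  (begin
                    (vector-set!
                     curr j
                     (if (equal? (vector-ref a (- i 1)) (vector-ref b (- j 1)))
                         (+ 1 (vector-ref prev (- j 1)))
                         (max (vector-ref prev j) (vector-ref curr (- j 1)))))
                    (col-loop (+ j 1))))))))))
Recovering the actual edit script, rather than just its length, in
linear space is what costs the extra passes mentioned above.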
So what does this have to do with tail recursion?
A program describes a process. An implementation takes a program and
performs the process the program describes. The process that emerges
has a space and time complexity that is characteristic of
both the implementation and the program.
Programmers usually expect to be able to ignore the implementation's
contribution to the time and space complexity of a problem. A
programmer would be surprised if he had coded up a mergesort with the
expected O(n log n) asymptotic time complexity but found that the
actual asymptotic time complexity on his computer was O(n^2). He
would be understandably upset if an obviously linear text search
required O(n^2) asymptotic space complexity in the size of the
corpus.
To quote Clinger:
Proper tail recursion yields a well-defined asymptotic space
complexity that can be relied upon by portable programs.
and
Among these [space consumption] properties of an implementation,
proper tail recursion is especially important because it is easy to
construct programs that should run in small constant space, but are
likely to require at least linear space in implementations that are
not properly tail recursive.
That is to say, if your implementation is not properly tail
recursive, you will often find that a program that ought to run in
O(1) space is actually taking O(n) space, or even worse.
How could this be?
Consider this very simple program:
(define (foo x)
  (baz (bar x)))
In a non tail recursive implementation, the call to baz would return
to foo, which would then return to foo's caller. In a tail recursive
implementation, foo would directly jump to baz and baz would return
directly to foo's caller. The difference is that foo releases its
resources before transferring to baz in the tail recursive case, and
after baz's computation in the non tail recursive case. There are two
resources of interest. The obvious one is the stack frame itself. The
less obvious one is the reference to the value of X.
This is very important. Code like the above is typical when one is
performing a computation that involves part of a data structure. The
call to bar traverses X to find the relevant substructure to operate
upon. Once this substructure is located, it is no longer necessary to
retain the enclosing structure X. The tail recursive implementation
immediately discards the reference to X and allows the garbage
collector to reclaim the resources. The non tail recursive
implementation unnecessarily retains X during the computation of
baz.
This is not simply an inconvenience. If X holds substantial
resources, we may run out during baz. If baz somehow depends on a
recursive call to foo, we'll find our space complexity is not a
function of the relevant information sitting in the bar portion of X,
but a function of X itself, the container of our information. If we
write a program that processes text sentence by sentence, we'd be
surprised to find that the implementation uses space proportional to
the entire document!
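In code, the sentence-by-sentence loop might look something like this
(a sketch; read-sentence and handle-sentence are assumed helpers):
(define (process-document port)
  ;; Read and handle one sentence at a time.  The recursive call is a
  ;; tail call, so by the time the next sentence is read nothing
  ;; refers to the previous one and it can be reclaimed; the live data
  ;; is bounded by the largest sentence.  Without proper tail calls,
  ;; each pending frame hangs on to its sentence and the space grows
  ;; with the document.
  (let ((sentence (read-sentence port)))
    (if (eof-object? sentence)
        'done
        (begin
          (handle-sentence sentence)
          (process-document port)))))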
If one objects to Bubble Sort on the grounds of its poor asymptotic
time complexity compared to better algorithms, then one clearly should
object to a lack of tail recursion. It has even worse asymptotic
space complexity, and it acts upon the entire program. (At least
Bubble Sort doesn't slow down your program if you don't sort anything.)
Part II on its way....
(By the way, if you have questions or need some clarification on what
I've said so far, please comment or email me.)