## Saturday, April 25, 2009

### You knew I'd say something.

In a recent blog post, Guido van Rossum gave his rationale for not wanting tail recursion in Python. From the nature of his arguments, it seems he has several misconceptions about tail recursion. I have to admit spending a bit of time thinking up snarky remarks, but I found that some friends of mine — programmers that I respect a great deal — shared some of these misconceptions. I already know I'm not going to change van Rossum's mind, and I'm preaching to the choir for most of the people who read this, but maybe I can clear up these misconceptions for one or two people.

Will Clinger's paper Proper tail recursion and space efficiency in the Proceedings of the 1998 ACM Conference on Programming Language Design and Implementation, June 1998, pages 174-185, is the best reference I've found on the topic.

First of all, let's be clear about what ‘proper tail recursion’ is. Clinger defines it this way:
The essence of proper tail recursion is that a procedure can return by performing a tail call to any procedure, incuding itself.
It is important to recognize that we're not just talking about loops written in tail-recursive style, but any procedure that (tail) calls another in order to compute the return value.

It is a common misconception that tail recursion is primarily a syntactic sleight-of-hand that allows a programmer to use a function call rather than a traditional iteration construct. A Scheme programmer might write this:
```(define (factorial x)
(if (> n x)
(fact-iter (+ n 1) (* answer n))))
(fact-iter 1 1))
```

Where a more traditional formulation is this:
```int factorial (int x)
{
for (int n = 1; n <= x; n++)
}
```

The Scheme programmer relies on the compiler to recognize that the recursive call to fact-iter is intended to be a loop. He expects that the code will compile to something very close to the traditional formulation. The MIT-Scheme compiler generates this code:
```;; Note: some instructions have been omitted for clarity
;; (tagging, interrupt polling, etc.)
fact-iter-3:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (assign (register #x12) (offset (register 4) (machine-constant 2)))
(mov w (r 1) (@ro b 4 8))
;; (fixnum-pred-2-args greater-than-fixnum? (register #x11) (register #x13))
(cmp w (r 0) (r 1))
(jg (@pcr label-6))

;; (assign (register #x15) (offset (register 4) (machine-constant 1)))
(mov w (r 1) (@ro b 4 4))

;; (assign (register #x19) (fixnum-2-args multiply-fixnum (register #x16) (register #x11) #f))
(imul w (r 1) (r 0))

;; (assign (offset (register 4) (machine-constant 1)) (register #x14))
(mov w (@ro b 4 4) (r 1))
;; (assign (register #x1d) (fixnum-1-arg one-plus-fixnum (register #x11) #f))
(add w (r 0) (& #x5a))

;; (assign (offset (register 4) (machine-constant 0)) (register #x1a))
(mov w (@r 4) (r 0))
;; (invocation:jump 2 #f fact-iter-3)
(jmp (@pcr fact-iter-3))
```

As you can see, the compiler dutifully turned it into a loop.

Not everyone is aware that this works across procedure boundaries.
```(define (even? x)
(if (= x 0)
#t
(odd? (- x 1))))

(define (odd? x)
(even? (- x 1)))
Yeah, I blew it here by forgetting to test for
a base case.  But the conditional wasn't the interesting part anyway.

;; Note: some instructions have been omitted for clarity
;;
even?-2:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (eq-test (register #x10) (constant 0))
(cmp w (r 0) (&u #x68000000))
(je (@pcr label-4))
;; (assign (register #x14) (object->fixnum (register #x10)))
;; (assign (register #x15) (fixnum-1-arg minus-one-plus-fixnum (register #x14) #f))
;; (assign (register #x12) (fixnum->object (register #x15)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x12))
(mov w (@r 4) (r 0))
(jmp odd?-1)
label-4:
;; (assign (offset (register 6) (machine-constant 2)) (constant #t))
(mov w (@ro b 6 8) (&u #x20000000))
;; (pop-return)
(return)

odd?-1:
;; (assign (register #x11) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (assign (register #x12) (object->fixnum (register #x11)))
;; (assign (register #x13) (fixnum-1-arg minus-one-plus-fixnum (register #x12) #f))
;; (assign (register #x10) (fixnum->object (register #x13)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x10))
(mov w (@r 4) (r 0))
(jmp even?-2)
```
And people have pointed out that this may be surprising and baffling to someone who expects to see the history of the computation laid out on the stack.

Scheme programmers make a big deal out of this. They point out that with proper tail recursion, Scheme doesn't need looping constructs like `while` or `for` or `do` because the user can write them himself. But Scheme hackers are a fringe group of the archetypical fringe group: Lisp hackers. They enjoy removing features from the language that they consider ‘unnecessary’. Then they write papers that somehow ‘prove’ that this is better. Of course these academic papers contain an awful lot of math and not very many actual programs.

In fact, Scheme programmers will try to sneak tail recusion into other programming languages. There's hardly a point to doing so because other languages have loops already. But now you get programs written by wet-behind-the-ears neophytes that use tail recursion rather than something more obvious. Furthemore, you cannot debug this code because the tail recursion has erased the backtrace.

And what does this buy you? In theory, when you exit a procedure with a tail-recursive call, you can replace the `call/return` instruction pair with a `jump`. In theory, that's not much of an improvement. In practice, however, it isn't quite that simple. The stack usually has to be adjusted one way or the other and this completely wipes out any benefit to be seen from avoiding the `push` and `pop` of the return address.

To summarize, this point of view about tail recursion is based on these ideas:
• Its purpose is to avoid writing looping constructs, which are somehow considered ‘bad’ by ‘purists’. These weirdos ‘think’ in loops, then transform the code to be ‘recursive with tail calls’ when they write it, and then expect the compiler transform it back. This is academic mental gymnastics with no practical purpose.
• People who like tail recursion are purposefully burdening their code in the name of ‘mathematical’ or ‘theoretical’ purity, but they expect the language implementor to bear the weight.
• Any code that depends on tail recursion can easily be rewritten to use a looping construct instead. Such a rewrite is a trivial local transformation.
• Inter-procedural tail recursion may have some performance benefit, but such benefit is very small at best, and the tradeoff is the loss of valuable debugging information.
van Rossum's blog post almost screams each of these points.

If these were in fact the key issues to be argued, then there would be little more to say beyond de gustibus non est disputandum. But these arguments are a sideshow. The true argument about tail recursion is that it reduces the space complexity class of a program and allows a program access to a more powerful computation model. This argument is subtle.

Let me digress for a moment. Remember the Bubble Sort algorithm? If I were to write some code that used Bubble Sort to maintain the order of elements in a set, I would be lucky if I only got laughed at. The Hacker's Jargon file calls Bubble Sort `the generic bad algorithm'.

What makes Bubble Sort so bad? It is extremely inefficient. On the average, Bubble Sort performs `O(n^2)` operations to sort a set of n elements. The Big-O notation is the asymptotic time complexity of the algorithm. `O(n^2)` means that when the value of `n` is large enough, the number of operations grows proportional to the square of the number of elements. So if we were to Bubble Sort a set of one thousand objects, we might expect the number of operations to be in the millions. If our set grew to ten thousand objects, the amount of time to sort it would grow by a hundredfold.

Let's compare this to mergesort. Mergesort has an asymptotic time complexity of `O(n log n)`. A mergesort of one thousand objects would take something like `1000 * 4` operations. This is much less than the millions required for a Bubble Sort of the same set. If we grew the set to ten thousand items, the amount of time to sort it would only grow by a factor of a little more than ten. Certainly far less than the hundredfold growth we see in Bubble Sort.

Bubble Sort is bad because it wastes a valuable resource: time. And when you let the problem grow larger, it wastes a disproportionately larger amount of time compared to other, more efficient algorithms.

When you determine a strategy to solve a problem, you take a close look at the complexity classes of your algorithms. You'd like a solution within a reasonable amount of time, and you'd like to avoid using an unreasonable amount of space (whether memory or disk). Picking an inefficient algorithm for some minor part of your program can have a major impact on the resource usage of your program. Even the small change from `O(n^2)` to `O(n log n)` makes a huge difference.

Asymptotic time complexity is not the only measure of an algorithms efficiency. Space (memory usage) can be as much of a limiting factor as time, so asymptotic space complexity should be considered. Consider the `diff` program that compares two files. The naive implementation of `diff` using dynamic programming requires a rectangular tableau with a size that is the product of the length of the two files being compared. `Diff`ing a pair of fairly large files in this way will exhaust memory. A tradeoff can be made where multiple passes are needed to complete the `diff`, but the space complexity, rather than being `O(n*m)` is `O(n + m)`. It takes longer, but now truly enormous files can be diffed without running out of memory.

So what does this have to do with tail recursion?

A program describes a process. An implementation takes a program and performs the process the program describes. The process that emerges has a space and time complexity that is characteristic of both the implementation and the program.

Programmers usually expect to be able to ignore the implementation's contribution to the time and space complexity of a problem. A programmer would be surprised if he had coded up a mergesort with the expected `O(n log n)` asymptotic time complexity but found that the actual asymptotic time complexity on his computer was `O(n^2)`. He would be understandably upset if an obviously linear text search required `O(n^2)` asymptotic space complexity in the size of the corpus.

To quote Clinger:
Proper tail recursion yields a well-defined asymptotic space complexity that can be relied upon by portable programs.
and
Among these [space consumption] properties of an implementation, proper tail recursion is especially important because it is easy to construct programs that should run in small constant space, but are likely to require at least linear space in implementations that are not properly tail recursive.

That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in `O(1)` space is actually taking `O(n)` space or even worse.

How could this be?

Consider this very simple program:
```(define (foo x)
(baz (bar x)))
```
In a non tail recursive implementation, the call to `baz` would return back to `foo`, which would then return to `foo`'s caller. In a tail recursive implementation, `foo` would directly jump to `baz` and `baz` would return directly to `foo`'s caller. The difference is that `foo` releases its resources before transferring to `baz` in the tail recursive case, and after `baz`'s computation in the non tail recursive case. There are two resources of interest. The obvious one is the stack frame itself. The less obvious one is the reference to the value of `X`.

This is a very important. Code like the above is typical where one is performing a computation that involves part of a data structure. The call to `bar` traverses `X` to find the relevant sub-structure to operate upon. Once this substructure is located, it is no longer necessary to retain the enclosing structure `X`. The tail recursive implementation immediately discards the reference to `X` and allows the garbage collector to reclaim the resources. The non tail recursive implementation unnecessarily retains `X` during the computation of `baz`. This is not simply an inconvenience. If `X` holds substantial resources, we may run out during `baz`. If `baz` somehow depends on a recursive call to `foo`, we'll find our space complexity is not a function of the relevant information sitting in the `bar` portion of `X`, but a function of `X` itself, the container of our information. If we write a program that processes text sentence by sentence, we'd be surprised to find that the implementation uses space proportional to the entire document!

If one objects to Bubble Sort on the grounds of its poor asymptotic time complexity compared to better algorithms, then one clearly should object to a lack of tail recursion. It has even worse asymptotic space complexity, and it acts upon the entire program. (At least Bubble Sort doesn't slow down your program if you don't sort anything.)

Part II on its way....

(By the way, if you have questions or need some clarification on what I've said so far, please comment or email me.)

Reid K said...

I think you've written a really good summary of the reasons Guido doesn't like tail recursion. It communicates his view even better than he does.

However, I feel like you could address your concern with memory usage in this article (I'm expecting other reasons part II) with a simpler optimization that doesn't sacrifice debugging. Just do variable liveness analysis and notice that the variable isn't used after the call, so you can drop its reference before doing the call. Presumably, when you pushed a copy of the reference onto the stack as an argument, so it won't get garbage collected before baz gets done using it.

Unknown said...

If I understand you right, you claim that proper tail recursion reduces the space complexity, but what you fail to say is *compared to recursion without proper tail calls*.

Kind of a nice trick of the semantics there, since you seem to be comparing proper tail recursion to simple loops, but you're really comparing it to poorly implemented functional languages without proper tail recursion. Simple loops also have the same space requirements as tail calls, so your stated benefits don't apply.

Given that the real comparison of interest is efficiency *and usability* of tail recursion vs. simple looping constructs, I believe GvR's statements win.

Finally, Appel proved that SSA is the same as CPS and that since any language can be translated to either, that means you can give a user any syntax that works for them and still give them any stated benefits you claim for functional programming languages.

Since that's the case, the real question isn't the efficiency of one language over another (they are all the same), it is actually whether one language is more usable than another.

And the problem of language usability isn't solved by O() notation.

Unknown said...

minor error in the factorial function:

(define (factorial x)
(if (> n x)
(fact-iter (+ n 1) (* answer n))))
(fact-iter __x__ 1))

... looking forward to part 2.

Joe Marshall said...

Reid: You can certainly delete the reference to X when you notice it is no longer ‘live’. In fact, when you call out to bar you can delete all the references foo has (it's easy to see they won't be needed again). We can go one step further and delete the return address as well because we don't need that, either.
But if we apply this strategy uniformly, we find that we just implemented proper tail recursion. Proper tail recursion is simply releasing the resources as soon as you are done with them rather than waiting until the callee finishes.

grant rettke said...

Where did you get those quotes from Clinger?

grant rettke said...

The "big deal" here is also in terms of usability. Tail-calls allow the programmer to think in terms of functions that do their work and "GOTO" the next place to work. It is a natural concept that is disallowed by languages without tail-calls.

End of the world? No.

Limitation? Yes

Joe Marshall said...

Zedshaw: You are correct that simple loops have the same asymptotic space utilization as the equivalent tail recursive version. However, not all tail recursive programs can be easily converted to a simple loop.
I have no objection to looping constructs per se.

Joe Marshall said...

Notaddicted: Look carefully, the code I posted is correct.

Joe Marshall said...

Grant: I cornered Will in a dark alley and forced them out of him. (Actually, I got them from the paper I referenced at the top of the post.)

Ethan Herdrick said...

What I think stood out most as mistaken about Guido's post was his conflating recursion and recursive data structures in his third point. "For practical purposes, Python-style lists (which are flexible arrays, not linked lists), and sequences in general, are much more useful to start exploring the wonderful world of programming than recursion... Most of Python's library is written with sequences and iterators as fundamental building blocks (and dictionaries, of course), not linked lists, so you'd be locking yourself out of a lot of pre-defined functionality by not using lists or sequences." Well, chained cons cells are lists, too, and recursive functions are good for much more than operating over recursive data structures. Recursion is often the only simple solution in messy real world stuff. Just last week I wrote recursive code to do one of the most real-worldy and least Computer Science-y thing imaginable: get data from a public API. I didn't see any reasonable alternative to recursion. (But since that project was, sadly, not written in a language with TCO I am stuck with a function that's going to break if our little Twitter app gets popular, at which point I'll have to do unreasonable hack. Or maybe I'll port it to Scheme instead.)

But I admire Guido and could be misinterpreting what he said.

Pierre said...

Your even?/odd? example is not complete: (odd? 0) never returns.

It is a nice example of doubly tail recursive functions, but is quite easy to get wrong.

Jonathan Allen said...

> That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in O(1) space is actually taking O(n) space or even worse.

I would like to see this claim proven.

In Windows the stack space is pre-allocated. Short of actually overflowing the stack, it doesn't usually matter how much you use.

Jens Axel SÃ¸gaard said...

Excellent article.

Peter said...

In Windows the stack space is pre-allocated. Short of actually overflowing the stack, it doesn't usually matter how much you use.This is untrue, though it looks that way to a user program. Modern operating systems keep most of the stack's potential size unallocated until the stack grows and throws a page fault. This makes creating new processes fast and saves space (at the cost of a jump into kernel code every now and then).

Jonathan Allen said...

By default, Windows programs request that stack space be committed when it is allocated.

You can override this and allocate stack space without committing it, but that's pretty rare.

Peter said...

Perhaps I'm mistaken then, it's been a while since I've done windows development. However, my reading of this MSDN page is that Windows doesn't allocate any space in physical memory for the stack until it is referenced.

Sean McQuillan said...

Zed,

I feel that there is a disconnect here. If I re-read the original post, I believe the example he is correctly citing would more properly be called inductive programming rather than recursive programming.

Coming up with a "real world" example is putting me at a loss offhand, but some of the data structures I've seen come out of genetics might qualify :). I'll assume for the rest of this post that I for whatever reason want to process an acyclic graph with nodes that contain lists that may point back into the graph.

In my example, it should be pretty obvious that no realistic loop short of a few hundred lines is going to handle both the graph structure, the list structure, and the jump from lists back into the graph. However, a fairly trivial inductive algorithm can handle this data structure cleanly.

With proper tail recursion I can expect O(1) space requirements for my stack while executing my algorithm. Without tail recursion I can expect to cry when trying to analyze the complexity, so I shaln't, but it's obviously going to be something nasty and much much larger than O(1).

I think this as a motivating example for TCO is fairly solid, and probably a much better selling point than "I can write fib with O(n) runtime!"

That said, as you noted, everything can be rewritten in CPS by hand and TCO can be unrolled manually. However, at that point we're just as well arguing that everything is better in assembly?

PS: I'm not saying I really disagree with you here, I just think you missed the motivating example and how it was not explicitly recursive.

Sean McQuillan said...
This comment has been removed by the author.
Unknown said...

I agree with Reid, yoru summary of the reasons is good. One important point is that implementing tail recursion in Python pretty much changes the language, so programmers will write code depending on it. It does affects the O(N) properties of programming, and my understanding of Guido's comments is simply he doesn't like that style, and that Python is none the worse for it. I don't see why Python should be all thing to all people, and I do hope that Python does not catch such a case of C++ivitis at some point.

Max Bolingbroke said...

Zed,

It's certainly true that you could explicitize the TCO by rewriting all of your code in CPS. However:

a) To implement CPS in the Haskell style (as a monad) you /NEED TCO/ to avoid stack overflow!
b) If your language provides CPS with TCO built-in as a primitive then your scheme will work, but rewriting your code in this manner is still highly boring :-)

In fact, Appels' embedding of SSA into CPS relies crucially on continuation-entering not requiring any stack operations!

BTW, I work on a Haskell compiler (GHC), and we do both TCO /and/ treat certainly-dead variables as not being roots for the GC. I.e. TCO alone is not in general sufficient to realise all of the benefits discussed in the post.

Having said all this, I'm not sure TCO is suitable for OO languages like Python. If you do things like inject "aspects" around your function calls - which can happen quite a lot when metaprogramming - then things that look like tailcalls can end up needing to push a stack frame, changing your asymptotic space behaviour.

The loss of debugging info is also a blow :(. In fact, in Haskell if you get an exception there is no stack trace AT ALL! This is because in a lazy language in a sense every call is a tail call.

Cheers,
Max

Harold Fowler said...

LOL, thats pretty good dude!

RT
www.anonymity.es.tc

Joe Lynch said...

Hey, nice article and this is no big deal, but your time complexity analysis is a tad misleading. When you said merge sort is O(n log(n)) you said as an example that 1000 data points would connote ~1000*4 operations, but it is actually more around 1000 * 10 because the log is base 2. This is still significantly better than Bubble sort and such, just wanted you to be correct.

Unknown said...

You seem to be saying that stack space is an 'implementation detail' that an implementer is not expected to know about, and can increase the space complexity of an algorithm. However, if that's the case, merely supporting tail calls isn't going to fix it - any other recursion is going to run into the same issue.

Further, one (significant one) of Guido's points remains completely untouched by your argument: There are many implementations of Python. Either they all have to implement tail recursion (and how likely is that?), or implementers won't be able to rely on tail recursion being present, in which case is there any point in having it?

grant rettke said...

Arachnid: Tail-calls would either be part of the language standard or not, and all implementations which conform would implement it.

A multitude of Scheme implementations do as such.

constance eustace said...

I could be wrong, but I am almost 100% certain that tail-recursive calls are convertable to loops and back without hacks like implementing your own stack.

Any Recursive algorithm can be converted to a loops if you implement a stack, and any loop can be implemented as a recursion.

A space-constant tail-recursive call would certainly convert to a loop that doesn't need a local stack. Because the tail-recursive version doesn't need one.

And Zed Shaw knocked it out of the park.

hacksoncode said...

Purely in the spirit of "all generalizations are wrong":

Bubble sort is only O(n^2) on randomly sorted lists.

On nearly sorted lists, its performance can exceed that of mergesort, shellsort, and/or quicksort, depending on the exact nature of the mis-sorting.

It's really O(N*avg(distanceToSortedPosition)).

Just a pet peeve, and it's only relation to this debate is that tail-recursion's efficiency gains are highly sensitive to programmer proficiency, and are extremely easy to reverse by careless maintenance.

Guillaume said...

constance eustace said...
I could be wrong, but I am almost 100% certain that tail-recursive calls are convertable to loops and back without hacks like implementing your own stack.There is an entire set of design patterns that are not available in languages that lack tail recursion. They are patterns that cannot be converted into loops without breaking open abstraction boundaries, or without losing compositionality.

It is normal that programmers that have never coded in a language with tail recursion would not be familiar with these pattern, and wouldn't miss them. Inversely, programmers that are familiar with them feel their absence.

This box is too small to give examples of these patterns. They are medium- and large-scale modularity patterns, and so presenting them takes some time. I suggest you read the following two papers:

Automata via Macros,
by Shriram Krishnamurthi, in the
Journal of Functional Programming, 2006

Why Functional Programming Matters, by John Hughes, in the Computer Journal and the Year of Programming, 1990.

hacksoncode said...

I find Krishnamurthi's example of a finite state machine as something that inherently requires tails calls to be... curious at best.

Every procedural programmer I know can easily implement a finite state machine with (using C as an example) an enum, a switch, and a for loop.

Not only that, but this kind of implementation is *far* easier to convert to an asynchronous state machine (i.e. where the tokens being evaluated arrive in sequence separated by other events and by time).

I personally have nothing terribly against TCO (as long as you can turn it off for debugging when needed). But I think we're still looking for an example of an easily-understood killer app for it.

David Vanderson said...

The difference is that foo releases its resources before transferring to bar in the tail recursive case, and after bar's computation in the non tail recursive case.I think you meant that sentence to reference baz, not bar.

Excellent article! I hadn't before thought about non-TCO hanging on to variable references. Thanks!

Constance Eustache: "tail-recursive calls are convertable to loops and back"

Yes, but that's a GLOBAL program transformation from t ail-calls to loops, and if you allow such transformation, INTERCAL is just as good a language as Python.

On the other hand, loops can be quite elegantly macro-expressed in tail-calls (see notably Olin Shivers' article on loops), which is a matter of purely local transformations.

PS: my take on Guido's article: http://fare.livejournal.com/142410.html
http://neopythonic.blogspot.com/2009/04/tail-recursion-elimination.html?showComment=1240873740000#c8956549754909773441

Unknown said...

Bubblesort is not bad because it is slow. It is bad because it is surprising.

Performance (and other) surprises should be avoided. The fact that python consumes stack space for tail calls surprises no-one.

Bubble sort has surprisingly bad performance for large data.

The use of reference counting garbage collection is much more likely could create surprises, perhaps they should fix that?

Unknown said...

FWIW: Your odd? function is missing a base case. The functions can never return #f!

Robert said...

It seems like it’d be better to talk about TCO and leave recursion out of it. The benefits to recursion are a subset of the benefits of TCO.

Just about every language implementer I’ve ever met would be very proud of implementing TCO. It’s hard to imagine one arguing against it.

Although, I suppose Python tends to conflate specification with implementation. I can understand wanting to keep it out of the specification.

Still, it’s hard for me to understand a language specification not wanting to give programmers more tools—balanced against other issues, of course—rather than fewer. Although, that’s Python’s “there’s only one way to do it” philosophy.

Isn’t FarÃ© right: Implementing loops on top of TCO is easier than implementing TCO on top of loops? Scheme’s path of putting TCO in the spec and providing macros to allow looping constructs to be build on top of it in a library seems awfully sensible to me.