Will Clinger's paper "Proper tail recursion and space efficiency" (Proceedings of the 1998 ACM Conference on Programming Language Design and Implementation, June 1998, pages 174-185) is the best reference I've found on the topic.
First of all, let's be clear about what ‘proper tail recursion’ is. Clinger defines it this way:
The essence of proper tail recursion is that a procedure can return by performing a tail call to any procedure, including itself.

It is important to recognize that we're not just talking about loops written in tail-recursive style, but about any procedure that (tail) calls another in order to compute the return value.
It is a common misconception that tail recursion is primarily a syntactic sleight-of-hand that allows a programmer to use a function call rather than a traditional iteration construct. A Scheme programmer might write this:
(define (factorial x)
  (define (fact-iter n answer)
    (if (> n x)
        answer
        (fact-iter (+ n 1) (* answer n))))
  (fact-iter 1 1))
Where a more traditional formulation is this:
int factorial (int x)
{
  int answer = 1;
  for (int n = 1; n <= x; n++)
    answer *= n;
  return answer;
}
The Scheme programmer relies on the compiler to recognize that the recursive call to fact-iter is intended to be a loop. He expects that the code will compile to something very close to the traditional formulation. The MIT-Scheme compiler generates this code:
;; Note: some instructions have been omitted for clarity
;; (tagging, interrupt polling, etc.)
fact-iter-3:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (assign (register #x12) (offset (register 4) (machine-constant 2)))
(mov w (r 1) (@ro b 4 8))
;; (fixnum-pred-2-args greater-than-fixnum? (register #x11) (register #x13))
(cmp w (r 0) (r 1))
(jg (@pcr label-6))
;; (assign (register #x15) (offset (register 4) (machine-constant 1)))
(mov w (r 1) (@ro b 4 4))
;; (assign (register #x19) (fixnum-2-args multiply-fixnum (register #x16) (register #x11) #f))
(imul w (r 1) (r 0))
;; (assign (offset (register 4) (machine-constant 1)) (register #x14))
(mov w (@ro b 4 4) (r 1))
;; (assign (register #x1d) (fixnum-1-arg one-plus-fixnum (register #x11) #f))
(add w (r 0) (& #x5a))
;; (assign (offset (register 4) (machine-constant 0)) (register #x1a))
(mov w (@r 4) (r 0))
;; (invocation:jump 2 #f fact-iter-3)
(jmp (@pcr fact-iter-3))
As you can see, the compiler dutifully turned it into a loop.
Not everyone is aware that this works across procedure boundaries.
(define (even? x)
(if (= x 0)
#t
(odd? (- x 1))))
(define (odd? x)
(even? (- x 1)))
Yeah, I blew it here by forgetting to test for a base case. But the conditional wasn't the interesting part anyway.
;; Note: some instructions have been omitted for clarity
;;
even?-2:
;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (eq-test (register #x10) (constant 0))
(cmp w (r 0) (&u #x68000000))
(je (@pcr label-4))
;; (lap-opt fixnum-add-const-in-place)
;; (assign (register #x14) (object->fixnum (register #x10)))
;; (assign (register #x15) (fixnum-1-arg minus-one-plus-fixnum (register #x14) #f))
;; (assign (register #x12) (fixnum->object (register #x15)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x12))
(mov w (@r 4) (r 0))
;; (invocation:uuo-link 2 #f odd?)
(jmp odd?-1)
label-4:
;; (assign (offset (register 6) (machine-constant 2)) (constant #t))
(mov w (@ro b 6 8) (&u #x20000000))
;; (pop-return)
(return)
odd?-1:
;; (assign (register #x11) (offset (register 4) (machine-constant 0)))
(mov w (r 0) (@r 4))
;; (lap-opt fixnum-add-const-in-place)
;; (assign (register #x12) (object->fixnum (register #x11)))
;; (assign (register #x13) (fixnum-1-arg minus-one-plus-fixnum (register #x12) #f))
;; (assign (register #x10) (fixnum->object (register #x13)))
(add w (r 0) (& #x3ffffff))
(and w (r 0) (& #x6bffffff))
;; (assign (offset (register 4) (machine-constant 0)) (register #x10))
(mov w (@r 4) (r 0))
;; (invocation:uuo-link 2 #f even?)
(jmp even?-2)
And people have pointed out that this may be surprising and baffling to someone who expects to see the history of the computation laid out on the stack.

Scheme programmers make a big deal out of this. They point out that with proper tail recursion, Scheme doesn't need looping constructs like while or for or do because the user can write them himself. But Scheme hackers are a fringe group of the archetypical fringe group: Lisp hackers. They enjoy removing features from the language that they consider ‘unnecessary’. Then they write papers that somehow ‘prove’ that this is better. Of course these academic papers contain an awful lot of math and not very many actual programs.

In fact, Scheme programmers will try to sneak tail recursion into other programming languages. There's hardly a point to doing so because other languages have loops already. But now you get programs written by wet-behind-the-ears neophytes that use tail recursion rather than something more obvious. Furthermore, you cannot debug this code because the tail recursion has erased the backtrace.
And what does this buy you? In theory, when you exit a procedure with a tail-recursive call, you can replace the call/return instruction pair with a jump. In theory, that's not much of an improvement. In practice, however, it isn't quite that simple. The stack usually has to be adjusted one way or the other and this completely wipes out any benefit to be seen from avoiding the push and pop of the return address.
To summarize, this point of view about tail recursion is based on these ideas:
- Its purpose is to avoid writing looping constructs, which are somehow considered ‘bad’ by ‘purists’. These weirdos ‘think’ in loops, then transform the code to be ‘recursive with tail calls’ when they write it, and then expect the compiler to transform it back. This is academic mental gymnastics with no practical purpose.
- People who like tail recursion are purposefully burdening their code in the name of ‘mathematical’ or ‘theoretical’ purity, but they expect the language implementor to bear the weight.
- Any code that depends on tail recursion can easily be rewritten to use a looping construct instead. Such a rewrite is a trivial local transformation.
- Inter-procedural tail recursion may have some performance benefit, but such benefit is very small at best, and the tradeoff is the loss of valuable debugging information.
If these were in fact the key issues to be argued, then there would be little more to say beyond de gustibus non est disputandum. But these arguments are a sideshow. The true argument about tail recursion is that it reduces the space complexity class of a program and allows a program access to a more powerful computation model. This argument is subtle.
Let me digress for a moment. Remember the Bubble Sort algorithm? If I were to write some code that used Bubble Sort to maintain the order of elements in a set, I would be lucky if I only got laughed at. The Jargon File calls Bubble Sort `the generic bad algorithm'.
What makes Bubble Sort so bad? It is extremely inefficient. On average, Bubble Sort performs O(n^2) operations to sort a set of n elements. The Big-O notation gives the asymptotic time complexity of the algorithm. O(n^2) means that when the value of n is large enough, the number of operations grows proportionally to the square of the number of elements. So if we were to Bubble Sort a set of one thousand objects, we might expect the number of operations to be in the millions. If our set grew to ten thousand objects, the amount of time to sort it would grow by a hundredfold.

Let's compare this to mergesort. Mergesort has an asymptotic time complexity of O(n log n). A mergesort of one thousand objects would take something like 1000 * 4 operations. This is much less than the millions required for a Bubble Sort of the same set. If we grew the set to ten thousand items, the amount of time to sort it would only grow by a factor of a little more than ten, certainly far less than the hundredfold growth we see in Bubble Sort.

Bubble Sort is bad because it wastes a valuable resource: time. And when you let the problem grow larger, it wastes a disproportionately larger amount of time compared to other, more efficient algorithms.
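To make the comparison concrete (this arithmetic is mine, not the original post's, and takes the logarithm base 2): growing the input from n = 1,000 to 10n = 10,000 multiplies n^2 by (10n)^2 / n^2 = 100, but multiplies n log n by only 10 * log(10n) / log(n), which is roughly 10 * 13.3 / 10.0, or about 13.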
When you determine a strategy to solve a problem, you take a close look at the complexity classes of your algorithms. You'd like a solution within a reasonable amount of time, and you'd like to avoid using an unreasonable amount of space (whether memory or disk). Picking an inefficient algorithm for some minor part of your program can have a major impact on the resource usage of your program. Even the small change from O(n^2) to O(n log n) makes a huge difference.

Asymptotic time complexity is not the only measure of an algorithm's efficiency. Space (memory usage) can be as much of a limiting factor as time, so asymptotic space complexity should be considered. Consider the diff program that compares two files. The naive implementation of diff using dynamic programming requires a rectangular tableau whose size is the product of the lengths of the two files being compared. Diffing a pair of fairly large files in this way will exhaust memory. A tradeoff can be made where multiple passes are needed to complete the diff, but the space complexity, rather than being O(n*m), is O(n + m). It takes longer, but now truly enormous files can be diffed without running out of memory.

So what does this have to do with tail recursion?
A program describes a process. An implementation takes a program and performs the process the program describes. The process that emerges has a space and time complexity that is characteristic of both the implementation and the program.
Programmers usually expect to be able to ignore the implementation's contribution to the time and space complexity of a problem. A programmer would be surprised if he had coded up a mergesort with the expected O(n log n) asymptotic time complexity but found that the actual asymptotic time complexity on his computer was O(n^2). He would be understandably upset if an obviously linear text search required O(n^2) asymptotic space complexity in the size of the corpus.

To quote Clinger:

Proper tail recursion yields a well-defined asymptotic space complexity that can be relied upon by portable programs.

and

Among these [space consumption] properties of an implementation, proper tail recursion is especially important because it is easy to construct programs that should run in small constant space, but are likely to require at least linear space in implementations that are not properly tail recursive.
That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in O(1) space is actually taking O(n) space or even worse.

How could this be?
Consider this very simple program:
(define (foo x)
  (baz (bar x)))

In a non tail recursive implementation, the call to baz would return back to foo, which would then return to foo's caller. In a tail recursive implementation, foo would directly jump to baz and baz would return directly to foo's caller. The difference is that foo releases its resources before transferring to baz in the tail recursive case, and after baz's computation in the non tail recursive case. There are two resources of interest. The obvious one is the stack frame itself. The less obvious one is the reference to the value of X.
This is very important. Code like the above is typical when one is performing a computation that involves part of a data structure. The call to bar traverses X to find the relevant sub-structure to operate upon. Once this substructure is located, it is no longer necessary to retain the enclosing structure X. The tail recursive implementation immediately discards the reference to X and allows the garbage collector to reclaim the resources. The non tail recursive implementation unnecessarily retains X during the computation of baz.
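To make this concrete, here is a small sketch of my own (not from the post or from Clinger's paper); the definitions of bar and baz below are stand-ins I invented purely for illustration:

;; bar extracts one element of a large vector; baz works only on that element.
(define (bar x) (vector-ref x 0))
(define (baz item) (* item item))

(define (foo x)
  (baz (bar x)))

;; With proper tail calls, foo's frame, and with it the only reference to
;; the million-element vector, is released before baz starts, so the
;; garbage collector may reclaim the vector while baz runs.  Without proper
;; tail calls, the vector stays live until baz returns.
(foo (make-vector 1000000 42))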
This is not simply an inconvenience. If X holds substantial resources, we may run out during baz. If baz somehow depends on a recursive call to foo, we'll find our space complexity is not a function of the relevant information sitting in the bar portion of X, but a function of X itself, the container of our information. If we write a program that processes text sentence by sentence, we'd be surprised to find that the implementation uses space proportional to the entire document!

If one objects to Bubble Sort on the grounds of its poor asymptotic time complexity compared to better algorithms, then one clearly should object to a lack of tail recursion. It has even worse asymptotic space complexity, and it affects the entire program. (At least Bubble Sort doesn't slow down your program if you don't sort anything.)
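Returning to that sentence-by-sentence example, here is a sketch of mine (not from the post) of the shape of such a driver loop; for illustration the document is just a list of sentences and process-sentence merely counts words:

(define (process-sentence sentence)
  (length sentence))

(define (process-document sentences total)
  (if (null? sentences)
      total
      ;; Tail call: no frame is left behind per sentence.
      (process-document (cdr sentences)
                        (+ total (process-sentence (car sentences))))))

;; With proper tail calls this runs in constant stack space no matter how
;; long the document is, and sentences already consumed can be reclaimed as
;; we cdr down the list.  Without proper tail calls, every sentence leaves a
;; frame behind, so space grows with the size of the whole document.
(process-document '((the cat sat) (on the mat)) 0)   ; => 6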
Part II on its way....
(By the way, if you have questions or need some clarification on what I've said so far, please comment or email me.)
I think you've written a really good summary of the reasons Guido doesn't like tail recursion. It communicates his view even better than he does.
However, I feel like you could address your concern with memory usage in this article (I'm expecting other reasons in part II) with a simpler optimization that doesn't sacrifice debugging. Just do variable liveness analysis and notice that the variable isn't used after the call, so you can drop its reference before doing the call. Presumably you pushed a copy of the reference onto the stack as an argument, so it won't get garbage collected before baz gets done using it.
If I understand you right, you claim that proper tail recursion reduces the space complexity, but what you fail to say is *compared to recursion without proper tail calls*.
Kind of a nice trick of the semantics there, since you seem to be comparing proper tail recursion to simple loops, but you're really comparing it to poorly implemented functional languages without proper tail recursion. Simple loops also have the same space requirements as tail calls, so your stated benefits don't apply.
Given that the real comparison of interest is efficiency *and usability* of tail recursion vs. simple looping constructs, I believe GvR's statements win.
Finally, Appel proved that SSA is the same as CPS and that since any language can be translated to either, that means you can give a user any syntax that works for them and still give them any stated benefits you claim for functional programming languages.
Since that's the case, the real question isn't the efficiency of one language over another (they are all the same), it is actually whether one language is more usable than another.
And the problem of language usability isn't solved by O() notation.
minor error in the factorial function:
(define (factorial x)
(define (fact-iter n answer)
(if (> n x)
answer
(fact-iter (+ n 1) (* answer n))))
(fact-iter __x__ 1))
... looking forward to part 2.
Reid: You can certainly delete the reference to X when you notice it is no longer ‘live’. In fact, when you call out to bar you can delete all the references foo has (it's easy to see they won't be needed again). We can go one step further and delete the return address as well because we don't need that, either.
But if we apply this strategy uniformly, we find that we just implemented proper tail recursion. Proper tail recursion is simply releasing the resources as soon as you are done with them rather than waiting until the callee finishes.
Where did you get those quotes from Clinger?
The "big deal" here is also in terms of usability. Tail-calls allow the programmer to think in terms of functions that do their work and "GOTO" the next place to work. It is a natural concept that is disallowed by languages without tail-calls.
ReplyDeleteEnd of the world? No.
Limitation? Yes
Zedshaw: You are correct that simple loops have the same asymptotic space utilization as the equivalent tail recursive version. However, not all tail recursive programs can be easily converted to a simple loop.
I have no objection to looping constructs per se.
Notaddicted: Look carefully, the code I posted is correct.
Grant: I cornered Will in a dark alley and forced them out of him. (Actually, I got them from the paper I referenced at the top of the post.)
What I think stood out most as mistaken about Guido's post was his conflating recursion and recursive data structures in his third point. "For practical purposes, Python-style lists (which are flexible arrays, not linked lists), and sequences in general, are much more useful to start exploring the wonderful world of programming than recursion... Most of Python's library is written with sequences and iterators as fundamental building blocks (and dictionaries, of course), not linked lists, so you'd be locking yourself out of a lot of pre-defined functionality by not using lists or sequences." Well, chained cons cells are lists, too, and recursive functions are good for much more than operating over recursive data structures. Recursion is often the only simple solution in messy real world stuff. Just last week I wrote recursive code to do one of the most real-worldy and least Computer Science-y things imaginable: get data from a public API. I didn't see any reasonable alternative to recursion. (But since that project was, sadly, not written in a language with TCO I am stuck with a function that's going to break if our little Twitter app gets popular, at which point I'll have to do an unreasonable hack. Or maybe I'll port it to Scheme instead.)
But I admire Guido and could be misinterpreting what he said.
Your even?/odd? example is not complete: (odd? 0) never returns.
It is a nice example of doubly tail recursive functions, but is quite easy to get wrong.
> That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in O(1) space is actually taking O(n) space or even worse.
I would like to see this claim proven.
In Windows the stack space is pre-allocated. Short of actually overflowing the stack, it doesn't usually matter how much you use.
Excellent article.
> In Windows the stack space is pre-allocated. Short of actually overflowing the stack, it doesn't usually matter how much you use.

This is untrue, though it looks that way to a user program. Modern operating systems keep most of the stack's potential size unallocated until the stack grows and throws a page fault. This makes creating new processes fast and saves space (at the cost of a jump into kernel code every now and then).
More info: http://en.wikipedia.org/wiki/Virtual_memory
By default, Windows programs request that stack space be committed when it is allocated.
You can override this and allocate stack space without committing it, but that's pretty rare.
Perhaps I'm mistaken then, it's been a while since I've done windows development. However, my reading of this MSDN page is that Windows doesn't allocate any space in physical memory for the stack until it is referenced.
Zed,
I feel that there is a disconnect here. If I re-read the original post, I believe the example he is correctly citing would more properly be called inductive programming rather than recursive programming.
Coming up with a "real world" example is putting me at a loss offhand, but some of the data structures I've seen come out of genetics might qualify :). I'll assume for the rest of this post that I for whatever reason want to process an acyclic graph with nodes that contain lists that may point back into the graph.
In my example, it should be pretty obvious that no realistic loop short of a few hundred lines is going to handle both the graph structure, the list structure, and the jump from lists back into the graph. However, a fairly trivial inductive algorithm can handle this data structure cleanly.
With proper tail recursion I can expect O(1) space requirements for my stack while executing my algorithm. Without tail recursion I can expect to cry when trying to analyze the complexity, so I shan't, but it's obviously going to be something nasty and much much larger than O(1).
I think this as a motivating example for TCO is fairly solid, and probably a much better selling point than "I can write fib with O(n) runtime!"
That said, as you noted, everything can be rewritten in CPS by hand and TCO can be unrolled manually. However, at that point we're just as well arguing that everything is better in assembly?
PS: I'm not saying I really disagree with you here, I just think you missed the motivating example and how it was not explicitly recursive.
This comment has been removed by the author.
I agree with Reid, your summary of the reasons is good. One important point is that implementing tail recursion in Python pretty much changes the language, so programmers will write code depending on it. It does affect the O(N) properties of programming, and my understanding of Guido's comments is simply that he doesn't like that style, and that Python is none the worse for it. I don't see why Python should be all things to all people, and I do hope that Python does not catch such a case of C++ivitis at some point.
ReplyDeleteZed,
It's certainly true that you could explicitize the TCO by rewriting all of your code in CPS. However:
a) To implement CPS in the Haskell style (as a monad) you /NEED TCO/ to avoid stack overflow!
b) If your language provides CPS with TCO built-in as a primitive then your scheme will work, but rewriting your code in this manner is still highly boring :-)
In fact, Appel's embedding of SSA into CPS relies crucially on continuation-entering not requiring any stack operations!
BTW, I work on a Haskell compiler (GHC), and we do both TCO /and/ treat certainly-dead variables as not being roots for the GC. I.e. TCO alone is not in general sufficient to realise all of the benefits discussed in the post.
Having said all this, I'm not sure TCO is suitable for OO languages like Python. If you do things like inject "aspects" around your function calls - which can happen quite a lot when metaprogramming - then things that look like tailcalls can end up needing to push a stack frame, changing your asymptotic space behaviour.
The loss of debugging info is also a blow :(. In fact, in Haskell if you get an exception there is no stack trace AT ALL! This is because in a lazy language in a sense every call is a tail call.
Cheers,
Max
LOL, that's pretty good dude!
RT
www.anonymity.es.tc
Hey, nice article and this is no big deal, but your time complexity analysis is a tad misleading. When you said merge sort is O(n log(n)) you said as an example that 1000 data points would connote ~1000*4 operations, but it is actually more around 1000 * 10 because the log is base 2. This is still significantly better than Bubble sort and such, just wanted you to be correct.
You seem to be saying that stack space is an 'implementation detail' that an implementer is not expected to know about, and can increase the space complexity of an algorithm. However, if that's the case, merely supporting tail calls isn't going to fix it - any other recursion is going to run into the same issue.
ReplyDeleteFurther, one (significant one) of Guido's points remains completely untouched by your argument: There are many implementations of Python. Either they all have to implement tail recursion (and how likely is that?), or implementers won't be able to rely on tail recursion being present, in which case is there any point in having it?
Arachnid: Tail-calls would either be part of the language standard or not, and all implementations which conform would implement it.
A multitude of Scheme implementations do as such.
I could be wrong, but I am almost 100% certain that tail-recursive calls are convertable to loops and back without hacks like implementing your own stack.
Any recursive algorithm can be converted to a loop if you implement a stack, and any loop can be implemented as a recursion.
A space-constant tail-recursive call would certainly convert to a loop that doesn't need a local stack. Because the tail-recursive version doesn't need one.
And Zed Shaw knocked it out of the park.
Purely in the spirit of "all generalizations are wrong":
Bubble sort is only O(n^2) on randomly sorted lists.
On nearly sorted lists, its performance can exceed that of mergesort, shellsort, and/or quicksort, depending on the exact nature of the mis-sorting.
It's really O(N*avg(distanceToSortedPosition)).
Just a pet peeve, and its only relation to this debate is that tail-recursion's efficiency gains are highly sensitive to programmer proficiency, and are extremely easy to reverse by careless maintenance.
constance eustace said...
I could be wrong, but I am almost 100% certain that tail-recursive calls are convertable to loops and back without hacks like implementing your own stack.

There is an entire set of design patterns that are not available in languages that lack tail recursion. They are patterns that cannot be converted into loops without breaking open abstraction boundaries, or without losing compositionality.
It is normal that programmers that have never coded in a language with tail recursion would not be familiar with these patterns, and wouldn't miss them. Inversely, programmers that are familiar with them feel their absence.
This box is too small to give examples of these patterns. They are medium- and large-scale modularity patterns, and so presenting them takes some time. I suggest you read the following two papers:
Automata via Macros, by Shriram Krishnamurthi, in the Journal of Functional Programming, 2006.
Why Functional Programming Matters, by John Hughes, in the Computer Journal and the Year of Programming, 1990.
I find Krishnamurthi's example of a finite state machine as something that inherently requires tail calls to be... curious at best.
Every procedural programmer I know can easily implement a finite state machine with (using C as an example) an enum, a switch, and a for loop.
Not only that, but this kind of implementation is *far* easier to convert to an asynchronous state machine (i.e. where the tokens being evaluated arrive in sequence separated by other events and by time).
I personally have nothing terribly against TCO (as long as you can turn it off for debugging when needed). But I think we're still looking for an example of an easily-understood killer app for it.
> The difference is that foo releases its resources before transferring to bar in the tail recursive case, and after bar's computation in the non tail recursive case.

I think you meant that sentence to reference baz, not bar.
Excellent article! I hadn't before thought about non-TCO hanging on to variable references. Thanks!
Constance Eustache: "tail-recursive calls are convertable to loops and back"
Yes, but that's a GLOBAL program transformation from tail-calls to loops, and if you allow such transformation, INTERCAL is just as good a language as Python.
On the other hand, loops can be quite elegantly macro-expressed in tail-calls (see notably Olin Shivers' article on loops), which is a matter of purely local transformations.
PS: my take on Guido's article: http://fare.livejournal.com/142410.html
http://neopythonic.blogspot.com/2009/04/tail-recursion-elimination.html?showComment=1240873740000#c8956549754909773441
Bubblesort is not bad because it is slow. It is bad because it is surprising.
Performance (and other) surprises should be avoided. The fact that Python consumes stack space for tail calls surprises no-one.
Bubble sort has surprisingly bad performance for large data.
The use of reference counting garbage collection is much more likely to create surprises; perhaps they should fix that?
FWIW: Your odd? function is missing a base case. The functions can never return #f!
It seems like it’d be better to talk about TCO and leave recursion out of it. The benefits to recursion are a subset of the benefits of TCO.
Just about every language implementer I’ve ever met would be very proud of implementing TCO. It’s hard to imagine one arguing against it.
Although, I suppose Python tends to conflate specification with implementation. I can understand wanting to keep it out of the specification.
Still, it’s hard for me to understand a language specification not wanting to give programmers more tools—balanced against other issues, of course—rather than fewer. Although, that’s Python’s “there’s only one way to do it” philosophy.
Isn’t Faré right: Implementing loops on top of TCO is easier than implementing TCO on top of loops? Scheme’s path of putting TCO in the spec and providing macros to allow looping constructs to be built on top of it in a library seems awfully sensible to me.