Wednesday, January 28, 2009
ILC 09 Coming up
I had hoped to write a bit more about the Advanced Symbolic Programming course, but ILC '09 is coming up soon and there is work to be done.
Friday, January 23, 2009
Propagator networks are a pain to program. You have to break up your
problem into a set of computation nodes, allocate cells for
intermediate values, and wire up the inputs and outputs. It's pretty
straightforward to do this by hand for a small program, but it would
be quite nasty for something more complex. In addition, propagator
networks for recursive functions require a bit of trickery to get them
to work. On the other hand, the standard recursive descent evaluation
strategy, while being a much easier model to program to, imposes
implicit restrictions on evaluation order that get in the way of
non-deterministic programming.
The solution is to compile our Scheme programs into a propagator network. This is what Chris Hanson showed to us on day 5. This is doable, but it is quite tricky. I don't think Chris had the time to explain all the nuances of the compilation, but if he had a difficult time making it work correctly, it can't be easy.
One thing that I thought was unusual was the way recursive procedures were compiled. The propagator network for a recursive procedure is not instantiated and linked until the procedure is actually invoked. This allows recursive procedures to run by instantiating only as much of the propagator graph as necessary to compute the answer. What strikes me as unusual about this is that a standard expression-based language usually prevents runaway recursion by making conditionals non-strict. The propagator network seems to have the recursion grounded in a different place. I asked Chris if this was an issue and he thinks it may be what is causing some of the compilation difficulties.
Now that we have generated a propagator network, we need to run it. It is relatively straightforward to create a scheduling process that notes which value cells have changed and pushes the values through the appropriate computation nodes, but the performance is nothing to write home about. What is more interesting is to create a propagation-based virtual machine and run our propagator network against that. We can use JIT technology to accelerate the virtual machine and get reasonable performance. It may also be possible to partition the propagator network into independent or loosely coupled subnets. If so, then we can distribute the computation of the subnets among several physical machines.
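To make that concrete, here's roughly what such a naive scheduler could look like (my own sketch with made-up names, not Chris's code): changed cells put their reader nodes on an agenda, and we drain the agenda until the network is quiet.

    ;; Sketch of a naive propagator scheduler (my reconstruction).
    ;; Cells remember their value and their reader nodes; writing a
    ;; changed value queues the readers on an agenda.

    (define *agenda* '())

    (define (schedule! nodes)
      (set! *agenda* (append *agenda* nodes)))

    (define (make-cell) (vector 'nothing '()))   ; value, reader nodes
    (define (cell-value cell) (vector-ref cell 0))
    (define (connect! cell node)
      (vector-set! cell 1 (cons node (vector-ref cell 1))))

    (define (add-content! cell value)
      (unless (equal? (cell-value cell) value)   ; only changes propagate
        (vector-set! cell 0 value)
        (schedule! (vector-ref cell 1))))

    ;; Push values through nodes until no work remains.  A node is a
    ;; thunk that reads its input cells and writes its output cell.
    (define (run!)
      (unless (null? *agenda*)
        (let ((node (car *agenda*)))
          (set! *agenda* (cdr *agenda*))
          (node)
          (run!))))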
This is where we wrapped up the course.
Next post, maybe I'll mention some ideas that occurred to me while I was listening.
Suppose we are writing a chess playing program where each move is
computed by a non-deterministic process that was implemented as a
search with backtracking. In fact, most chess playing programs are
implemented this way. In the mid game, the program searches a tree of
potential board positions by examining the possible moves one-by-one.
Now it is not the case that your favorite chess program is written
using
amb. There is a hand-written loop that generates and tests various moves in various orders looking for a good series of moves that leads to a good position. But conceptually, you could use amb and it would work about the same (if slower).
Now as you search for a good move, you have a temporal constraint. You must move before your opponent can respond, and you must wait for your opponent's move before you can make the next move. The computer simulates you and your opponent taking turns moving pieces and then looks carefully at the possible board positions at the end of the series of moves to determine which side is winning. If the computer is losing, it backtracks to one of its earlier moves and tries an alternative. But this method of backtracking has a big problem: every time you backtrack a move, you forget your opponent's response. So suppose you find that if you move your knight, your opponent will kill you with a queen-rook combination 3 moves later. Aha! We'll backtrack to that knight move and try a bishop. And your opponent kills you with the same queen-rook combination 3 moves later. Aha, we'll backtrack to that bishop and try a pawn. (You know what happens.) The problem is that your opponent's queen-rook combination is independent of the knight, bishop, or pawn moves on your part, but each time you backtrack you forget this and rediscover it. What you really want to do is see that the queen-rook combination in three moves is a threat and figure out a move *now* that affects that outcome. How do you backtrack without forgetting?
Return to the multiple-dwelling puzzle. If you tried to re-order the clauses that generate possible solutions, you'd notice that there is no optimal ordering. Optimizing for one set of constraints makes another one less optimal and you cannot satisfy all at once. Sussman pointed out that this is a direct consequence of the underlying evaluation strategy. Nearly every computer language uses a ‘recursive descent’ evaluation model. (Even if the language is compiled, the various subexpressions are run in a recursive descent order.) To a certain extent, this is an artifact of the abstract syntax being a tree. Each node of the abstract syntax tree represents a language construct that can contain zero or more subnodes (subexpressions), but has exactly one parent node (the containing expression). It is natural to interpret the language through a tree-walk, and thus we get a recursive descent evaluator.
In fact, it is so natural, it is really hard to imagine something different. But an example of a different evaluation mode might be the way Scheme's syntax-rules hygienic macro system works. Rather than recursive evaluation of the subforms of an expression, macro expansion involves a pattern match against the entire expression followed by a rewrite of the form. (Then comes a recursive descent phase when we expand the subforms.)
A recursive descent evaluator uses the stack to keep track of the control state of our program. When we backtrack to a decision point, we return to a prior control point by discarding the intermediate state. We are changing the control flow of the program. The unwanted side effect of this is that we are also changing the data flow of our program. The values that are returned from subexpressions are computed as we evaluate the subexpressions, and if we back out of the evaluation, we forget the values. But what if we could separate the control flow from the data flow? We cannot completely divorce the two because computing a value usually means transferring control to the expression that denotes the value, but we need not forget the value once it is computed if we back out of a computation.
Here's how we do it: We take our original Scheme program (like the multiple dwelling problem) and generate a propagator network that computes the same value.
AMB nodes in the propagator network arbitrarily pick one of the values to emit. As suggested earlier, we'll propagate information about the values (the provenance) along with the values themselves. At the require nodes, we check to see if we have satisfied the requirement. If we have, everything is ok, but if not, we examine the provenances of the values that the requirement depends upon. The relevant AMB nodes will be in the provenances, and we can select one or more of those nodes to request a different value. This is backtracking, but it only backtracks the control directly involved in the data flow that supplies the requirement. Other parts of the computation remain unaffected. This is true even if the other parts of the computation were originally syntactically intertwined with the part we are backing out of. This is called dependency-directed backtracking.
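Here's a toy illustration of the mechanism (entirely my own sketch; supported, s+, and the amb-N labels are invented): values carry the set of choice points that support them, and a failing requirement names exactly those culprits.

    ;; Toy dependency-directed values (my sketch; names invented).
    ;; A supported value pairs a datum with its provenance: the amb
    ;; choice points that produced it.

    (define (supported datum provenance) (cons datum provenance))
    (define (datum v)      (car v))
    (define (provenance v) (cdr v))

    ;; Combining values unions their provenances.
    (define (s+ a b)
      (supported (+ (datum a) (datum b))
                 (append (provenance a) (provenance b))))

    ;; A failing requirement does not unwind the stack; it names
    ;; the choice points implicated in the offending value.
    (define (require-check ok? v)
      (if (ok? (datum v))
          'satisfied
          (list 'retry-one-of (provenance v))))

    (define v (s+ (supported 2 '(amb-3)) (supported 3 '(amb-5))))
    (require-check odd? v)    ; => satisfied
    (require-check even? v)   ; => (retry-one-of (amb-3 amb-5))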
We can extend this idea to be more sophisticated. Rather than backtracking to an AMB node and telling it to blindly choose a different alternative, we may be able to give it some hints to select an alternative in a more intelligent way. For instance, if we have deduced that Cooper lives on the first floor, but find we have to recompute Fletcher (who is not on the top or bottom, nor next to Cooper), we can use the constraints to immediately put Fletcher on floor 3 rather than cycling through all the possibilities.
It took a bit of time to get to this point. Propagator networks, McCarthy's AMB, recursive descent evaluation, and backtracking are all old ideas, so it was easy to listen to this part of the seminar and more or less consider it just a review. But while the parts may be familiar, this way of combining them is new. In the next post, I'll review day 5, where we put a bit more thought into building programs this way.
Thursday, January 22, 2009
Day 4, Part 2
The next part of the talk was given by Alexey Radul, a graduate
student of Gerry's. Alexey described propagator networks. The idea
is quite simple: you create a network that is built out of two kinds
of parts. One kind of part is a stateless computational node.
Computation nodes can have arbitrary inputs and outputs, but the
outputs are a pure function of the inputs. (This doesn't mean that
the guts of the node can't have state, just that any such state cannot
be observed from examination of the outputs.) The second kind of part
is a stateful value cell. Cells do no computation at all, but they
save the results of the computational nodes. You compute things by
creating a network that is specialized to the problem you are trying
to solve.
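Here's a minimal sketch of the two kinds of parts (my own reconstruction in vanilla Scheme, not Alexey's actual code): cells hold state and notify their readers, and a node recomputes a pure function of its inputs whenever an input changes.

    ;; Cells hold state; computation nodes are stateless.

    (define (make-cell)
      (let ((value 'nothing)
            (readers '()))
        (lambda (message . args)
          (case message
            ((value)   value)
            ((attach!) (set! readers (cons (car args) readers)))
            ((set!)    (let ((new (car args)))
                         (unless (equal? value new)   ; only changes propagate
                           (set! value new)
                           (for-each (lambda (notify) (notify)) readers))))))))

    (define (cell-value cell)         (cell 'value))
    (define (add-content! cell value) (cell 'set! value))

    ;; A node recomputes a pure function of its inputs whenever
    ;; any input cell changes.
    (define (propagator! f inputs output)
      (define (me)
        (let ((values (map cell-value inputs)))
          (unless (memq 'nothing values)
            (add-content! output (apply f values)))))
      (for-each (lambda (cell) (cell 'attach! me)) inputs)
      (me))

    ;; Example network: fahrenheit = celsius * 9/5 + 32.
    (define celsius    (make-cell))
    (define fahrenheit (make-cell))
    (propagator! (lambda (c) (+ (* c 9/5) 32)) (list celsius) fahrenheit)
    (add-content! celsius 100)
    (cell-value fahrenheit)           ; => 212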
Computation networks like this are not a new idea. The first spreadsheet programs developed in the late 60s and early 70s are computation networks. (Spreadsheets have a lot of structure that is an artifact of the original paper model. For example, the physical layout of rows and columns of cells is very strongly built in to the computation model.) Despite their age, propagation networks are still a model that is worth exploring. Gerry and Alexey described some interesting ideas.
First, Gerry and Alexey have narrowed their exploration to unidirectional propagation. Each computation node only knows how to compute the output from the input. This is different from a constraint network where a computation node may be able to ‘run backwards’ and deduce an input value given the output. Constraints are trivially added by simply making a new propagator that runs from the output back to the input. That is, the computation nodes are unidirectional, but the network topology can have circular connections and can therefore mimic bi-directional propagation.
The ability to create arbitrary connections is the reason we want to use a propagator network, but you are immediately faced with the key problem of such a network: what do you do if you compute a value, but the receiving cell already has a value? There are a few potential solutions to this problem:
- Make it impossible. Require that the computation network is strictly tree-like. This is boring.
- Make it illegal. Raise an error if the network ever attempts
to modify a value cell that already has a value.
- Make it illegal unless it is benign. Raise an error if the network ever attempts to *change* a value in a cell. Attempts to re-write the existing value are ignored. This raises the question of what might be considered equivalent. At one end of the spectrum, the new value might be required to be EQ to the existing one to be considered benign. At the other end, maybe they both match the same pattern.
- Ignore it. Discard the new value.
- Allow it. Discard the old value. Continue propagating the new value to any computation nodes connected to the cell.
- Do something else (call a function) involving both the old and new value.
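As a tiny sketch of the ‘illegal unless it is benign’ policy (my own code, not from the lecture), with the equivalence test left as a parameter:

    ;; The "illegal unless it is benign" policy as a merge function.
    ;; The equivalence test is a parameter, covering the spectrum
    ;; from eq? to looser matches.

    (define (make-merge equivalent?)
      (lambda (old new)
        (cond ((eq? old 'nothing)    new)   ; empty cell: accept
              ((equivalent? old new) old)   ; benign re-write: keep old
              (else (error "Contradiction in cell:" old new)))))

    (define merge-strict (make-merge eq?))      ; one end of the spectrum
    (define merge-loose  (make-merge equal?))   ; a looser point on it

    (merge-loose '(1 2) (list 1 2))     ; => (1 2), benign
    ;; (merge-strict '(1 2) (list 1 2)) ; error: distinct objects under eq?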
This leads to this interesting idea: Rather than put values in the cells, we'll accumulate information about the values and put that in the cells. What sort of information? As the computation propagates through the network, we can tag each value and note where it came from and what computations were done to compute it. A value will come with a provenance that we can examine. Now suppose we find that we are about to assign a value to a cell that already has a value. We can look at the two provenances and see if there is reason to believe one more than the other. Or if the values agree, we can combine the provenances to show that the value is supported by multiple methods of computation.
Or perhaps you are a Bayesian and want each cell to represent a belief (probability distribution) about a value. Then the value we assign to the cell would be a distribution that has been updated by a new piece of evidence. Again, we could use the provenance idea to determine what evidence was important in determining a value.
Gerry and Alexey have a paper coming out soon called The Art of the Propagator. It goes much further than I did here and is much clearer, so I suggest reading it as soon as it appears.
Ok, keep that idea in the back of your mind while we revisit the problem of amb and backtracking.
Wednesday, January 21, 2009
Day 4, Part 1
I've got to stop thinking that I must complete, polish and edit before posting to a blog. The reason I started one is so I could publish un-finished things before I forgot them.
On Thursday we covered something quite a bit more interesting in Gerry's lecture. First, though, I'm going to discuss the AMB operator in more detail. Here's a simple puzzle that is trivially solved via appropriate use of AMB:

Baker, Cooper, Fletcher, Miller, and Smith live on different floors of an apartment house that contains only five floors. Baker does not live on the top floor. Cooper does not live on the bottom floor. Fletcher does not live on either the top or the bottom floor. Miller lives on a higher floor than does Cooper. Smith does not live on a floor adjacent to Fletcher's. Fletcher does not live on a floor adjacent to Cooper's. Where does everyone live?

    (define (multiple-dwelling)
      (let ((baker    (amb 1 2 3 4 5))
            (cooper   (amb 1 2 3 4 5))
            (fletcher (amb 1 2 3 4 5))
            (miller   (amb 1 2 3 4 5))
            (smith    (amb 1 2 3 4 5)))
        ;; They live on different floors.
        (require (distinct? (list baker cooper fletcher miller smith)))
        ;; Baker does not live on the top floor.
        (require (not (= baker 5)))
        ;; Cooper does not live on the bottom floor.
        (require (not (= cooper 1)))
        ;; Fletcher does not live on either the top
        ;; or the bottom floor.
        (require (not (= fletcher 5)))
        (require (not (= fletcher 1)))
        ;; Miller lives on a higher floor than does Cooper.
        (require (> miller cooper))
        ;; Smith does not live on a floor adjacent to Fletcher's.
        (require (not (= (abs (- smith fletcher)) 1)))
        ;; Fletcher does not live on a floor adjacent to Cooper's.
        (require (not (= (abs (- fletcher cooper)) 1)))
        (list (list 'baker baker)
              (list 'cooper cooper)
              (list 'fletcher fletcher)
              (list 'miller miller)
              (list 'smith smith))))

Even if this seems familiar to you, don't let your eyes glaze over, this is important.
There are a few ways to look at what the AMB operator does. The first is very abstract: AMB splits the interpreter into several copies, each copy attempts to proceed in parallel with one of the values, and if any interpreter computes an answer, we'll use it. This is the essence of non-deterministic programming: we try all the possibilities at the same time and arbitrarily pick a successful one.
Of course we don't implement the AMB operator that way. We certainly can't split a physical computer into several copies, and although we might be able to do that with virtual machines, the amount of memory and CPU usage would be way too high. Instead we look at the AMB operator in this way: AMB makes a snapshot of the state of the interpreter and then arbitrarily chooses one of the possible values to proceed with. If the computation doesn't fail, we can discard the snapshot, but if it does fail, we restore the snapshot and pick a different possible value. In this case, we are simulating non-deterministic programming by sequential iterative testing.
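The snapshot-and-restore story maps almost directly onto Scheme's first-class continuations. Here's a minimal amb in that style (the standard textbook construction; I don't claim it's Gerry's exact implementation): the captured continuation is the snapshot, and failure re-enters it with the next alternative.

    ;; Minimal amb via call/cc.  The continuation captured at each
    ;; choice point is the snapshot; *fail* restores the most recent
    ;; snapshot with the next value.

    (define *fail*
      (lambda () (error "Search space exhausted")))

    (define (amb-list alternatives)
      (call-with-current-continuation
       (lambda (k)                              ; k: the snapshot
         (let loop ((alts alternatives)
                    (prev-fail *fail*))
           (if (null? alts)
               (begin (set! *fail* prev-fail)   ; exhausted: back out
                      (prev-fail))
               (begin (set! *fail*
                            (lambda () (loop (cdr alts) prev-fail)))
                      (k (car alts))))))))      ; resume with this value

    (define (require p)
      (if (not p) (*fail*)))

    ;; Example: the first (x y) with x in {1..4}, y in {2,4,6}, x+y=7.
    (let* ((x (amb-list '(1 2 3 4)))
           (y (amb-list '(2 4 6))))
      (require (= (+ x y) 7))
      (list x y))                               ; => (1 6)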
The simulation differs from the abstract model in a few important ways; some of these differences (and especially the consequences of these differences) are not obvious. The first obvious benefit is that by sequentially testing solutions, we limit our peak resource use. The tradeoff here is that we have to serially search through the solution space, so we have traded space for time. This is obvious because it is the tradeoff we desired. A less obvious benefit is this: if we don't use every solution to the problem, there is no need to exhaustively test every possibility. If we stop at the first solution, we are likely to use less time than if we try to enumerate all the solutions. So by testing sequentially we have traded space for *less* time. Another less obvious benefit is that instead of having just `solutions' and `failures', we can have a continuum of `great solutions', `good solutions', `poor solutions', `terrible solutions', etc. We can now dynamically decide how much time we wish to trade for a solution of sufficient quality. We may have to wait a long time to find the best solution, but we find an acceptable solution quite quickly.
These tradeoffs depend directly on how we decide to order the sequential testing. Refer to the multiple-dwelling problem. If AMB is implemented by sequential left-to-right trials, then we'll first try every combination that puts Baker on the first floor. Once we've exhausted those, we try every combination that puts Baker on the second floor. Then we'll check the third, etc. It should be obvious that we can find a solution much faster if we can eliminate some of the problem constraints as quickly as possible. For example, we could trim the AMB forms to skip some combinations before we even run the program. Why does Fletcher have (amb 1 2 3 4 5) when we already know that possibilities 1 and 5 cannot lead to a solution? Another thing we can do is re-order the AMB clauses and evaluate the require clauses sooner. We know that Fletcher cannot live next to Cooper, so we can check that requirement before we even assign a floor to Miller and Smith and save ourselves a huge amount of time.
What's going on here? Our implementation of AMB has imposed an implicit order on the search through our possible solution space. By re-arranging the AMB clauses and testing the restrictions intelligently, we direct the code to search the more likely places in our solution space first and to avoid parts of our search space that we know don't contain valid solutions. We're pretty clever, eh?
Well, no, not really. Since we are the ones doing the re-ordering of clauses, the computer is making us do the hard work. Sure, it is reasonably easy for this problem, but suppose the search space were for a good chess move. Should we examine queen moves first, or knight moves? What requirements should be checked, and when? If we try a particular ordering, can we re-order the search?
In our implementation of AMB, every time we fail to meet a requirement, we backtrack to the most recent AMB and try the next option. This is called depth-first chronological backtracking. It naturally falls out of the implementation. We can change the implementation slightly. Rather than AMB making a snapshot of the world and proceeding along the first suggested path through the search space, we'll have AMB prepare a number of snapshots (one for each value) and place them in a queue. Then whenever we reach a failure point, we'll restart by popping a suggested search path off the queue. If you think about this a bit, you'll come to the conclusion that we can now direct the order in which we search the solution space by defining a policy on how we use the queue. If we use the queue as a LIFO stack, then we get depth-first chronological backtracking. If we use the queue as a FIFO, we get breadth-first backtracking. So now we can change how we traverse the possible solution space without re-ordering the code.
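Here's a sketch of that policy idea (my own toy code, with search paths represented as plain thunks rather than real interpreter snapshots): the only difference between depth-first and breadth-first is where newly spawned paths enter the queue.

    ;; Search paths as thunks in an explicit queue.  Each path
    ;; returns (answer . value) or (expand . more-paths).  The
    ;; queue policy alone decides the traversal order.

    (define (search choose-order initial-paths)
      (let loop ((queue initial-paths))
        (if (null? queue)
            'no-solution
            (let ((result ((car queue))))        ; run next pending path
              (if (eq? (car result) 'answer)
                  (cdr result)
                  (loop (choose-order (cdr result) (cdr queue))))))))

    (define (depth-first new pending) (append new pending))    ; LIFO
    (define (breadth-first new pending) (append pending new))  ; FIFO

    ;; Example: find x, y in {1, 2, 3} with x + y = 5.  Choosing x
    ;; spawns one sub-path per y.
    (define (choose-x x)
      (lambda ()
        (cons 'expand
              (map (lambda (y)
                     (lambda ()
                       (if (= (+ x y) 5)
                           (cons 'answer (list x y))
                           (cons 'expand '()))))   ; dead end
                   '(1 2 3)))))

    (search depth-first (map choose-x '(1 2 3)))   ; => (2 3)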
We're now a bit more clever, but how do we know which order is best? We don't. We can only guess. On the other hand, the computer has more information than we have. If we annotate how the choices are made by amb, we can make the backtracking smarter. We could, for example, assign a score to each potential path that is waiting in the queue and, when we backtrack, select the highest-scoring path. If we find that some path cannot satisfy a requirement, perhaps we can check the pending computations in the queue and simply remove the ones that would also follow the same path.
Now I'm going to abandon this line of thought for a little while. I'll return to it.
Next post: propagator networks.
Thursday, January 15, 2009
Again, we covered some more material that I'm passingly familiar with
and I think a lot of Lisp and Scheme hackers would find familiar as
well. The first thing was how to introduce implicit backtracking in a
combinator-based interpreter.
The first step is to curry eval.
    (define (eval expression environment)
      ((analyze expression) environment))

This allows us to separate the static analysis of the program from the running of the program with ‘real data’.
analyze is really a compiler. Now all the subcases of eval will emit procedures that take a single argument, the environment, and return the computed value.
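Here are a couple of the subcases in the familiar SICP style (a sketch, not Gerry's exact code): the syntactic dispatch happens once, in analyze, and the emitted procedures take only the runtime environment.

    ;; Two subcases of analyze, SICP style.  Syntactic work happens
    ;; once; the emitted procedures take only the runtime
    ;; environment (here, an association list).

    (define (analyze expression)
      (cond ((number? expression)
             (lambda (environment) expression))
            ((symbol? expression)
             (lambda (environment)
               (cond ((assq expression environment) => cdr)
                     (else (error "Unbound variable:" expression)))))
            ((and (pair? expression) (eq? (car expression) 'if))
             (let ((test? (analyze (cadr expression)))   ; analyzed once,
                   (then  (analyze (caddr expression)))  ; reused on every call
                   (other (analyze (cadddr expression))))
               (lambda (environment)
                 (if (test? environment)
                     (then environment)
                     (other environment)))))
            (else (error "Unknown expression type:" expression))))

    (define (eval expression environment)
      ((analyze expression) environment))

    (eval '(if x 1 2) '((x . #t)))   ; => 1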
Larceny uses something almost exactly like this in its interpreter. In fact, the Scheme interpreter I'm writing does this as well, although it is difficult to see that because of all the extra cruft to support tail recursion and first-class continuations.
By changing the structure of the interpreter combinators to continuation-passing style, it is reasonably easy to add the McCarthy AMB operator.
Gerry then went on to show that if you can grab the implicit continuation of the underlying language, you can embed AMB directly rather than layering it on in an isolated interpreter. He pointed out that Scheme was one of the few languages that supplied first-class continuations. (I did have to point out that by using the tricks in ‘Continuations from Generalized Stack Inspection’ you can embed first-class continuations in nearly any language, but of course the contortions you have to go through are pretty horrific.)
I'm not sure how much to write here. Since I've been steeping myself in this sort of thing for the past several years, most of this stuff seems fairly straightforward, if esoteric.
Wednesday, January 14, 2009
You say ‘ad hoc’ as if it were a bad thing.
The term ad hoc is a Latin phrase meaning ‘for this purpose’. In 1967, Christopher Strachey used the term to differentiate the kind of polymorphism where a function acts differently on different types from the kind of polymorphism where a generalized function can be parameterized by a range of types. The generic functions of CLOS are a good example of ad hoc polymorphism, and Java ‘generics’ are an example of parameterized polymorphism.
The whole point of generic functions is to allow you to treat objects of different types in a uniform, abstract manner. Generic arithmetic is the motivating example: you want to add two numbers and you really don't care whether they are floating point, integers, rationals, bignums, or whatever, just add the things, ok? Both parametric and ad hoc polymorphism allow you to do this.
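As a toy illustration of the ad hoc variety (my own example, dispatching on the runtime types of the arguments):

    ;; Toy ad hoc polymorphism: one operation, with behavior chosen
    ;; by the runtime types of the arguments.

    (define *add-methods* '())

    (define (define-add-method applicable? method)
      (set! *add-methods* (cons (cons applicable? method) *add-methods*)))

    (define (generic-add a b)
      (let loop ((methods *add-methods*))
        (cond ((null? methods) (error "No add method for:" a b))
              (((caar methods) a b) ((cdar methods) a b))
              (else (loop (cdr methods))))))

    (define-add-method (lambda (a b) (and (number? a) (number? b))) +)
    (define-add-method (lambda (a b) (and (string? a) (string? b)))
                       string-append)

    (generic-add 1 2)           ; => 3
    (generic-add "foo" "bar")   ; => "foobar"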
Ad hoc is often used to imply something beyond the Latin meaning of specific purpose. It can mean ‘jury-rigged’, ‘makeshift’, ‘spur of the moment’, or ‘not well thought out’. In Wadler and Blott's paper “How to make ad-hoc polymorphism less ad-hoc”, they state:
One widely accepted approach to parametric polymorphism is the Hindley/Milner type system... On the other hand, there is no widely accepted approach to ad hoc polymorphism, and so its name is doubly appropriate.
It is clear that Wadler and Blott are assuming a pejorative meaning.
But what is the problem with ad hoc polymorphism? The primary problem seems to be that unrestricted use of ad hoc polymorphism makes it impossible to fully type check at compile time and requires some sort of run time type dispatch.
Suffice it to say that I'm underwhelmed by this argument.
I have found that parametric polymorphism is a fine tool for the problems it is designed to solve, but ad hoc polymorphism can solve those problems, too. Furthermore, ad hoc polymorphism allows you to do things that are quite difficult to do with parametric polymorphism, for example, add methods to null or other built-in classes, or make universal constructors (compare make-instance to the plethora of Java ‘factory’ classes).
I like ad hoc polymorphism. It allows me to tailor a solution to exactly solve a specific problem, and so its name is singularly appropriate.
I'm afraid I have much less to report today. Gerry and Chris went
over techniques used for pattern matching, parsing, and term
rewriting. It is probably the case that the material was new for some
of the people in the class, but die-hard Lisp and Scheme hackers will
find this sort of thing to be quite familiar.
Gerry started with a combinatoric pattern matcher. The basic structure was that each sort of matching construct would return a procedure that performed the actual match. These procedures all have the same signature and return value conventions, so it is possible to mix and match them. Gerry used a single ‘success’ continuation and simply returned #f if the match failed. The higher-order matchers would propagate the #f from the lower order ones. Chris said he preferred to use both a ‘win’ and ‘lose’ continuation instead.
I almost forgot. Gerry made one of those comments that seem obvious at first, but actually have a lot of depth behind them. “Here we have a simple pattern matcher. Pattern matching is a generalization of equality.”
It's funny, but I've been programming for all these years and that connection never occurred to me. Now I know that I'm probably the only person on the planet that didn't notice this, and everyone is wondering how on earth I can get a programming job if I'm this obtuse, but I think this is a very interesting observation. It answers the question of why there are so many equality predicates in Lisp: consider one of the operands to be a pattern and the other to be an object to match against. The ‘pattern’ object doesn't have any way to indicate the intent of the match. Suppose I want to match any three-element list that contains the symbols (a b c), but you want to match the exact instance of a particular list that contains the symbols (a b c). The ‘pattern’ in these cases is identical, so we have to indicate the strictness of matching elsewhere — in the predicate. So I call ‘equal’ and you call ‘eq’.
That's a pretty nice answer, but now I have more questions. Why are EQ, EQL, EQUAL, EQUALP, etc. built-in operations, but a generalized pattern matcher is not built in? I use EQ a lot in my code and my mental model is that I'm testing for object identity. How about if I relax my mental model and think about my uses of EQ as special instances of pattern matching? Would I get any advantages? A generalized pattern matcher carries around a dictionary of pattern variables so you can match substructures against previously matched substructure. How come the equality predicates like EQUAL don't have this? What if they did? Some computer languages eschew extensional equality: in order to test equality, the objects in question have to have an ‘is equal’ method, otherwise you are out of luck. (Extensional equality means that I can take two arbitrary objects and determine their equality without calling any method on either object. Equality is an external property of the object. Intensional equality means that the object or the object class defines what it means for objects of that type to be considered equal. Equality is an internal property of the object.) If the language discourages or prohibits extensional equality, is that an artifact of discouraging or prohibiting extensional pattern matching in general? If so, isn't that a horrible error in design? If not, then why is EQ singled out?
I hear Otto asking “You eat a lot of acid, Miller, back in the hippie days?”
On an unrelated topic, Gerry and I argued a bit about floating point numbers. We violently agree that they are a good tool for certain use cases and a poor tool for others. He gave an example of what he thought was a problem, but I objected that his particular example was handled correctly by the IEEE spec. I've got what I think is a better example:
The odds against a meteor hitting your house are 182,138,880,000,000 to 1. (I read it online, so it must be true.) What is the probability of a meteor striking your house?

Answer 1: Given the odds against something, compute the odds for it by taking the reciprocal. Convert this to a probability with this formula: p = o/(o + 1)

    (odds->probability (/ 1.0 182138880000000))
    5.490315961095151e-15

Answer 2: Given the odds against something, compute the probability against it with p = o/(o + 1). The probability for it is simply 1 minus the probability against it.

    (- 1.0 (odds->probability 182138880000000))
    5.440092820663267e-15

Those answers differ in the second decimal place. Shouldn't they be the same? Which one is right? Or are they both wrong?
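For concreteness, odds->probability here is nothing more than the formula above:

    ;; p = o/(o + 1), where o is the odds for the event.
    (define (odds->probability o) (/ o (+ o 1)))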
What happened is this: floating point numbers are not evenly distributed, and there are a lot more of them down near zero than there are near 1.0. This means our answers will be a lot more accurate if we do our computations near zero rather than near one. (Since there are more floats near zero, if we have to round, we don't have to go very far to find a nearby float. But if we are up near 1.0, we have to go much, much further to find a nearby float.)
The actual answer is 1/182138880000001, and the closest floating point number to that is 5.4903159610951515e-15. The first method produced a much more accurate answer, but neither answer is exactly right.
Monday, January 12, 2009
Adventures in Advanced Symbolic Programming (part 1)
Gerry Sussman and Chris Hanson are giving a series of 5 whirlwind lectures on advanced symbolic programming. I'm sitting in and I figure I'll make some comments here (rather than distract from the class). My hope is that the lectures will be available on-line sometime.
The first part of today's lecture was about using generic functions. This was pretty straightforward stuff, but there were a couple of interesting details. First of all, the generic arithmetic was extended to handle functions: (f + g)(x) = f(x) + g(x). Then, with the addition of hyperreals to the numeric tower, it was possible to compute differentials. Finally, by adding symbolic quantities to the numeric tower, you could compute derivatives of functions.
The non-obvious thing to me was that you could compute derivatives symbolically by using symbolic hyperreals, and that you need to compute the differentials using a two-tuple. This is going to sound completely opaque to anyone reading this, so I'd better go into more detail.
Consider a rational number. One way of looking at it is as if it is simply a pair of integers with some weird rules for addition, subtraction, multiplication and division. Now consider a complex number. Again, it is simply a pair of numbers with some weird rules for addition, subtraction, multiplication and division. A hyperreal is the same sort of thing: a pair of numbers with some weird rules for addition, subtraction, multiplication and division.
Now suppose we re-define the primitive +, -, *, and /. We allow them to take symbols as arguments. When given a symbol, we return an s-expression result. Now if we have previously defined rationals as being a pair of numbers with weird rules for addition, subtraction, etc., we can get symbolic rationals automagically by allowing the primitives to take symbolic arguments. (The rules for rational arithmetic being based on the underlying arithmetic primitives.) With a couple of additional rules for constructing rationals from symbols, we're all set for symbolic rational arithmetic.
We can do a similar trick with complex numbers.
And we can do a similar trick with the hyperreals. But here's a more interesting trick: A hyperreal [a, b] has a standard part ‘a’ and an infinitesimal part ‘b’. So the symbolic hyperreal [x, 1] means ‘x + dx’. Now consider the function (lambda (a) (* a a a)), the cube function. If we apply it to the symbolic number x and the symbolic hyperreal [x, 1], subtract the two, f(x + dx) - f(x), and then extract the differential part, we have the derivative. And sure enough, if we evaluate this in the interpreter:
    (print-expression ((D (lambda (a) (* a a a))) 'x))
    => (* 3 (expt x 2))
Now that's cool.
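To make the two-tuple trick concrete, here's a toy numeric version (mine, far less general than the class's generic-arithmetic machinery): a hyperreal is just a pair of standard and infinitesimal parts, and multiplication drops the dx*dx term.

    ;; Toy differential arithmetic (my sketch, not the class code).
    ;; A hyperreal is [standard-part, infinitesimal-part]; multiply
    ;; by [a,b]*[c,d] = [ac, ad+bc], since dx*dx vanishes.

    (define (hyper std inf) (cons std inf))
    (define (std h) (car h))
    (define (inf h) (cdr h))

    (define (d+ x y) (hyper (+ (std x) (std y)) (+ (inf x) (inf y))))
    (define (d* x y)
      (hyper (* (std x) (std y))
             (+ (* (std x) (inf y)) (* (inf x) (std y)))))

    ;; D f at x: evaluate f at [x, 1], i.e. x + dx, and take the
    ;; differential part of the result.
    (define (D f) (lambda (x) (inf (f (hyper x 1)))))

    (define (cube a) (d* a (d* a a)))
    ((D cube) 4)    ; => 48, which is 3 * 4^2

Getting the symbolic answer (* 3 (expt x 2)) instead of 48 is then just the earlier trick: let the primitive +, -, and * accept symbols and return s-expressions.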
I have some more thoughts on this, but I think they can wait for later in the week.