## Monday, August 18, 2008

### Differential Benchmarking

I wanted to find out the fastest way to return multiple values in the Microsoft CLR. To measure this, I needed two pieces of code that ‘do the same thing’ except for the way values are returned. For a baseline, I used the Takeuchi function:
static int Tak (int x, int y, int z)
{
    return (!(y < x))
        ? z
        : Tak (Tak (x - 1, y, z),
               Tak (y - 1, z, x),
               Tak (z - 1, x, y));
}
Tak (18, 12, 6) makes 63609 function calls and 47706 subtractions.

This version uses an out variable to return the value:
static void TakOut (out int answer, int x, int y, int z)
{
    if (!(y < x)) {
        answer = z;
        return;
    }
    int x1;
    int y1;
    int z1;
    TakOut (out x1, x - 1, y, z);
    TakOut (out y1, y - 1, z, x);
    TakOut (out z1, z - 1, x, y);
    TakOut (out answer, x1, y1, z1);
}
But the version using an out variable has an additional argument. I wanted to separate the cost of passing an additional argument from the total cost of using an out parameter, so I wrote another version of Tak that took an extra dummy argument:
static int Tak1 (int dummy, int x, int y, int z)
{
    return (!(y < x))
        ? z
        : Tak1 (z, Tak1 (y, x - 1, y, z),
                Tak1 (x, y - 1, z, x),
                Tak1 (dummy, z - 1, x, y));
}
These run pretty quickly, so I put them in a loop that runs them ten times and measures the total time. I ran this loop a thousand times and recorded the results.

I didn't expect Tak1 to be substantially faster than Tak, so I did some investigating. Look at the graph for Tak: you can clearly see a change in performance (there are two vertical segments). I had forgotten that my machine was in power-save mode and was dynamically tuning the CPU clock speed. Somewhere in there the machine decided to hit the gas. I put the machine in performance mode to max out the clock and got a much more expected result.
In this graph, Tak is the normal Tak function. Tak1a and Tak1b are defined as follows:
static int Tak1a (int dummy, int x, int y, int z)
{
    return (!(y < x))
        ? z
        : Tak1a (dummy, Tak1a (dummy, x - 1, y, z),
                 Tak1a (dummy, y - 1, z, x),
                 Tak1a (dummy, z - 1, x, y));
}

static int Tak1b (int dummy, int x, int y, int z)
{
    return (!(y < x))
        ? z
        : Tak1b (z, Tak1b (y, x - 1, y, z),
                 Tak1b (x, y - 1, z, x),
                 Tak1b (dummy, z - 1, x, y));
}
We can see from the graph that it takes an additional 8-10 ns to process the additional argument, depending on whether we change the value or keep it the same. I'll speculate that this is because the processor can notice the re-use of the argument and avoid a register rename or something along those lines.

So what about passing an out variable? From this graph we can see that returning an answer in an out variable makes the code a factor of 2 slower. Using a ref variable is slightly worse, but that may be because we have to initialize the ref variable before we use it. In either case, that's quite a hit.

It isn't likely that returning an object from the heap will be fast, but we ought to actually measure it. This is easily done by simply declaring the return value to be of type object. The CLR will be forced to box the value on the heap.
Surprisingly, this isn't all that much slower than using an out or ref variable. What about returning a struct on the stack? It seems that this is quicker than using an out variable.

If you look closely at these benchmarks you will see that the baseline performance of Tak varies a bit from run to run. There could be any number of reasons for this. It doesn't matter all that much, because I was interested in the difference between the various mechanisms of returning values.

All the code above was run with these settings:
# 2 processors
#     GenuineIntel
#     Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
#     x64 Family 6 Model 15 Stepping 11
#     2401 MHz
# Working set:  9969664
# Overflow checking is disabled.
# Compiled in RELEASE mode.
# Debugger is not attached.
# Stopwatch is high resolution.  14318180 ticks per second.
#
# 2008-Aug-18 20:50:11
# Microsoft Windows NT 6.0.6001 Service Pack 1
# CLR 2.0.50727.3053
#
# See http://eval.apply.googlepages.com/ for further information.
#
# TakTest, Version=1.0.3152.23099, Culture=neutral, PublicKeyToken=null
# Debugging Flags: IgnoreSymbolStoreSequencePoints
# JIT Optimizer enabled
# JIT Tracking disabled
I've put the code online if you want to play with it.
I changed the search engine at the top of the page to search a whole bunch of Lisp oriented web sites in addition to the Hyperspec.

## Monday, August 11, 2008

Consider this client/server model of computing. The server is a general purpose virtual machine/runtime system. The client is the program which makes ‘requests’ to the server and receives responses. And let's be completely abstract here. The client is capable of nothing but making requests. The server does all the work. If the client needs temporary storage for a variable, it requests that the server allocate it. If it wants to load a value into that variable, it requests that the server load the location. The client isn't much more than an instruction stream, but keep the image of an active, but extraordinarily primitive client in your mind.

The server, on the other hand, is amazingly powerful compared to the client, but totally ignorant. It simply does not know what to do or what it is doing. It just mechanically follows the requests of the client.

This is simply a description of the Harvard architecture, but I want to emphasize that each component --- the program (client) and the processor (server) --- sees the other as a black box. And although the processor is usually responsible for fetching instructions, I've made the client responsible for feeding them to the processor.

A general purpose processing machine combined with a program gives you a special purpose machine. Your x86 processor is general purpose, but Excel turns it into a spreadsheet processor. We can consider a program as a ‘meta-operator’ that transforms one machine into another. Instead of making requests directly to the machine, we make them to the program/machine combination. The program makes requests of the underlying machine and arranges for the underlying machine to deliver the answer in a useful format.

If you have a program written in a language that your machine doesn't natively understand, you have two options to make it run. You can translate the program to a language the machine does understand, or you can translate the machine to a machine that understands the program. The first technique is compiling; the second is interpretation. An interpreter is a program that converts a machine that understands one language into a machine that understands another.

Interpretation almost always performs slower than compilation. The usual reason is given as ‘interpreter overhead’. But what is interpreter overhead? If we look at the model I described above, we have this situation if we have compiled our program:
          compiler
program ---------> program'

+--------------------
| program' | machine
+--------------------
But if it is interpreted, we have this situation:
+---------------------------------
|         +-----------------------
| program | interpreter | machine
|         +-----------------------
+---------------------------------
Obviously the extra layer introduced by interpretation is a problem. But in our model, the client doesn't do anything. The machine is ultimately the thing that takes action. The client is essentially passive. The only reason that things would take longer is because the machine is doing more work in the second situation.
+---------------------------------------
|         +-----------------------------
| program | interpreter | machine
|         |             |  + excess work
|         +-----------------------------
+---------------------------------------
The interpreter must be making more requests of the machine than the compiled program. (I never miss the obvious.) That's the interpreter overhead.

Let's turn this around. Now we take the point of view of the machine, and we are getting instructions fed to us. The picture is like this:
----------+
 program  | machine
----------+
or perhaps it is like this:
          +-------------+
----------+             |
 program  | interpreter | machine
----------+             |
          +-------------+
Another picture:
          +--------------+
----------+              |
 program  | interpreter  |   machine
----------+              |
          +--------------+
              |    |
          +--------------+
          | co processor |
          +--------------+
I've attached an ‘interpreter co-processor’ to the picture. What does it do? It handles all the ‘interpreter overhead’. If the interpreter needs some temporary storage to hold a variable, or if it has to maintain a lookup table to find subroutines, the co-processor is the machine that handles that. On the other hand, if the interpreter needs to perform an addition because the user's program calls ‘+’, this is given directly to the machine. The co-processor is only there to help the interpreter, not to perform any work requested by the program.

From the machine's point of view, the only way we can tell if two programs are different is if the series of instructions being fed to us is different. We don't really know, or even care what is going on outside. We're just following orders, whether they come directly from a program or via an interpreter (which, after all, is just another program as far as we're concerned), or whether the interpreter has a funny co-processor attached.

Suppose we have two programs, X and Y, that have only a small difference between them. They both feed the exact same instructions in the exact same order to the machine, but program Y uses the interpreter co-processor half as much as program X. Since the instruction stream to the machine is identical in both, it should be the case that both programs ‘do’ the same thing. The machine is the only component that can take action, and you've told it to take the same action in both cases. But program Y uses fewer resources than program X. By making fewer demands of the co-processor, perhaps our machine can be cheaper or run cooler. The instructions that are given to the co-processor are not essential to running the program.

Our machine is universal, so we really don't need to buy a separate co-processor. We can emulate it.
          +--------------+
----------+              |   +---------+
 program  | interpreter  |   | machine |
----------+              |   +--+---+--+
          +------+---+---+      |   |
                 |   |          |   |
          +------+---+----------+---+--+
          |    co processor emulator   |
          +----------------------------+
How does this differ from the picture that looked like this?
          +-------------+
----------+             |
 program  | interpreter | machine
----------+             |
          +-------------+
It doesn't, except that I can point at the instructions coming through the ‘emulated co-processor’ and identify them as ‘interpreter overhead’. And as I mentioned just above, those instructions aren't essential to running the program.

If we are not trying to ‘optimize’ the program, then a compiled program should be identical to an interpreted one, but without as much ‘interpreter overhead’. The overhead comes from the fact that we are emulating some of the behavior that the interpreter relies on. Compilation wins because we remove as much of the need for the emulation as we can. But some machines these days are pretty sophisticated. Do we need to emulate the desired behavior, or can we co-opt the machine's built-in behavior to approximate the desired behavior?

## Saturday, August 9, 2008

### Real life nastiness

Here's an example I stole from the support code for PLT Scheme. This function seems to check whether we've hit a symbol delimiter or comment.
int wxMediaStreamIn::IsDelim(char c)
{
  if (scheme_isspace((unsigned char)c))
    return 1;
  else if (c == '#') {
    long pos;
    char next[1];
    pos = f->Tell();
    f->Read(next, 1);
    if (next[0] == '|') {
      f->Seek(pos - 1);
      return 1;
    } else {
      f->Seek(pos);
      return 0;
    }
  } else if (c == ';') {
    long pos;
    pos = f->Tell();
    f->Seek(pos - 1);
    return 1;
  } else
    return 0;
}
This function might be skipping a token in the input stream. I'm not really sure.
void wxMediaStreamIn::SkipOne(int recur)
{
  char buf[1];

  if (recur) {
    buf[0] = '#';
  } else {
    SkipWhitespace(buf);
  }

  if (buf[0] == '#') {
    /* Byte string */
    if (f->Read(buf, 1) == 1) {
      if (buf[0] != '"') {
      } else {
        while (1) {
          if (f->Read(buf, 1) != 1) {
            break;
          }
          if (buf[0] == '"') {
            break;
          } else if (buf[0] == '\\') {
            if (f->Read(buf, 1) != 1) {
              break;
            }
          }
        }
      }
    } else {
    }
  } else if (buf[0] == '(') {
    /* List of byte strings */
    while (1) {
      do {
        if (f->Read(buf, 1) != 1) {
          break;
        }
      } while (!IsDelim(buf[0]));
      if (buf[0] == ')')
        break;
      else if (buf[0] == '#') {
        SkipOne(TRUE);
      } else {
        break;
      }
    }
  } else {
    /* Number */
    do {
      if (f->Read(buf, 1) != 1) {
        break;
      }
    } while (!IsDelim(buf[0]));
  }

  IncItemCount();
}
Now suppose we wanted to change this code by in-lining the calls to IsDelim. This is not a trivial task. In fact, the calls to BAD_PRINTF seem to indicate that the author of the code had no easy time getting to this point. The duplication of code blocks shows that some in-lining has already been done.

Now consider this code. It is part of a compiler. It establishes an escape continuation upon entry to a block, and if possible, optimizes the code if it can determine either that the continuation is not used or that the call to the continuation is at the same dynamic depth as the continuation.
(define (bind-escape-continuation variable-name binder-type)
  ;; Returns a procedure of two arguments: the enclosing block and a
  ;; receiver that compiles the body in the new block.
  (lambda (outer-block receiver)
    (let* ((block    (block/make outer-block '()))
           (cont-var (binding/variable
                      (variable/make&bind! block variable-name #f #f)))
           (result   (receiver block)))
      (cond ((and (non-local-exit? result)
                  (eq? (non-local-exit/variable result) cont-var)
                  (not (variable/referenced? cont-var (non-local-exit/body result))))
             (evil-message "Optimize:  Coalesce catch/throw" cont-var)
             (non-local-exit/body result))
            ((not (variable/referenced? cont-var result))
             (evil-message "Optimize:  Remove binding for non-local-exit" cont-var)
             result)
            (else (make binder-type
                    :block    block
                    :variable cont-var
                    :body     result))))))
And this function compiles a for statement. An escape continuation is necessary to handle calls to break and continue.
(defmethod (compile-expression (target <void-target>) block (statement <c-for>))
  (cons-expression
   block
   (compile-expression (make <effect-target> :parent target)
                       block (for/initialize statement))
   ((bind-escape-continuation 'break <compiled-bind-break>)
    block
    (lambda (break-block)
      (let* ((for-block (block/make break-block '()))
             (for-var   (binding/variable (variable/make&bind! block 'for #f #f))))
        (make-iteration
         (variable/name for-var)
         (make-sequence
          (list
           (make-conditional
            (compile-expression (make <predicate-target> :parent target)
                                for-block (for/predicate statement))
            ((bind-escape-continuation 'continue <compiled-bind-continue>)
             for-block
             (lambda (continue-block)
               (let ((continue-binding (block/lookup-name continue-block 'continue #f)))
                 ;; stow the step part.
                 (setf! (binding/value continue-binding)
                        (compile-expression (make <statement-target> :parent target)
                                            break-block (for/step statement)))
                 (cons-expression
                  continue-block
                  (compile-expression (make <statement-target> :parent target)
                                      continue-block (for/body statement))
                  (make <non-local-exit>
                    :block continue-block
                    :variable (binding/variable continue-binding)
                    :body (binding/value continue-binding))))))
            (make <non-local-exit>
              :block for-block
              :variable (binding/variable
                         (block/lookup-name break-block 'break #f))
              :body #f))
           (make <compiled-combination>
             :operator (make <compiled-reference> :variable for-var)
             :operands '())))))))))
Inlining the call to bind-escape-continuation is trivial. (Try it.) This isn't due to the problem domain, unless character-at-a-time parsing is that much more difficult than flow control in a compiler.

One reason I chose Lisp or Scheme for experimental code is that I can move things around much more easily than in other languages. When I tinker with code I often decide to pull a chunk out, combine it with another, change my mind, put it back, change something to a function, move a binding, etc. Lisp simply does not get in my way as much as other languages.

### Done testing?

The previous post looked good to me on Blogger and under Google reader. Anyone have problems with it?

### Another test.

Some of the posts coming through a feed seem to be completely screwed up with this new stylesheet. I changed the stylesheet tag to use <pre> instead of a custom class, and this seems to work on the blog itself. But when I went to the reader, one of the posts looked ok, but another looked wrong. I'm encouraged by the first, but maybe I'm being fooled by some caching that's going on somewhere. I'm going to make a few more test posts. Whether it works or fails, I'd like to know. (If you have hints, that'd be nice, too.)
(define (iter tree))
(car efficiently
;; wite-char #\space)))
(if (> (cdr entrees)
(displambda (term)
(scand (pair? tree)
(if (leaf-node? thing)
(n (right-child tree)))
(n (right-child tree tree)
(if (leaf-node? tree tree)
(probabilith the assumption that all-terms))))))

(for-each-list-points terms an intep trees)
(begin
(define (do-term term)
(display b)
(let ((best-a alpha beta m N)
(do ((i 0 (+ i 1))) (if (/ (* *alpha* (gamma-point dp)
(gamma-ratio num 0)
(if (> rl (d (right-child right-child)
(do-it level tree)
(set ((pi_k))
(num den)
;; Compute gamma (n tree))
(hash-tree (cluster-data-points terms tree)
(d (left-child tree)))

Self-learning methodologies are particularly key when it comes to replicated theory. While conventional wisdom states that this issue is always surmounted by the understanding of congestion control, we believe that a different solution is necessary. Despite the fact that conventional wisdom states that this problem is always surmounted by the refinement of the UNIVAC computer, we believe that a different approach is necessary. We emphasize that CARPER requests XML, without investigating DNS. indeed, public-private key pairs and Moore's Law have a long history of connecting in this manner.

### A test post.

I've gotten sick of the way blogger is showing my code. I'm playing with the stylesheet to see if it can be improved.
void Run ()
{
    // The etc argument isn't used.
    // We want EvalStep to be an expression rather
    // than a statement so we chain the return value
    // around in case we think of a use in the
    // future.
    object etc = null;
    while (true)
        etc = this.expression.EvalStep (this, etc);
}
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
What happens if I try to do in C# what I did in Lisp a few days ago?

Here are the functions we'll need to compute our probabilities in C#:
double PH1 ()
{
    return Utility.Product<string> (this.TermWeight, Globals.AllTerms);
}

double D ()
{
    return Globals.TuningAlpha * Utility.Gamma (N ())
         + leftChild.D () * rightChild.D ();
}

double Pi ()
{
    return Globals.TuningAlpha * Utility.Gamma (N ()) / D ();
}

double P ()
{
    double pi_k = Pi ();
    return pi_k * PH1 () + ((1.0 - pi_k) * leftChild.P () * rightChild.P ());
}

double R ()
{
    return Pi () * PH1 () / P ();
}

double ProbabilityToOdds (double p)
{
    return p / (1.0 - p);
}

double ROdds ()
{
    return ProbabilityToOdds (R ());
}
The marginal probability of clustering is given by function R, and the marginal odds of clustering is given by function ROdds.

In the Scheme version, I proceeded to convert the program to odds space by substituting the definitions of R, probability->odds, Pi, etc., and then rewriting and simplifying the resulting code. This works because in Scheme the name of a function and the function itself are interchangeable.

Let's see how this works in C#. First let's try ProbabilityToOdds. In Scheme, we have this:
(define probability->odds
  (lambda (p) (/ p (- 1.0 p))))

(define r-odds
  (lambda (cluster)
    (probability->odds (R cluster))))
And we simply replace the name with the definition:
(define r-odds
  (lambda (cluster)
    ((lambda (p) (/ p (- 1.0 p))) (R cluster))))
But in C#, what exactly is ProbabilityToOdds defined as?
ProbabilityToOdds = ????
It's a static method, not a first-class object. But C# has delegate objects that can be used to act as if they were first-class methods. We can do this:
Func<double,double> P2O = ProbabilityToOdds;
And now we have a delegate for ProbabilityToOdds. The delegate is a first-class object. We can use it just like we used ProbabilityToOdds. For example
static double ProbabilityToOdds (double p)
{
    return p / (1.0 - p);
}

static Func<double,double> P2O = ProbabilityToOdds;

double ROdds ()
{
    return P2O (R ());
}
That didn't get us very far. We sort of want to go in the reverse direction. We want to make an unnamed delegate. You can do this in C#.
static Func<double,double> P2O =
    delegate (double p)
    {
        return p / (1.0 - p);
    };

double ROdds ()
{
    return P2O (R ());
}
This is a little weird because it seems that the delegate doesn't have a declared return type. It gets weirder. Let's substitute P2O with the delegate.
double ROdds ()
{
    return delegate (double p)
    {
        return p / (1.0 - p);
    } (R ());
}

Compiler error: “Method name expected.”
It seems that the compiler is playing some tricks on us. Anonymous delegates weren't originally in the language, and they aren't directly supported by the runtime. However, the compiler understands what it is we're trying to accomplish and arranges to create a good simulation. At least in the first case. We confused it with the second. With a couple of hints to the compiler, we can say what we mean:
double ROdds ()
{
    return ((Func<double,double>)
            delegate (double p)
            {
                return p / (1.0 - p);
            }) (R ());
}
We can substitute R, too.
double ROdds1 ()
{
    return ((Func<double, double>)
            delegate (double p)
            {
                return p / (1.0 - p);
            }) (((Func<double>) delegate ()
                 {
                     return Pi () * PH1 () / P ();
                 }) ());
}
We'd better simplify this. It's getting a little messy. That second function takes no arguments and is being applied to an empty argument list. We can η-reduce it.
double ROdds1 ()
{
    return ((Func<double, double>)
            delegate (double p)
            {
                return p / (1.0 - p);
            }) (Pi () * PH1 () / P ());
}
It might not be immediately clear what I just did, so I'll re-set this.
double ROdds1 ()
{
    return ((Func<double, double>)
            delegate (double p)
            {
                return p / (1.0 - p);
            }) (((Func<double>) delegate ()
                 {
                     return Pi () * PH1 () / P ();
                 }) ());
}
η-reduction is removing the function call apparatus (the delegate cast and its application) and using the computed argument expression directly:
double ROdds1 ()
{
    return ((Func<double, double>)
            delegate (double p)
            {
                return p / (1.0 - p);
            }) (Pi () * PH1 () / P ());
}
When invoking a function, we compute the values of the operands, bind them to the formal parameters, and evaluate the body. In essence, we do this:
double ROdds1 ()
{
    return ((Func<double>)
            delegate ()
            {
                double p = Pi () * PH1 () / P ();
                return p / (1.0 - p);
            }) ();
}
Now we can η-reduce, or we could if this were syntactically allowed:
double ROdds1 ()
{
    return
        double p = Pi () * PH1 () / P ();
        p / (1.0 - p);
}
I removed the function call apparatus just as I did before, but this time it didn't work. The body of a function in C# consists of a series of statements, but a call invoking a function is an expression. You can't use a statement where an expression is expected. Then why did it work before? There was a single statement in the function body, and it was a return statement. When we removed the return, we serendipitously converted the return statement to an expression. Since there were no preceding statements, we got away with it. In the current case, however, we have a problem.

We'll have to move any statements that were in the body of our delegate to a context where statements are permitted. With luck, there will be a context close enough that it won't change the semantics of the code. As it turns out, we're in luck this time.
double ROdds1 ()
{
    double p = Pi () * PH1 () / P ();
    return

        p / (1.0 - p);
}
We were lucky because the context of the call to the delegate was the first subexpression of a statement and we didn't have to move the code very far. But if it were something more complicated, like, say, this:
for (int i = 0; i < ((Func<int>)
                     delegate ()
                     {
                         object x = foo ();
                         return bar (x, x);
                     }) (); i++)
We're going to have more problems. The nearest place to hoist the binding of x is outside of the loop, and that will likely change the semantics.

There is a hack to work around this limitation, but it is really very nasty. In C#, as in C, assignment is an expression rather than a statement. We can separate the binding of p from computing the value if we can figure out how to sequence the expressions.
double ROdds1 ()
{
    double p;
    return
        First
            p = Pi () * PH1 () / P ()
        then
            p / (1.0 - p);
}
In C and C++, you can sequence expressions with a comma. This is one of those things people use all the time in for statements, but forget that it can be used elsewhere as well. You would write this:
double ROdds1 ()
{
    double p;
    // In C or C++ this will work.
    // NOT IN JAVA OR C#
    return (
        p = Pi () * PH1 () / P (),
        p / (1.0 - p));
}
Unfortunately, Java and C# have overlooked the comma expression. Fortunately, they did not overlook the conditional expression. We can abuse it to force our sequencing because the predicate in the conditional is required to be evaluated first.
double ROdds1 ()
{
    double p;
    return ((p = Pi () * PH1 () / P ()) == 0.0 || true)
        ? p / (1.0 - p)
        : 0.0;
}
The assignment is put in the predicate as part of boolean expression that is always true, then the second expression is put in the true branch of the conditional. The false branch is a dummy value of the correct type.

I did say it was very nasty, but sometimes programming is a dirty job.

Obviously, we don't need to go this far for the problem at hand. This:
double ROdds1 ()
{
    double p = Pi () * PH1 () / P ();
    return p / (1.0 - p);
}
will do just fine.

Ok. This is ridiculous. People don't do this sort of stuff when they hack C, C#, or Java. The reason is obvious: it's far too hard and confusing, even in this toy example. It would be insane to try to do this in real-life code. Instead of just cutting and pasting code from here to there, we'd look at the destination context, reason about how the return value is going to be used and how it could best be computed in the new context, and then rewrite the code at the destination to have the desired effect.

More to follow...

## Thursday, August 7, 2008

### In progress

I'm working on another blog entry, but as part of the task I have ported the simple Bayesian Hierarchical Clustering algorithm (the low-performance version) to C#. It is available here: http://code.google.com/p/jrm-code-project/source/browse/ under the BHCTest directory.

## Tuesday, August 5, 2008

The final trick to improving the performance is a change in how the basic clustering routine works. Heller and Ghahramani present this pseudocode:

initialize: number of clusters c = n, and Di = {x(i)} for i = 1 ... n
while c > 1 do
    Find the pair Di and Dj with the highest
    probability of the merged hypothesis:

              πk p(Dk|H1k)
        rk = --------------
                p(Dk|Tk)

    Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
    Delete Di and Dj, c ← c - 1
end while
The naive implementation would examine every pair of clusters to find the one with the best score. On the next iteration, it would do it all again. This adds up quickly. Keeping these scores in a table would make it much quicker, but then the table size becomes the issue.

But we're on the right track. The problem with a table, besides the size, is that most of the entries are useless. If clusters A and B score well together, we certainly don't care how poorly clusters A and Q or A and Z or A and J etc. score. So we'll keep a smaller table that has a single entry for each cluster that is the cluster that it scores best against. This table will start with the same number of entries as we have leaf nodes, and it will gradually grow smaller as we create bigger and bigger clusters.

Now the issue becomes how to cleverly maintain that table. First, assume the steady state: each entry in the table has a key, which is a cluster, and a value, which is another cluster that produces the best score when the two are combined. This will be our invariant. When we add a new cluster to the table, we have to do the following:
1. Find the other cluster that is the best match, and
2. Update any cluster that would get a better score with the new entry than with the best it already had.

These operations can proceed in a single pass. We walk the table and compute the score of merging each existing key with the new entry. We track the best we find as we walk the table. At each entry, however, we also check whether that score exceeds the score of the entry itself. If it does, the new cluster becomes the best match for that entry; otherwise we leave the entry alone.

We start by putting 2 leaf nodes in the table, each of which matches the other as the best combination. Then we incrementally add each leaf node of our starting set. Once all the leaf nodes are added, we know the best match for each leaf.

After initializing the table, we'll iterate by selecting the best match in the table, deleting the matching elements from the table, actually performing the merge, and then inserting the merged cluster back into the table. Deleting the elements is a little tricky. Obviously, we will remove the two entries that are keyed by those elements, but it may be the case that other entries would be the best match against the ones we are deleting. We have to fix these up if we are to maintain the invariant. An easy way to maintain the invariant is to simply remove and then re-insert any entry that matches one of the two elements we are deleting.

This works pretty well, but we end up rescoring a lot of entries. Since each rescore involves scanning the entire table, that still adds up to quite a bit of work.

I'm somewhat lazy and I don't want to rescore entries unless I really need to. When an entry in the table has to be rescored because its best-matching cluster is being deleted, there are only two things that could happen: either the new cluster about to be added will score even better than the one we just deleted, in which case we'll be paired with that one when it gets entered, or whatever we end up paired with will have an equal or worse score than what we had.

We'll change the invariant in this way: each entry either contains the actual best cluster that the entry's key should merge with, or it contains an upper bound on the best score the key could have if we knew what it was. So rather than rescore an entry when it becomes invalid, we simply drop the matching element and retain the old score.

Now when we insert an element into the table, we still have to match it against the remaining elements. If an entry has a deferred score, and we equal or exceed that score when the new cluster is merged with the entry, we can simply replace the entry and update the score. If not, we retain the deferred score.

When we go to find the best entry in the table, we use the deferred scores as upper bound estimates of the actual score for that entry. If the best entry exceeds all the deferred scores, we can use it without determining the real match of the deferred scores. However, if we find some deferred scores that have an upper bound that exceeds the best we find in the table, then we really need to find out the actual best match for the deferred score. We do this by finally scanning the table for that entry.

Recall that each time we iterate, we make the table smaller by one. (We delete two nodes and add a single combined one). The longer we defer scoring an entry, the smaller the table gets and the fewer entries need to be scanned. This reduces the number of scoring calculations that we have to do.

Finally, we can use a little bit of knowledge about the kind of things we want to merge. In the problem I'm working on, each data point has only a handful of terms from a selection of thousands. I *know* that the data points won't score well against others unless they have common terms. By keeping a table that maps terms to clusters, I can skip computing scores for matches that simply won't happen. On the other hand, some data points are duplicates of others and obviously should be grouped together. There is no need to search for matches and invoke the full clustering machinery. We'll just take the duplicates and form initial clusters out of those before even populating the table.
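Both shortcuts can be sketched in Python (the data-point representation is a stand-in: each point is reduced to just its set of terms):

```python
from collections import defaultdict

def candidate_pairs(points):
    """points maps a point id to its set of terms. Points can only score
    well against points sharing at least one term, so only those pairs
    are ever considered."""
    by_term = defaultdict(set)
    for p, terms in points.items():
        for t in terms:
            by_term[t].add(p)
    pairs = set()
    for members in by_term.values():
        ms = sorted(members)
        for i in range(len(ms)):
            for j in range(i + 1, len(ms)):
                pairs.add((ms[i], ms[j]))
    return pairs

def group_duplicates(points):
    """Points with identical term sets are grouped into initial clusters
    before the table is even populated."""
    groups = defaultdict(list)
    for p, terms in points.items():
        groups[frozenset(terms)].append(p)
    return list(groups.values())
```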

So how much does all this help? I can now cluster 20000 data points by examining only 25 million possible clusters. This takes several minutes, but it's a long way from the 20 or so data points we started with.

## Monday, August 4, 2008

### And on a completely different subject

In addition to the Bayesian Hierarchical clustering, I have been playing with my port of MIT Scheme. The code is at http://code.google.com/p/jrm-code-project/source/checkout. Browse down to trunk/MIT-Scheme.

This is a port/rewrite of the MIT CScheme interpreter in C#. It can load and run both the MIT Scheme source code and the syntaxed "bin" files of MIT Scheme. Although it is far from complete, it can cold load MIT Scheme, get to a prompt, and start evaluating expressions. Rather than using a huge switch statement like CScheme, this interpreter uses virtual method dispatch to direct evaluation. Tail recursion is achieved via a trampoline. Memory is managed by the CLR runtime.

Why did I do this? I wanted to find out a few things.
• Is virtual method dispatch fast enough to base an interpreter on?
• Is the CLR really unsuitable for dynamic languages, or is that a myth?
• Can MIT Scheme benefit from a more sophisticated GC like in the CLR?
• What tricks and techniques can we use to improve the performance of the interpreter itself?

### Dynamic Programming

At an interview I had a few years back I was given a question on Dynamic Programming. Fortunately I had written a diff program recently and was familiar with the term. Although the term has a broad meaning, the interviewer was looking for a particular approach to a particular problem. Let me outline that.

Given a sequence of objects, a subsequence is derived by deleting zero or more objects from the sequence. If our sequence is (a b c d e), one subsequence is (a b e), another is (b d). Given two sequences there exists at least one and possibly several sequences that are common to both. The longest common subsequence is of interest in programs like diff.

There is an obvious recursive way to find the length of the longest common subsequence:
(define (lcs seq1 seq2)
  (cond ((or (null? seq1) (null? seq2)) 0)
        ((eq? (car seq1) (car seq2))
         (+ 1 (lcs (cdr seq1) (cdr seq2))))
        (else (let ((t1 (lcs seq1 (cdr seq2)))
                    (t2 (lcs (cdr seq1) seq2)))
                (if (> t1 t2)
                    t1
                    t2)))))
Informally, we look at the head of each sequence. If the heads match, then that element is common. If not, then either we should drop the head of the first or we should drop the head of the second. We try both and pick the longer one.

This is not a very efficient algorithm; there is a lot of repeated computation. To save time and computation, we have a trick. We create an N-by-M array to hold the intermediate results of the recursive calls. Along the top of the array we put sequence 1 in reverse order. Along the left side we put sequence 2 in reverse order. Now we can fill out the cells in the array. In the first row and the first column we put zero; this corresponds to the first clause in our conditional. In the remaining cells we do this: if the object at the top of the column matches the object at the left of the row, we add 1 to the value in the cell diagonally up and to the left of us. Otherwise, we take the maximum of the cell immediately to the left and the cell immediately above. We fill the array from left to right, top to bottom, until we get to the final cell, which now contains the answer.
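In Python, that table-filling version looks something like this (cell [i][j] covers the last i elements of one sequence and the last j elements of the other, which is the same thing as writing the reversed sequences along the edges):

```python
def lcs_table(seq1, seq2):
    # t[i][j] is the LCS length of the last i elements of seq1
    # and the last j elements of seq2.
    n, m = len(seq1), len(seq2)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if seq1[n - i] == seq2[m - j]:
                # Heads match: one more than the diagonal neighbor.
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                # Drop the head of one sequence or the other.
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[n][m]
```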

This is a cute trick, but it is unnecessarily complicated. If you instead take the simple, obvious recursive solution and just memoize the results as they are computed, the result will be computed in the same order and with the same performance improvement as with generating an explicit table. In the interview I had, the interviewer didn't know that and was expecting me to write out a table of intermediate results. He kept hinting at it until I had to tell him that simply adding memoization to the recursive solution would yield the same answer without having to reconfigure the problem to work from bottom up in a table.
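For comparison, here is the simple recursion with memoization bolted on (a Python sketch): it computes the same intermediate results, in the same effective order, as the explicit table.

```python
from functools import lru_cache

def lcs(seq1, seq2):
    # The obvious recursion, memoized on suffix positions. The memo
    # table ends up holding the same intermediate results as the
    # explicit bottom-up array.
    s1, s2 = tuple(seq1), tuple(seq2)

    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(s1) or j == len(s2):
            return 0
        if s1[i] == s2[j]:
            return 1 + go(i + 1, j + 1)
        return max(go(i, j + 1), go(i + 1, j))

    return go(0, 0)
```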

The next speedup to the Bayesian Hierarchical Clustering comes from this exact trick. When we compute the values of ph1-log and px-log for a subtree, we simply memoize them in the tree structure itself. The next time we need the value we can simply fetch it from the stored location. We can make this even more efficient by computing the value at the time of tree creation so we don't have to check if the value is memoized — it will always be present in the tree structure. If we change the code in this way, we'll see something amusing. Although conceptually we are computing the values top-down as we build the tree, we are, through memoization, computing the values in bottom-up order.
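A sketch of computing the value at tree-creation time, in Python (the class names and leaf scores here are made up; `log_sum` is a transcription of the log-sum procedure from the earlier conversion post):

```python
import math

def log_sum(left, right):
    # 10*log10(10^(left/10) + 10^(right/10)), as in log-sum.
    if right > left:
        left, right = right, left
    return left + 10 * math.log10(1.0 + 10 ** ((right - left) / 10))

class Leaf:
    def __init__(self, px_log):
        self.px_log = px_log          # known at creation time

class Merge:
    def __init__(self, ph1a_log, left, right):
        self.left, self.right = left, right
        # The children are complete trees, so their px_log slots are
        # already filled in: bottom-up evaluation order falls out of
        # top-down construction for free.
        self.px_log = log_sum(ph1a_log, left.px_log + right.px_log)
```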

Memoization also helps significantly for log-bernoulli-beta. Because we are building our clusters from bottom up, a significant amount of time is spent checking small clusters and leaf nodes. This means that the two final arguments to log-bernoulli-beta are usually small integers. Rather than compute the values on demand, it is much quicker to pre-compute the values for small integers and simply fetch them from the table.
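A sketch of that precomputation in Python (the prior parameters `A` and `B` and the table size are made-up values; `log_bernoulli_beta` is the log-gamma-based definition from the earlier conversion post):

```python
import math

A, B = 1.0, 1.0      # hypothetical beta prior parameters
LIMIT = 64           # precompute for counts below this bound

def log_bernoulli_beta(a, b, present, absent):
    # 10*log10 of Beta(a + present, b + absent) / Beta(a, b),
    # via sums and differences of log-gamma.
    lg = math.lgamma
    nat = ((lg(a + present) + lg(b + absent) - lg(a + b + present + absent))
           - (lg(a) + lg(b) - lg(a + b)))
    return 10 * nat / math.log(10)

# Pay the cost once, up front, for the small arguments that dominate.
TABLE = [[log_bernoulli_beta(A, B, p, q) for q in range(LIMIT)]
         for p in range(LIMIT)]

def log_bernoulli_beta_fast(present, absent):
    if present < LIMIT and absent < LIMIT:
        return TABLE[present][absent]
    return log_bernoulli_beta(A, B, present, absent)
```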

These improvements to the code add about another factor of ten, so we can now cluster about one or two thousand data points in a reasonable amount of time. This turned out to be not quite enough for my problem and I needed to squeeze out some more performance. That is in the next post.

Last time I converted the original Bayesian Hierarchical Clustering code to work in odds space rather than in probability space. This was the first part of converting the entire program to work in log-odds space. Converting to log space is somewhat similar to the previous conversion, but there are a couple of differences. In converting to odds space, we mostly had a hairball of simple algebra to untangle. When we convert to log space we'll have to convert more of the program and pay attention to the operators.

First we convert the top-level function r-odds:

(define (r-log-odds cluster)
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (10log10 (/ (ph1a cluster)
                  (* (px (left-child cluster))
                     (px (right-child cluster)))))))

Recall that the log of a ratio is the difference of the logs of the numerator and denominator and the log of a product is the sum of the logs of the multiplicands. We can rewrite it like this:

(define (r-log-odds cluster)
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (- (10log10 (ph1a cluster))
         (+ (10log10 (px (left-child cluster)))
            (10log10 (px (right-child cluster)))))))

And now we'll write log versions of ph1a and px, so we have:

(define (r-log-odds cluster)
  (if (leaf-node? cluster)
      (error "Odds are infinite.")
      (- (ph1a-log cluster)
         (+ (px-log (left-child cluster))
            (px-log (right-child cluster))))))

(define (px-log cluster)
  (if (leaf-node? cluster)
      (ph1a-log cluster)
      (10log10 (+ (ph1a cluster)
                  (* (px (left-child cluster))
                     (px (right-child cluster)))))))

Unfortunately, the log of a sum doesn't have an easy transformation. On the other hand, it isn't all that hard, either.

(define (log-sum left right)
  (if (> right left)
      (log-sum right left)
      (+ left (10log10 (+ 1.0 (inverse-10log10 (- right left)))))))

If the difference between left and right is small enough, taking the inverse log (exponentiating) won't be much of an issue, but if the difference gets big, this could cause us some problems. For the moment, we'll just leave it and hope we don't have to revisit it later.
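Here is the same log-sum transcribed into Python (in the post's 10·log10 units), which also shows why a big difference turns out to be harmless rather than catastrophic:

```python
import math

def log_sum(left, right):
    # 10*log10(10^(left/10) + 10^(right/10)): only the *difference* is
    # ever exponentiated, so the magnitudes never leave log space.
    if right > left:
        left, right = right, left
    return left + 10 * math.log10(1.0 + 10 ** ((right - left) / 10))
```

When the difference is large, the correction term rounds to log(1) = 0 and log-sum simply returns the larger argument; we lose the negligible contribution of the smaller term rather than overflowing.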

(define (px-log cluster)
  (if (leaf-node? cluster)
      (ph1a-log cluster)
      (log-sum (ph1a-log cluster)
               (+ (px-log (left-child cluster))
                  (px-log (right-child cluster))))))

(define (ph1a-log cluster)
  (+ (10log10 *alpha*)
     (log-gamma (n cluster))
     (ph1-log cluster)))

We're hardly the first to want to compute in log odds space. The log-gamma function is fairly well-known and there are good approximations to it.

(define (ph1-log cluster)
  (let ((data-points (cluster-data-points cluster)))
    (sum (lambda (term)
           (let* ((present (count-occurrances term data-points))
                  (absent (- (length data-points) present)))
             (log-bernoulli-beta A B present absent)))
         all-terms)))

Ph1-log is like ph1, but we take the sum over the terms rather than the product. Log-bernoulli-beta is trivially defined as sums and differences of the appropriate log-gamma calls.
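Assuming bernoulli-beta is the usual beta-function ratio Beta(a + present, b + absent) / Beta(a, b), the log version is a Python sketch over log-gamma:

```python
import math

def log_bernoulli_beta(a, b, present, absent):
    # 10*log10 of Beta(a + present, b + absent) / Beta(a, b),
    # where Beta(x, y) = gamma(x) * gamma(y) / gamma(x + y).
    lg = math.lgamma
    nat = ((lg(a + present) + lg(b + absent) - lg(a + b + present + absent))
           - (lg(a) + lg(b) - lg(a + b)))
    return 10 * nat / math.log(10)
```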

This part of the conversion was incredibly easy, but I did gloss over a couple of interesting things. I rather cavalierly propagated the log function through conditionals:

(log (if <predicate>               (if <predicate>
         <consequence>      =>         (log <consequence>)
         <alternative>))               (log <alternative>))

That turns out to be a legal transformation, but it would be nice to understand why. Suppose instead of a conditional, I had a different operation that combined two functions.

(log (op f1 f2))  <=!=>  (op (log f1) (log f2)) **WRONG

I cannot simply rewrite this as (op (log f1) (log f2)). If op were *, this would be clearly wrong. What I need to do is find an operation in log-space that has the effect of op after exponentiation.

(log (op f1 f2))  <===>  ((logequiv op) (log f1) (log f2))

(logequiv *) = +
(logequiv /) = -
(logequiv +) = log-sum

(logequiv if) = ????

We know that * becomes +, / becomes -, and + becomes log-sum, but we still need to find the log-space equivalent of if.

(logequiv if) = if

It turns out that if simply stays if: a conditional merely selects one of its subexpressions as its value, so taking the log of the result is the same as taking the log of whichever branch was selected. The transformation above is therefore legal.

With the change from exact probabilities to floating-point operations in log space, we get about a factor of 10 improvement in performance. Now we can cluster hundreds of data points rather than only dozens. There are two more substantial speedups to apply.