Abstract Heresies

...continued. I forgot something in the last post. I wanted to illustrate this process in code and I didn't write the code for the first equation. Here it is:


  (define (ph1 cluster)
    (let ((data-points (cluster-data-points cluster)))
      (product (lambda (term)
                 (let ((present (count-occurrances term data-points))
         (absent (- (length data-points) present))
                   (bernoulli-beta A B present absent))))
                all-terms)))

.
We assume that clusters contain a number of data points. For each term that could appear in a data point, we count how many times it actually appears and how many times it is absent. We compute the probability of the observed term given the model parameters A and B, and we take the product across all the terms.

Why Lisp? Or in this case, why Scheme? Lisp is the best language for doing something when you don't know what you are doing. There are three features I'm depending upon here.

Lisp and Scheme dispense with most of the minutae and bookkeeping involved in making the code run. I don't have to declare anything before I can use it, I can refer to routines I haven't written yet, I can run partially written code.
Self-descriptive data. All data in Lisp and Scheme have explicit tags. This makes it easy to poke at data structures and the partial results of computations to find out what it looks like and see if things sort of make sense.
Mixing compiled and interpreted code. Once I know something is working write, I can compile the know good code and make it fast. Or I can reload the interpreted code if I need to re-examine what's going on. Interpreted code and compiled code mix seamlessly. MIT Scheme has a powerful debugger that keeps track of what source code corresponds to compiled sections, and what stack positions correspond to what variables so it is easy to figure out what is going on. The MIT Scheme compiler generates very fast code and the sophisticated linking system can cache variable and function references so most function calls can avoid indirect calls through a function cell. (Did I mention it was fast?)

The next few paragraphs of Heller and Ghahramani discuss how to determine the alternative to the hypothesis that a selection of data points belong to a cluster. Rather than enumerating all the possible partitions of the data, however, they give an efficient closed-form equation for the marginal probability that the data belongs to a ‘tree consistent’ subset of the partitions. This is a key point to the paper. Unfortunately, this key point is a bit obscure and it took me a while to puzzle it out.

The equation is recursive and the authors seem to think this makes it more difficult to understand. They break the equation up into several parts, but this only makes it harder to put the pieces back together. First they present this equation:


      p(D_k|H2_k) = p(D_i|T_i) * p(D_j|T_j)

.
Huh? This seems easy, we're just multiplying something by something else, but where do the multiplicands come from? That's defined later. Ok, we'll wait a bit. Now the authors define a new variable:


        pi_k = p(H1_k)

.
Which is basically the prior probability of the first hypothesis without taking into account the data.

Finally, the authors present the equation we're waiting for:


  p(D_k|T_k) = pi_k p(D_k|H1_k) + (1 - pi_k) p(D_i|T_i) p(D_j|T_j)

.
“Ah”, you say, “this makes it all clear!” No? I didn't either. Here's part of the problem. By putting this in mathematical form, we've quite effectively hidden what's being recursively defined here. Furthermore, there are still some free variables floating around, and one of them is simply defined as yet a different one! (Note to authors, renaming one free variable as another does not help.) Let me rewrite this.

First, assume that a ‘cluster’ is either a ‘tree’ with a left and right subcluster, or it is a leaf consisting of a single data point. If it is a tree, then these are the equations:


  (define (ph2 cluster)
    (* (pd (cluster-left-child cluster))
       (pd (cluster-right-child cluster))))

  (define (pd cluster)
      "Compute the probability of the observed data given the tree
      structure of the cluster."
      (let ((pi_k (pi cluster)))
        ;; pi_k is in the range [0,1]
        ;; It determines how much weight we give to each
        ;; alternative.
        (+ (*      pi_k  (ph1 cluster))
           (* (- 1 pi_k) (ph2 cluster)))))

.
So we're recursively descending a tree structure! It turns out that the functions we're computing and the arguments we're using are cleverly disguised as subscripts.

There is a catch in the code above. The recursion in pd isn't well grounded. What happens when we hit the leaves of the tree? We'll find out in a moment.

What about this pi function? It isn't defined it yet. In fact, it isn't defined until two pages later in the middle of the proof about the validity of using tree-consistent partitioning. Well, better late than never. The authors define pi_k through an inductive algorithm:


    initialize each leaf i to have d_i = alpha, pi_i = 1

    for each internal node k do
      d_k = alpha * gamma (n_k) + d_leftk * d_rightk

              alpha * gamma (n_k)
      pi_k = ---------------------
                   d_k

    end for

.
Where n_k is the number of data points at internal node k and alpha is a tuning parameter for the whole algorithm.

I'll forgive the sudden introduction of the ‘left’ and ‘right’ subscripts (why now, when ‘i’ and ‘j’ were used before?), and at least they explain the free variables. The big problem is that this since this algorithm is inductive (that is, bottom up) and the equations above are co-inductive (top down), you can't just plug them together.

Fortunately, it isn't hard to recast this algorithm as a set of trivial recursive functions:


  (define (n cluster)
    (if (leaf-node? cluster)
        1
        (+ (n (left-child cluster))
           (n (right-child cluster)))))

  (define (d cluster)
    (if (leaf-node? cluster)
        *alpha*               ;; tuning parameter
        (+ (* *alpha* (gamma (n cluster)))
           (* (d (left-child cluster))
              (d (right-child cluster))))))

  (define (pi cluster)
    (if (leaf-node? cluster)
        1
        (/ (* *alpha* (gamma (n cluster)))
           (d cluster))))

.
Now we can see how to fix the recursion in pd. If we are at a leaf node the value of pi is 1 and the sum


        (+ (*      pi_k  (ph1 cluster))
           (* (- 1 pi_k) (ph2 cluster)))

.
becomes


        (+ (* 1 (ph1 cluster))
           (* 0 (ph2 cluster)))

.
So we can simply fix pd to avoid the call to ph2 in the base case:


  (define (pd cluster)
      "Compute the probability of the observed data given the tree
      structure of the cluster."
      (if (leaf-node? cluster)
          (ph1 cluster)
          (let ((pi_k (pi cluster)))
            (+ (*      pi_k  (ph1 cluster))
               (* (- 1 pi_k) (ph2 cluster))))))

.
Now we can jump back a page in the paper to where they define r_k. This isn't as well disguised:




                            pi_k * p(D_k|H1_k)
  r_k = ----------------------------------------------------------- 
          pi_k * p(D_k|H1_k) + (1 - pi_k) * p(D_i|T_i) * p(D_j|T_j)

.
We can rewrite this simply as:


        (define (r cluster)
          (/ (* (pi cluster) (ph1 cluster))
             (pd cluster)))

.
Here again are the core components:


  (define (r cluster)
    "The marginal probability of this cluster given the data.
     We use this value to determine which clusters to merge."
    (/ (* (pi cluster) (ph1 cluster))
       (pd cluster)))

  (define (pi cluster)
    "Recursively defined weight of merging vs. non-merging hypotheses."
    (if (leaf-node? cluster)
        1
        (/ (* *alpha* (gamma (n cluster)))
           (d cluster))))

  (define (pd cluster)
      "Compute the probability of the observed data given the tree
      structure of the cluster."
      (if (leaf-node? cluster)
          (ph1 cluster)
          (let ((pi_k (pi cluster)))
            (+ (*      pi_k  (ph1 cluster))
               (* (- 1 pi_k) (ph2 cluster))))))

  (define (ph1 cluster)
    ;; Hypothesis 1.  Data is generated uniformly in cluster.

    ;; User defined depending on the generative model.
    ;; Possible definition for bernoulli-beta model: 

    (let ((data-points (cluster-data-points cluster)))
      (product (lambda (term)
                 (let ((present (count-occurrances term data-points))
         (absent (- (length data-points) present))
                   (bernoulli-beta A B present absent))))
                all-terms))

     ;; Other models may be appropriate.
     ) 

  (define (ph2 cluster)
    ;; Hypothesis 2.  Data is generated from two or more of
    ;;  the tree consistent subclusters.
    (* (pd (cluster-left-child cluster))
       (pd (cluster-right-child cluster))))

  (define (n cluster)
    (if (leaf-node? cluster)
        1
        (+ (n (left-child cluster))
           (n (right-child cluster)))))

  (define (d cluster)
    (if (leaf-node? cluster)
        *alpha*               ;; tuning parameter
        (+ (* *alpha* (gamma (n cluster)))
           (* (d (left-child cluster))
              (d (right-child cluster))))))

.
The rest of the process is pretty simple. We start with only leaf nodes and we examine them pair by pair. We create a ‘test cluster’ out of each pair of leaf nodes and compute the value of r for the test cluster. After examining all pairs, we keep the cluster with the highest r and discard the leaves that it contains. We can either continue this process until we have one cluster with everything in it or until every test cluster gives us an r value of .5 or less (in which case we'll end up with a set of disjoint clusters).

Some ranting.

It took me weeks to tease apart the equations in the paper and determine what is really going on. My claim is that the code I wrote above should be easily understood by most practiced coders, and even those that can't understand the code should be able to type it in (transliterated to the computer language of their choice), write the support code, and run it to see what it does.

The code is exactly as mathematical as the equations in the paper, but it is much clearer about what is being defined, what is being operated upon, and what results are being generated.

Here are my gripes:

The equations are unclear for several reasons. The values computed are encoded in the subscripts, the mutually recursive definitions are spread across three different pages, the free variables aren't defined.
The one place where values are computed via an algorithm is using the opposite form of induction of the rest of the paper.
The pseudo-code can't run. Choose a good functional language like lisp and the pseudo-code will be actual code.
The pseudo-code is gratuitously ‘bash and smash’ imperative code. The values computed are naturally functionally inductive over the tree structure, the code should be that way too.
The recursive equations are presented without a base case, and the base case is hidden in the pseudo-code (with the wrong induction so we can't easily see that it grounds the recursion!)

Enough for today.

I got this part of the code working several weeks ago, but I ran into some more problems that are beyond the scope of the paper, but are interesting nonetheless. I'll talk about them soon.

Abstract Heresies

Tuesday, July 29, 2008

No comments: