Tuesday, August 5, 2008

The final trick to improving the performance is a change in how the basic clustering routine works. Heller and Ghahramani present this pseudocode:

  initialize: number of clusters c = n, and
          Di = {x(i)} for i = 1 ... n
  while c > 1 do
    Find the pair Di and Dj with the highest
        probability of the merged hypothesis:

                     πk p(Dk|H1k)
               rk = ----------------
                        p(Dk|Tk)

     Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
     Delete Di and Dj, c ← c - 1
  end while
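To make the cost concrete, here is a minimal Python sketch of that naive loop. The `score` function is a hypothetical stand-in for the rk computation above; the real BHC merge probability is not shown here.

```python
import itertools

def score(a, b):
    # Hypothetical stand-in for the merge score rk; the real version
    # would compute the posterior probability of the merged hypothesis.
    return -abs(len(a) - len(b))

def naive_cluster(points):
    # One singleton cluster per data point, as in the initialize step.
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > 1:
        # Examine every pair to find the best merge: O(c^2) per iteration.
        i, j = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: score(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]
```

Every iteration rescans all pairs, which is exactly the redundant work the table is meant to avoid.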

The naive implementation would examine every pair of clusters to find the one with the best score, then do it all over again on the next iteration. That adds up quickly. Keeping these scores in a table would make it much faster, but then the table size becomes the issue.

But we're on the right track. The problem with a table, besides the size, is that most of the entries are useless. If clusters A and B score well together, we certainly don't care how poorly clusters A and Q or A and Z or A and J etc. score. So we'll keep a smaller table that has a single entry for each cluster that is the cluster that it scores best against. This table will start with the same number of entries as we have leaf nodes, and it will gradually grow smaller as we create bigger and bigger clusters.

Now the issue becomes how to cleverly maintain that table. First, let us assume the steady state that each entry in the table has a key which is a cluster and a value which is another cluster that produces the best score when these are combined. This will be our invariant. When we add a new cluster to the table, we have to do the following:
  1. Find the other cluster that is the best match.

  2. Update any cluster that would get a better score with the new entry than with the best it already had.

These operations can proceed in a single pass. We walk the table and test the score of merging each existing key with the new entry, tracking the best match we find as we go. At each entry, however, we also check whether the test score exceeds the score already stored there. If it does, the new cluster becomes the best match for that entry; otherwise we leave the entry alone.
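A minimal sketch of that single pass, assuming the table maps each cluster to a (best partner, best score) pair; `score` is a hypothetical placeholder for the real merge score:

```python
def insert_cluster(table, score, new):
    """Add `new` to `table` (cluster -> (best partner, best score))
    in a single pass over the existing entries."""
    best_partner, best_score = None, float('-inf')
    for key in list(table):
        _, current = table[key]
        trial = score(key, new)
        # Step 1: track the best match for the new cluster.
        if trial > best_score:
            best_partner, best_score = key, trial
        # Step 2: upgrade any entry that scores better with `new`.
        if trial > current:
            table[key] = (new, trial)
    table[new] = (best_partner, best_score)
```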

We start by putting two leaf nodes in the table, each recording the other as its best combination. Then we incrementally add each remaining leaf node of our starting set. Once all the leaf nodes are added, we know the best match for each leaf.

After initializing the table, we'll iterate by selecting the best match in the table, deleting the matching elements from the table, actually performing the merge, and then inserting the merged cluster back into the table. Deleting the elements is a little tricky. Obviously, we will remove the two entries that are keyed by those elements, but it may be the case that other entries would be the best match against the ones we are deleting. We have to fix these up if we are to maintain the invariant. An easy way to maintain the invariant is to simply remove and then re-insert any entry that matches one of the two elements we are deleting.
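The delete-and-reinsert fix-up might look like the following sketch; the helper name and the `score` placeholder are mine, not from the original:

```python
def delete_and_fix(table, score, a, b):
    """Remove clusters a and b from the table, then restore the
    invariant by re-inserting any entry whose best match was a or b."""
    del table[a], table[b]
    # Entries pointing at a deleted cluster are now invalid.
    stale = [k for k, (p, _) in table.items() if p in (a, b)]
    for k in stale:
        del table[k]
    for k in stale:
        # Re-insert: rescore k against every remaining entry.
        best_p, best_s = None, float('-inf')
        for key in list(table):
            trial = score(k, key)
            if trial > best_s:
                best_p, best_s = key, trial
            if trial > table[key][1]:
                table[key] = (k, trial)
        table[k] = (best_p, best_s)
```

After this, the merged cluster a ∪ b gets inserted with the same single-pass routine as any other cluster.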

This works pretty well, but we end up rescoring a lot of entries. Since each rescore involves scanning the entire table, that still adds up to quite a bit of work.

I'm somewhat lazy, and I don't want to rescore entries unless I really need to. When an entry in the table has to be rescored because its best-matching cluster is being deleted, there are only two things that could happen: either the new cluster about to be added will score even better than the one we just deleted, in which case the entry will be paired with it when it gets inserted, or whatever the entry ends up paired with will score equal to or worse than what it had.

So we'll relax the invariant. Each entry either contains the actual best cluster that the entry's key should merge with, or it contains an upper bound on the best score the key could have if we knew what that cluster was. Rather than rescore an entry when it becomes invalid, we simply drop the matching element and retain the old score.

Now when we insert an element into the table, we still have to match it against the remaining elements. If an entry has a deferred score, and we equal or exceed that score when the new cluster is merged with the entry, we can simply replace the entry and update the score. If not, we retain the deferred score.

When we go to find the best entry in the table, we use the deferred scores as upper bound estimates of the actual score for that entry. If the best entry exceeds all the deferred scores, we can use it without determining the real match of the deferred scores. However, if we find some deferred scores that have an upper bound that exceeds the best we find in the table, then we really need to find out the actual best match for the deferred score. We do this by finally scanning the table for that entry.
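The selection logic can be sketched like this: entries whose partner is `None` carry only a deferred upper bound, and we resolve one only when its bound beats every fully-scored entry. `score` remains a hypothetical placeholder.

```python
def find_best(table, score):
    """Return the key with the highest resolved merge score.
    An entry (None, s) means s is only an upper bound."""
    while True:
        best = max(table, key=lambda k: table[k][1])
        partner, _ = table[best]
        if partner is not None:
            # No deferred bound exceeds this real score; done.
            return best
        # A deferred bound is on top: resolve it by scanning the table.
        actual = max((k for k in table if k != best),
                     key=lambda k: score(best, k))
        table[best] = (actual, score(best, actual))
```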

Recall that each time we iterate, we make the table smaller by one. (We delete two nodes and add a single combined one). The longer we defer scoring an entry, the smaller the table gets and the fewer entries need to be scanned. This reduces the number of scoring calculations that we have to do.

Finally, we can use a little bit of knowledge about the kind of things we want to merge. In the problem I'm working on, each data point has only a handful of terms from a selection of thousands. I *know* that the data points won't score well against others unless they have common terms. By keeping a table that maps terms to clusters, I can skip computing scores for matches that simply won't happen. On the other hand, some data points are duplicates of others and obviously should be grouped together. There is no need to search for matches and invoke the full clustering machinery. We'll just take the duplicates and form initial clusters out of those before even populating the table.
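A sketch of the term index, with hypothetical names: build an inverted index from term to data-point ids, and only pairs that share at least one term ever become candidates for scoring.

```python
from collections import defaultdict

def candidate_pairs(points):
    """points maps an id to its set of terms.  Points with no term in
    common can't score well, so emit only co-occurring pairs."""
    index = defaultdict(set)
    for pid, terms in points.items():
        for t in terms:
            index[t].add(pid)
    pairs = set()
    for ids in index.values():
        for a in ids:
            for b in ids:
                if a < b:
                    pairs.add((a, b))
    return pairs
```

Exact duplicates show up here too: ids with identical term sets can be collapsed into initial clusters before the table is ever populated.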

So how much does all this help? I can now cluster 20,000 data points by examining only 25 million possible clusters. That takes several minutes, but we're a long way from the 20 or so data points we started with.