Monday, June 18, 2007

If you have studied information theory, you might have noticed the question-bits procedure in the last post. It is the binary case ofthe Shannon Entropy. This makes sense: the entropy is the amount of information we can get out of the answer to a yes-or-no question when we have already been told the cluster. This also supplies us with a good idea of what makes a good cluster: the entropy after clustering should be as small as possible. That is, if the cluster is very homogeneous, then there aren't very many good yes-or-no questions that can narrow down your guess. Everything in the cluster has the same answer to almost all the questions. This would be a good clustering. If the cluster is very heterogeneous, then you'd expect a yes-or-no question to divide the cluster further. This would be a poor clustering. As it turns out, other people have been down this path: Barbara, Couto, and Li wrote a paper describing `COOLCAT: An entropy-based algorithm for categorical clustering'. So I guess I gotta write some code to try this out....