Thursday, April 10, 2008

Oooh, code...

I have 58000 data points that I want to cluster. Each data point consists of a set of attributes. The attributes are simply binary — a data point either has the attribute or not. There are 84000 attributes all told, but each data point usually has only a few. No data point has more than about 100.

I found a paper on Spectral Clustering that looks interesting. The basic idea is to construct a similarity graph from the data points and then cut the graph into parts where there are minimal connections. It turns out that you can find good cut points by looking at the second eigenvector of the similarity matrix. The problem is that a (58000 x 84000) matrix has more than 4 billion entries. So you have to use some tricks.

Imagine that you have a huge spreadsheet. Each column is labeled with an attribute. A data point can be represented by a row in the spreadsheet: you just put a checkmark in the appropriate column if that data point has the attribute. Since no data point has more than about 100 attributes, there shouldn't be more than about 100 checkmarks in each row. Here's a simple example:
AvailableHas > 1GBLinux
box1check check
box2checkcheckcheck
box3checkcheckcheck
In Scheme, we can represent this as a vector of vectors. Using a tiny data set, it looks like this:
#(#(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1)
  #(0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
.
We first want to make table that describes the ‘similarity’ between data points. For every pair of data points, we measure how similar they are. Each data point will have both a row and a column. We can just look up the similarity by finding the row for one of the data points and the column for the other and looking at the intersection. The similarity matrix for the above looks like this:
#(#(13  5  5  5  5  5  5  5 0 0)
  #( 5 15 15  5  5  5  5  6 0 0) 
  #( 5 15 16  5  5  5  5  6 0 0)
  #( 5  5  5 15 15 15 15  5 0 0)
  #( 5  5  5 15 22 21 18  5 0 0)
  #( 5  5  5 15 21 22 18  5 0 0)
  #( 5  5  5 15 18 18 18  5 0 0)
  #( 5  6  6  5  5  5  5 13 0 0)
  #( 0  0  0  0  0  0  0  0 4 3)
  #( 0  0  0  0  0  0  0  0 3 4))
.
If we have represented our data points like above, the similarity matrix is the result of multiplying the data point matrix by its transpose. (That is, we take the dot product of each row of the data point matrix with every other row. This is also the same as simply counting the number of common terms.)

We want to normalize the above graph by making the sum of the terms in each row add up to one. First we sum the terms in each row:

#(#(13  5  5  5  5  5  5  5 0 0)    48
  #( 5 15 15  5  5  5  5  6 0 0)    61
  #( 5 15 16  5  5  5  5  6 0 0)    62
  #( 5  5  5 15 15 15 15  5 0 0)    80
  #( 5  5  5 15 22 21 18  5 0 0)    96
  #( 5  5  5 15 21 22 18  5 0 0)    96
  #( 5  5  5 15 18 18 18  5 0 0)    89
  #( 5  6  6  5  5  5  5 13 0 0)    50
  #( 0  0  0  0  0  0  0  0 4 3)     7
  #( 0  0  0  0  0  0  0  0 3 4))    7
.
Then we construct the diagonal matrix R:
#(#(48 0 0 0 0 0 0 0 0 0) 
  #(0 61 0 0 0 0 0 0 0 0) 
  #(0 0 62 0 0 0 0 0 0 0) 
  #(0 0 0 80 0 0 0 0 0 0) 
  #(0 0 0 0 96 0 0 0 0 0)
  #(0 0 0 0 0 96 0 0 0 0) 
  #(0 0 0 0 0 0 89 0 0 0) 
  #(0 0 0 0 0 0 0 50 0 0) 
  #(0 0 0 0 0 0 0 0 7 0)  
  #(0 0 0 0 0 0 0 0 0 7))
.
Then we invert R.
#(#(1/48 0 0 0 0 0 0 0 0 0) 
  #(0 1/61 0 0 0 0 0 0 0 0) 
  #(0 0 1/62 0 0 0 0 0 0 0) 
  #(0 0 0 1/80 0 0 0 0 0 0) 
  #(0 0 0 0 1/96 0 0 0 0 0)
  #(0 0 0 0 0 1/96 0 0 0 0)
  #(0 0 0 0 0 0 1/89 0 0 0) 
  #(0 0 0 0 0 0 0 1/50 0 0) 
  #(0 0 0 0 0 0 0 0 1/7 0)  
  #(0 0 0 0 0 0 0 0 0 1/7))
.
Finally, we multiply this against our similarity matrix.
#(#(13/48 5/48 5/48 5/48 5/48 5/48 5/48 5/48 0 0)
  #(5/61 15/61 15/61 5/61 5/61 5/61 5/61 6/61 0 0)
  #(5/62 15/62 8/31 5/62 5/62 5/62 5/62 3/31 0 0)
  #(1/16 1/16 1/16 3/16 3/16 3/16 3/16 1/16 0 0)
  #(5/96 5/96 5/96 5/32 11/48 7/32 3/16 5/96 0 0)
  #(5/96 5/96 5/96 5/32 7/32 11/48 3/16 5/96 0 0)
  #(5/89 5/89 5/89 15/89 18/89 18/89 18/89 5/89 0 0)
  #(1/10 3/25 3/25 1/10 1/10 1/10 1/10 13/50 0 0)
  #(0       0    0    0    0    0    0 0 4/7 3/7)
  #(0       0    0    0    0    0    0 0 3/7 4/7))
.
This is a really roundabout way to divide every term in a row by the sum of the entire row, but there's a reason for this sort of insanity.

Now we want to compute the second eigenvector this thing. This isn't easy to do, so instead we'll find a related matrix which has an easy one to compute. Starting with our row sums, we normalize, take the reciprocals, then the square roots and make a diagonal matrix:
#(48 61 62 80 96 96 89 50 7 7)

;; normalize
#(12/149 61/596 31/298 20/149 24/149 24/149 89/596 25/298 7/596 7/596)

;; Square root
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

;; Diagonalize
((.28 0 0 0 0 0 0 0 0 0) 
 (0 .32 0 0 0 0 0 0 0 0)
 (0 0 .32 0 0 0 0 0 0 0)
 (0 0 0 .37 0 0 0 0 0 0)
 (0 0 0 0 .4 0 0 0 0 0)
 (0 0 0 0 0 .4 0 0 0 0)
 (0 0 0 0 0 0 .39 0 0 0)
 (0 0 0 0 0 0 0 .29 0 0)
 (0 0 0 0 0 0 0 0 .11 0)
 (0 0 0 0 0 0 0 0 0 .11))
.
And we compute matrix Q by multiplying our normalized similarity matrix by the square root matrix on the left and multiplying that by the inverse of the square root matrix on the right.
((.27 .09 .09 .08 .07 .07 .08 .1 0. 0.)
 (.09 .25 .24 .07 .07 .07 .07 .11 0. 0.)
 (.09 .24 .26 .07 .06 .06 .07 .11 0. 0.)
 (.08 .07 .07 .19 .17 .17 .18 .08 0. 0.)
 (.07 .07 .06 .17 .23 .22 .19 .07 0. 0.)
 (.07 .07 .06 .17 .22 .23 .19 .07 0. 0.)
 (.08 .07 .07 .18 .19 .19 .2 .07 0. 0.)
 (.1 .11 .11 .08 .07 .07 .07 .26 0. 0.)
 (0. 0. 0. 0. 0. 0. 0. 0. .57 .43)
 (0. 0. 0. 0. 0. 0. 0. 0. .43 .57))
.
The matrix Q is symmetric! This allows us to find the first eigenvector easily. (It's that diagonal matrix we just constructed.) Recall that the interesting thing about an eigenvector is that it remains unchanged when we multiply it by the matrix.
first-eigenvector
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

(matrix-times-vector q first-eigenvector)
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)
.
So how do we find the second eigenvector? Well, if we start with a vector that is orthogonal to the first eigenvector, normalize it, and find the fixed point of multiplying by Q, we'll get the second eigenvector. (If it starts orthogonal to the first eigenvector, it will remain orthogonal, but converge to the second eigenvector).
first-eigenvector
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

orthogonal-vector
(.11 .11 .29 .39 .4 -.4 -.37 -.32 -.32 -.28)

normalized
(.11 .11 .29 .39 .4 -.4 -.37 -.32 -.32 -.28)

iterative multiplication by q and normalizing
(.082 .169 .178 .043 .022 .003 .013 -.06 -.687 -.675)
(.055 .094 .096 .042 .04 .04 .041 .037 -.698 -.696)
(.049 .067 .068 .05 .052 .052 .051 .048 -.699 -.698)
(.046 .057 .057 .054 .058 .058 .056 .047 -.699 -.699)
(.045 .053 .053 .056 .06 .06 .058 .046 -.699 -.699)
(.044 .051 .051 .056 .061 .061 .059 .046 -.699 -.699)
(.044 .05 .051 .057 .062 .062 .06 .045 -.699 -.699)
(.044 .05 .05 .057 .062 .062 .06 .045 -.699 -.699)
(.044 .05 .05 .057 .062 .062 .06 .045 -.699 -.699)
.
This second eigenvector is related to the second eigenvector of our original similarity matrix. We just multiply it on the left by the inverse of the diagonal square root matrix.

But wait! There's more... tomorrow.