## Wednesday, April 30, 2008

I was trying to get MIT Scheme (that is, the C substrate of MIT Scheme) to compile on my Windows laptop. I got frustrated. autoconf, m4, perl, etc., etc.

Instead, I decided to rework my C# version. Over the weekend I got it to read the fasl file ‘make.bin’.

## Monday, April 28, 2008

### ‘Getting’ CLOS

Eric asked me:


*I have still yet to really "get"/appreciate CLOS. Any tips on reaching this end?*

I'm not sure I'm exactly the right person to ask. It took me years to ‘get’ CLOS. It just rubbed me the wrong way and seemed to be gratuitously verbose and complex. I think it ‘clicked’ for me when I got painted into a corner as I was trying to update the schema of some persistent objects. I discovered that I could rename the slot of the legacy object without renaming the accessors and arrange for the accessors to act differently on legacy objects versus new ones. It was completely transparent, almost as if it were designed for this purpose, which it was.

I guess my advice would be this. Keep in mind that CLOS objects are ‘structs on steroids’. Whenever you would normally think of using a struct, use a class. Use the high-level accessors that DEFCLASS defines. Avoid using SLOT-VALUE or WITH-SLOTS (these are low-level accessors). Avoid using fancy method combination. Anyplace you think you need an :AROUND method, you probably just need to invoke CALL-NEXT-METHOD in a normal method.
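The advice above can be sketched in a few lines of Common Lisp. This is a hedged illustration, not code from the post — the `box` class, its slots, and the half-width convention are all invented for the example:

```lisp
;;; Treat CLOS classes as structs on steroids: let DEFCLASS define the
;;; accessors and use them instead of SLOT-VALUE or WITH-SLOTS.
(defclass box ()
  ((width  :initarg :width  :accessor box-width)
   (height :initarg :height :accessor box-height)))

;; Legacy objects can keep the public accessor while behaving
;; differently: a more specific primary method that invokes
;; CALL-NEXT-METHOD does the job you might reach for :AROUND to do.
(defclass legacy-box (box) ())

(defmethod box-width ((b legacy-box))
  ;; Suppose (hypothetically) legacy boxes stored a half-width;
  ;; the accessor adapts transparently.
  (* 2 (call-next-method)))
```

Callers just say `(box-width b)` and never notice which schema the object came from.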

I should probably write up some guidelines, but that'll have to wait for another day.

## Thursday, April 17, 2008

### What was the question?

The spectral clustering was a disappointment. It really didn't find anything that wasn't glaringly obvious and it missed several features that I could see by eye.


I think the problem here is that I'm not very clear on the question that I want answered. I have some data, I see some patterns, but I'm not sure what ‘answer’ I want to get. Some patterns in the data are ‘interesting’, others are not. I can't highlight the interesting ones unless I can formally specify *why* they are interesting in a quantifiable way.

Programming is hard. Thinking about what you should program is even harder. Maybe I should go shopping.

## Tuesday, April 15, 2008

### Faster

In my last entry I had gotten the code working to find the second
eigenvector, but it was really slow. The obvious culprit was the
matrix multiply and it was clear that much could be done to speed it
up. Ultimately, though, that is a waste of time.


I want to cluster some 58000 data points. The similarity matrix for this size data set would have 3,364,000,000 elements. Even if it were reasonable to try to fit this into memory, it would take a fair amount of time to perform the operations. The trick is to avoid creating the matrix in the first place.

The matrix Q is defined like this:

```scheme
;; M* is matrix multiply
(define Q
  (let ((AT (M* A (transpose A))))
    (M* D R-inverse AT D-inverse)))
```

and we will be multiplying this times a vector to find the second eigenvector. But instead of composing the matrix prior to multiplication, we'll leave the matrix in its factored form. That is, if a matrix has two factors, X and Y, and you want to multiply it times a vector, then this equivalence holds:

```scheme
(M*V (M* X Y) v)  <-->  (M*V X (M*V Y v))
```

So if we have a procedure that multiplies by Q:

```scheme
(define do-q
  (let* ((AT (M* A (transpose A)))
         (Q (M* D R-inverse AT D-inverse)))
    (lambda (v) (M*V Q v))))
```

we can rewrite it as:

```scheme
(define do-q
  (let* ((TA (transpose A)))
    (lambda (v)
      (M*V D
           (M*V R-inverse
                (M*V A
                     (M*V TA
                          (M*V D-inverse v))))))))
```

We've traded a single huge matrix multiply for a series of matrix-vector multiplies. This wouldn't normally be to our advantage, but the matrices D, R-inverse, and D-inverse are diagonal, so we can take advantage of that fact in two ways. First, since the off-diagonal elements are zero, we needn't multiply them. Second, since the off-diagonal elements are zero, we don't even need to represent them. This is a savings in memory and time because we now only need one vector ref to fetch the contents of the diagonal.
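Since a diagonal matrix can be represented as just the vector of its diagonal elements, multiplying it by a vector is a single elementwise pass. Here is a minimal sketch of the `diagonal-matrix*vector` procedure behind the `DM*V` abbreviation used below — its definition never appears in the post, so this body is my assumption:

```scheme
;; A diagonal matrix is represented as the vector of its diagonal
;; elements, so (D v)[i] is just d[i] * v[i].  One vector-ref per
;; element; the zero off-diagonal entries are never stored or touched.
(define (diagonal-matrix*vector diag v)
  (let* ((n (vector-length diag))
         (answer (make-vector n 0)))
    (do ((i 0 (+ i 1)))
        ((>= i n) answer)
      (vector-set! answer i (* (vector-ref diag i) (vector-ref v i))))))
```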

Matrix A, however, is still big, and its transpose, AT, isn't any smaller. But multiplying by a transpose on the left is the same as multiplying on the right, so we can rewrite the form like this:

```scheme
(define DM*V diagonal-matrix*vector)
(define V*DM vector*diagonal-matrix)

(define do-q
  (lambda (v)
    (DM*V D
          (DM*V R-inverse
                (M*V A
                     (V*DM (DM*V D-inverse v) A))))))
```

And now we don't need to transpose A.

Matrix A is still a big matrix, but it is filled only with zeros and ones, and most of the elements are zero. Rather than represent it in the obvious, naive way, we can represent it as a vector of rows where each row is a simple list of the columns that contain a one. Not only is this much smaller, it is much faster to multiply because we can skip the zero elements.

```scheme
(define (sparse-binary-matrix-times-vector matrix vector result-vector)
  (unless (= (car matrix) (vector-length vector))
    (error "Cannot multiply"))
  (unless (= (vector-length result-vector) (matrix-height (cdr matrix)))
    (error "Cannot multiply"))
  (dotimes (row (matrix-height (cdr matrix)) result-vector)
    (let ((row-elements (vector-ref (cdr matrix) row)))
      (do ((tail row-elements (cdr tail)))
          ((null? tail))
        (let ((column (car tail)))
          (vector-set! result-vector row
                       (+ (vector-ref result-vector row)
                          (vector-ref vector column))))))))
```

With these changes we have completely eliminated the matrix multiply.

## Friday, April 11, 2008

### Learning by programming

One of the best ways to learn something is to attempt to teach it to someone else. If you want to thoroughly learn the details, try teaching a computer.


My linear algebra is very rusty and it wasn't much good to begin with, so rather than snarf some code that did spectral clustering, I've decided to teach my computer how to do it. My experience has taught me that whenever there is heavy math involved in something, I always screw it up in some subtle way. I either misplace a sign, forget a term, or transpose something and the answer comes out wrong. Then I spend hours trying to understand what the answer means only to discover that it wasn't even computed correctly in the first place. Based on this experience, I take the approach of “First make it correct, then make it perform.”

So let's teach my computer how to do spectral clustering. First, we need a representation of a matrix. A vector of rows is not optimal, but it is simple.

```scheme
;;; A dotimes macro, very convenient for matrix math.
(define-syntax dotimes
  (syntax-rules ()
    ((dotimes (variable limit-expr return-value ...) body0 body ...)
     (let ((limit limit-expr))
       (do ((variable 0 (+ variable 1)))
           ((>= variable limit) return-value ...)
         body0 body ...)))))

(define (allocate-matrix height width)
  (let ((rows (make-vector height #f)))
    (dotimes (i height rows)
      (vector-set! rows i (make-vector width 0)))))

(define (matrix-height matrix) (vector-length matrix))
(define (matrix-width matrix)  (vector-length (vector-ref matrix 0)))

(define (matrix-ref matrix i j) (vector-ref (vector-ref matrix i) j))
(define (matrix-set! matrix i j value)
  (vector-set! (vector-ref matrix i) j value))
```

This gets us started. (By the way, hygienic macros are fantastic for things like `dotimes'. Accidental variable capture is not a problem.) We need to multiply these matrixes. I've screwed this up so many times (row vs. column, left vs. right) that I wrote a test program to be sure that I got it right. Getting it right once is easy, keeping it right as we optimize is hard.

```scheme
(define (test-multiply matrix-multiply)
  (let ((matrix-a (allocate-matrix 2 3))
        (matrix-b (allocate-matrix 3 2)))
    ;; a, row 0 (1 0 2)
    (matrix-set! matrix-a 0 0 1)
    (matrix-set! matrix-a 0 1 0)
    (matrix-set! matrix-a 0 2 2)
    ;; a, row 1 (-1 3 1)
    (matrix-set! matrix-a 1 0 -1)
    (matrix-set! matrix-a 1 1 3)
    (matrix-set! matrix-a 1 2 1)
    ;; b, row 0 (3 1)
    (matrix-set! matrix-b 0 0 3)
    (matrix-set! matrix-b 0 1 1)
    ;; b, row 1 (2 1)
    (matrix-set! matrix-b 1 0 2)
    (matrix-set! matrix-b 1 1 1)
    ;; b, row 2 (1 0)
    (matrix-set! matrix-b 2 0 1)
    (matrix-set! matrix-b 2 1 0)
    (let ((result (matrix-multiply matrix-a matrix-b)))
      (or (and (= (matrix-ref result 0 0) 5)
               (= (matrix-ref result 0 1) 1)
               (= (matrix-ref result 1 0) 4)
               (= (matrix-ref result 1 1) 2))
          (error "broken")))))
```

Now on to the naive multiplication routine. Another macro comes in handy.

```scheme
(define-syntax sum
  (syntax-rules ()
    ((sum index-variable start limit-expr term)
     (let ((limit limit-expr))
       (do ((index-variable start (+ index-variable 1))
            (answer 0 (+ answer term)))
           ((>= index-variable limit) answer))))))

(define (matrix-multiply-0 left right)
  (unless (= (matrix-width left) (matrix-height right))
    (error "Cannot multiply"))
  (let* ((result-height (matrix-height left))
         (result-width (matrix-width right))
         (result-matrix (allocate-matrix result-height result-width)))
    (dotimes (row result-height result-matrix)
      (dotimes (column result-width)
        (matrix-set! result-matrix row column
                     (sum k 0 (matrix-height right)
                          (* (matrix-ref left row k)
                             (matrix-ref right k column))))))))
```

It's naive, it's slow, but it is correct.

To create a spectral partition, we need the similarity matrix. This is easily computed by multiplying the matrix by its transpose.

```scheme
(define (transpose matrix)
  (let* ((height (matrix-height matrix))
         (width (matrix-width matrix))
         (new-matrix (allocate-matrix width height)))
    (dotimes (i height new-matrix)
      (dotimes (j width)
        (matrix-set! new-matrix j i (matrix-ref matrix i j))))))

(define (similarity-matrix matrix)
  (matrix-multiply matrix (transpose matrix)))
```

We first need the row sums of the similarity matrix.

```scheme
(define (matrix-row-sums matrix)
  (let* ((height (matrix-height matrix))
         (width (matrix-width matrix))
         (answer (make-vector height 0)))
    (dotimes (i height answer)
      (vector-set! answer i (sum j 0 width (matrix-ref matrix i j))))))
```

And we'll be making the matrix R by creating a diagonal matrix of those sums.

```scheme
(define (compute-r row-sums)
  (let* ((height (vector-length row-sums))
         (r (allocate-matrix height height)))
    (dotimes (i height r)
      (matrix-set! r i i (vector-ref row-sums i)))))
```

Vector PI is the normalized row sums.

```scheme
(define (compute-pi row-sums)
  (let* ((length (vector-length row-sums))
         (total (sum i 0 length (vector-ref row-sums i)))
         (pi (make-vector length 0)))
    (dotimes (i length pi)
      (vector-set! pi i (/ (vector-ref row-sums i) total)))))
```

And matrix D is the diagonalized square root of Pi.

```scheme
(define (compute-d pi)
  (let* ((h (vector-length pi))
         (d (allocate-matrix h h)))
    (dotimes (i h d)
      (matrix-set! d i i (sqrt (vector-ref pi i))))))
```

We need to compute the inverses of R and D. Inverting a matrix is usually tricky, but since R and D are both diagonal, we can simply take the reciprocal of the diagonal elements. This function won't validate that the matrix is in fact diagonal, so the user ought to be aware!

```scheme
(define (compute-diagonal-inverse diagonal)
  (let* ((height (matrix-height diagonal))
         (width (matrix-width diagonal))
         (inverse (allocate-matrix height width)))
    (dotimes (i height inverse)
      (matrix-set! inverse i i (/ 1 (matrix-ref diagonal i i))))))
```

We need one more thing. We need to multiply a matrix and vector. Since a vector (mathematically) is the same as a matrix with a single row or column, we can write simple converters and use our matrix multiply routine.

```scheme
(define (vector->column-matrix vector)
  (let* ((height (vector-length vector))
         (answer (allocate-matrix height 1)))
    (dotimes (i height answer)
      (matrix-set! answer i 0 (vector-ref vector i)))))

(define (vector->row-matrix vector)
  (let* ((width (vector-length vector))
         (answer (allocate-matrix 1 width)))
    (dotimes (i width answer)
      (matrix-set! answer 0 i (vector-ref vector i)))))

(define (matrix->column-vector matrix)
  (let* ((length (matrix-height matrix))
         (answer (make-vector length 0)))
    (dotimes (i length answer)
      (vector-set! answer i (matrix-ref matrix i 0)))))

(define (matrix->row-vector matrix)
  (let* ((length (matrix-width matrix))
         (answer (make-vector length 0)))
    (dotimes (j length answer)
      (vector-set! answer j (matrix-ref matrix 0 j)))))

(define (vector-times-matrix vector matrix)
  (matrix->row-vector
   (matrix-multiply (vector->row-matrix vector) matrix)))

(define (matrix-times-vector matrix vector)
  (matrix->column-vector
   (matrix-multiply matrix (vector->column-matrix vector))))
```

At this point, we can write the code to do the first part of the spectral partitioning.

```scheme
(define (compute-q a)
  (let* ((atrans (transpose a))
         (att (matrix-multiply a atrans))
         (rho (matrix-row-sums att))
         (r (compute-r rho))
         (pi (compute-pi rho))
         (d (compute-d pi))
         (rinverse (compute-diagonal-inverse r))
         (dinverse (compute-diagonal-inverse d))
         (q (matrix-multiply d
              (matrix-multiply rinverse
                (matrix-multiply att dinverse)))))
    q))
```

Does it work?

```scheme
((.271 .092 .092 .081 .074 .074 .076 .102 0.   0.)
 (.092 .246 .244 .072 .065 .065 .068 .109 0.   0.)
 (.092 .244 .258 .071 .065 .065 .067 .108 0.   0.)
 (.081 .072 .071 .188 .171 .171 .178 .079 0.   0.)
 (.074 .065 .065 .171 .229 .219 .195 .072 0.   0.)
 (.074 .065 .065 .171 .219 .229 .195 .072 0.   0.)
 (.076 .068 .067 .178 .195 .195 .202 .075 0.   0.)
 (.102 .109 .108 .079 .072 .072 .075 .26  0.   0.)
 (0.   0.   0.   0.   0.   0.   0.   0.   .571 .429)
 (0.   0.   0.   0.   0.   0.   0.   0.   .429 .571))
```

So far so good! Let's check the first eigenvector.

```scheme
(define (close-enuf? left right tolerance)
  (let ((limit (vector-length left)))
    (define (scan i)
      (or (>= i limit)
          (and (< (abs (- (vector-ref left i) (vector-ref right i)))
                  tolerance)
               (scan (+ i 1)))))
    (scan 0)))

(define (test-eigenvector matrix eigenvector)
  (or (close-enuf? eigenvector
                   (matrix-times-vector matrix eigenvector)
                   .00001)
      (error "It's not an eigenvector")))
```

And we'll need to create an orthogonal vector to get the second eigenvector.

```scheme
(define (vector-normalize v)
  (let* ((h (vector-length v))
         (answer (make-vector h 0))
         (total (sqrt (sum i 0 h (let ((x (vector-ref v i))) (* x x))))))
    (dotimes (i h answer)
      (vector-set! answer i (/ (vector-ref v i) total)))))

(define (orthogonal-vector v)
  (let ((limit (vector-length v))
        (n (make-vector (vector-length v) 0)))
    (dotimes (i (floor (/ limit 2)))
      (vector-set! n i (vector-ref v (- limit i 1)))
      (vector-set! n (- limit i 1) (- (vector-ref v i))))
    (vector-normalize n)))

(define (analyze a)
  (let* ((atrans (transpose a))
         (att (matrix-multiply a atrans))
         (rho (matrix-row-sums att))
         (r (compute-r rho))
         (pi (compute-pi rho))
         (d (compute-d pi))
         (rinverse (compute-diagonal-inverse r))
         (dinverse (compute-diagonal-inverse d))
         (q (matrix-multiply d
              (matrix-multiply rinverse
                (matrix-multiply att dinverse))))
         (first-eigenvector (vector-times-matrix pi dinverse)))
    (test-eigenvector q first-eigenvector)
    (let ((initial-vector (orthogonal-vector first-eigenvector)))
      (do ((v (vector-normalize (matrix-times-vector q initial-vector))
              (vector-normalize (matrix-times-vector q v)))
           (old-v initial-vector v))
          ((close-enuf? v old-v 0.0001) v)))))
```

This code seems to work, but it is awfully slow. It takes almost two seconds to run on an initial matrix of 25 rows. I want to run on 58000 rows. I'll fix it tomorrow.

## Thursday, April 10, 2008

### Oooh, code...

I have 58000 data points that I want to cluster. Each data point consists of a set of attributes. The attributes are simply binary — a data point either has the attribute or not. There are 84000 attributes all told, but each data point usually has only a few. No data point has more than about 100.


I found a paper on Spectral Clustering that looks interesting. The basic idea is to construct a similarity graph from the data points and then cut the graph into parts where there are minimal connections. It turns out that you can find good cut points by looking at the second eigenvector of the similarity matrix. The problem is that a (58000 x 84000) matrix has more than 4 billion entries. So you have to use some tricks.

Imagine that you have a *huge* spreadsheet. Each column is labeled with an attribute. A data point can be represented by a row in the spreadsheet: you just put a checkmark in the appropriate column if that data point has the attribute. Since no data point has more than about 100 attributes, there shouldn't be more than about 100 checkmarks in each row. Here's a simple example:

|      | Available | Has > 1GB | Linux |
|------|-----------|-----------|-------|
| box1 | check     | check     |       |
| box2 | check     | check     | check |
| box3 | check     | check     | check |

In Scheme, we can represent this as a vector of vectors. Using a tiny data set, it looks like this:

```scheme
#(#(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1)
  #(0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
  #(1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
```

We first want to make a table that describes the ‘similarity’ between data points. For every pair of data points, we measure how similar they are. Each data point will have both a row and a column. We can just look up the similarity by finding the row for one of the data points and the column for the other and looking at the intersection. The similarity matrix for the above looks like this:

```scheme
#(#(13  5  5  5  5  5  5  5 0 0)
  #( 5 15 15  5  5  5  5  6 0 0)
  #( 5 15 16  5  5  5  5  6 0 0)
  #( 5  5  5 15 15 15 15  5 0 0)
  #( 5  5  5 15 22 21 18  5 0 0)
  #( 5  5  5 15 21 22 18  5 0 0)
  #( 5  5  5 15 18 18 18  5 0 0)
  #( 5  6  6  5  5  5  5 13 0 0)
  #( 0  0  0  0  0  0  0  0 4 3)
  #( 0  0  0  0  0  0  0  0 3 4))
```

If we have represented our data points like above, the similarity matrix is the result of multiplying the data point matrix by its transpose. (That is, we take the dot product of each row of the data point matrix with every other row. This is also the same as simply counting the number of common terms.)
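The ‘counting common terms’ view is easy to state directly on a sparse representation. Here is a hedged sketch — the `count-common-terms` procedure and the sorted-list-of-columns representation are my invention for illustration, not from the post:

```scheme
;; Each data point is represented as a sorted list of the attribute
;; columns that are set.  The similarity of two data points is the
;; size of the intersection of their attribute lists, which a single
;; merge-style walk computes without touching any zero entries.
(define (count-common-terms row-a row-b)
  (cond ((or (null? row-a) (null? row-b)) 0)
        ((< (car row-a) (car row-b))
         (count-common-terms (cdr row-a) row-b))
        ((> (car row-a) (car row-b))
         (count-common-terms row-a (cdr row-b)))
        (else (+ 1 (count-common-terms (cdr row-a) (cdr row-b))))))
```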

We want to normalize the above graph by making the sum of the terms in each row add up to one. First we sum the terms in each row:

```scheme
#(#(13  5  5  5  5  5  5  5 0 0)    ; 48
  #( 5 15 15  5  5  5  5  6 0 0)    ; 61
  #( 5 15 16  5  5  5  5  6 0 0)    ; 62
  #( 5  5  5 15 15 15 15  5 0 0)    ; 80
  #( 5  5  5 15 22 21 18  5 0 0)    ; 96
  #( 5  5  5 15 21 22 18  5 0 0)    ; 96
  #( 5  5  5 15 18 18 18  5 0 0)    ; 89
  #( 5  6  6  5  5  5  5 13 0 0)    ; 50
  #( 0  0  0  0  0  0  0  0 4 3)    ;  7
  #( 0  0  0  0  0  0  0  0 3 4))   ;  7
```

Then we construct the diagonal matrix R:

```scheme
#(#(48  0  0  0  0  0  0  0 0 0)
  #( 0 61  0  0  0  0  0  0 0 0)
  #( 0  0 62  0  0  0  0  0 0 0)
  #( 0  0  0 80  0  0  0  0 0 0)
  #( 0  0  0  0 96  0  0  0 0 0)
  #( 0  0  0  0  0 96  0  0 0 0)
  #( 0  0  0  0  0  0 89  0 0 0)
  #( 0  0  0  0  0  0  0 50 0 0)
  #( 0  0  0  0  0  0  0  0 7 0)
  #( 0  0  0  0  0  0  0  0 0 7))
```

Then we invert R.

```scheme
#(#(1/48 0 0 0 0 0 0 0 0 0)
  #(0 1/61 0 0 0 0 0 0 0 0)
  #(0 0 1/62 0 0 0 0 0 0 0)
  #(0 0 0 1/80 0 0 0 0 0 0)
  #(0 0 0 0 1/96 0 0 0 0 0)
  #(0 0 0 0 0 1/96 0 0 0 0)
  #(0 0 0 0 0 0 1/89 0 0 0)
  #(0 0 0 0 0 0 0 1/50 0 0)
  #(0 0 0 0 0 0 0 0 1/7 0)
  #(0 0 0 0 0 0 0 0 0 1/7))
```

Finally, we multiply this against our similarity matrix.

```scheme
#(#(13/48 5/48  5/48  5/48  5/48  5/48  5/48  5/48  0   0)
  #(5/61  15/61 15/61 5/61  5/61  5/61  5/61  6/61  0   0)
  #(5/62  15/62 8/31  5/62  5/62  5/62  5/62  3/31  0   0)
  #(1/16  1/16  1/16  3/16  3/16  3/16  3/16  1/16  0   0)
  #(5/96  5/96  5/96  5/32  11/48 7/32  3/16  5/96  0   0)
  #(5/96  5/96  5/96  5/32  7/32  11/48 3/16  5/96  0   0)
  #(5/89  5/89  5/89  15/89 18/89 18/89 18/89 5/89  0   0)
  #(1/10  3/25  3/25  1/10  1/10  1/10  1/10  13/50 0   0)
  #(0     0     0     0     0     0     0     0     4/7 3/7)
  #(0     0     0     0     0     0     0     0     3/7 4/7))
```

This is a really roundabout way to divide every term in a row by the sum of the entire row, but there's a reason for this sort of insanity.

Now we want to compute the second eigenvector of this thing. This isn't easy to do, so instead we'll find a related matrix which has an easy one to compute. Starting with our row sums, we normalize, take the reciprocals, then the square roots and make a diagonal matrix:

```scheme
#(48 61 62 80 96 96 89 50 7 7)

;; normalize
#(12/149 61/596 31/298 20/149 24/149 24/149 89/596 25/298 7/596 7/596)

;; square root
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

;; diagonalize
((.28 0 0 0 0 0 0 0 0 0)
 (0 .32 0 0 0 0 0 0 0 0)
 (0 0 .32 0 0 0 0 0 0 0)
 (0 0 0 .37 0 0 0 0 0 0)
 (0 0 0 0 .4 0 0 0 0 0)
 (0 0 0 0 0 .4 0 0 0 0)
 (0 0 0 0 0 0 .39 0 0 0)
 (0 0 0 0 0 0 0 .29 0 0)
 (0 0 0 0 0 0 0 0 .11 0)
 (0 0 0 0 0 0 0 0 0 .11))
```

And we compute matrix Q by multiplying our normalized similarity matrix by the square root matrix on the left and multiplying that by the inverse of the square root matrix on the right.

```scheme
((.27 .09 .09 .08 .07 .07 .08 .1  0.  0.)
 (.09 .25 .24 .07 .07 .07 .07 .11 0.  0.)
 (.09 .24 .26 .07 .06 .06 .07 .11 0.  0.)
 (.08 .07 .07 .19 .17 .17 .18 .08 0.  0.)
 (.07 .07 .06 .17 .23 .22 .19 .07 0.  0.)
 (.07 .07 .06 .17 .22 .23 .19 .07 0.  0.)
 (.08 .07 .07 .18 .19 .19 .2  .07 0.  0.)
 (.1  .11 .11 .08 .07 .07 .07 .26 0.  0.)
 (0.  0.  0.  0.  0.  0.  0.  0.  .57 .43)
 (0.  0.  0.  0.  0.  0.  0.  0.  .43 .57))
```

The matrix Q is symmetric! This allows us to find the first eigenvector easily. (It's that diagonal matrix we just constructed.) Recall that the interesting thing about an eigenvector is that it remains unchanged when we multiply it by the matrix.

```scheme
first-eigenvector
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

(matrix-times-vector q first-eigenvector)
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)
```

So how do we find the second eigenvector? Well, if we start with a vector that is orthogonal to the first eigenvector, normalize it, and find the fixed point of multiplying by Q, we'll get the second eigenvector. (If it starts orthogonal to the first eigenvector, it will remain orthogonal, but converge to the second eigenvector).

```scheme
first-eigenvector
(.28 .32 .32 .37 .4 .4 .39 .29 .11 .11)

orthogonal-vector
(.11 .11 .29 .39 .4 -.4 -.37 -.32 -.32 -.28)

normalized
(.11 .11 .29 .39 .4 -.4 -.37 -.32 -.32 -.28)

;; iterative multiplication by q and normalizing
(.082 .169 .178 .043 .022 .003 .013 -.06 -.687 -.675)
(.055 .094 .096 .042 .04  .04  .041 .037 -.698 -.696)
(.049 .067 .068 .05  .052 .052 .051 .048 -.699 -.698)
(.046 .057 .057 .054 .058 .058 .056 .047 -.699 -.699)
(.045 .053 .053 .056 .06  .06  .058 .046 -.699 -.699)
(.044 .051 .051 .056 .061 .061 .059 .046 -.699 -.699)
(.044 .05  .051 .057 .062 .062 .06  .045 -.699 -.699)
(.044 .05  .05  .057 .062 .062 .06  .045 -.699 -.699)
(.044 .05  .05  .057 .062 .062 .06  .045 -.699 -.699)
```

This second eigenvector is related to the second eigenvector of our original similarity matrix. We just multiply it on the left by the inverse of the diagonal square root matrix.

But wait! There's more... tomorrow.

## Wednesday, April 2, 2008
