## Tuesday, August 7, 2007

### Naive Bayesian Classifier

In analyzing my data I wanted to classify it with a naive Bayesian classifier. I wasn't sure I had the math right, so I wrote a tiny abstract classifier to test with. The code is pretty cool:

```(define (10log10 x)
(* 10.0 (/ (log x) (log 10.0))))
;
(define (sum f list)
(fold-left (lambda (sum element)
(+ sum (f element)))
0
list))
;
(define (count-if predicate list)
(sum (lambda (element)
(if (predicate element)
1
0))
list))
;
(define (count-if-not predicate list)
(sum (lambda (element)
(if (predicate element)
0
1))
list))
;
(define (weight-if-present has-feature? positive-examples
negative-examples)
(10log10
(/ (* (+ (count-if has-feature? positive-examples) .5)
(length negative-examples))
(* (+ (count-if has-feature? negative-examples) .5)
(length positive-examples)))))
;
(define (weight-if-absent has-feature? positive-examples
negative-examples)
(10log10
(/ (* (+ (count-if-not has-feature? positive-examples) .5)
(length negative-examples))
(* (+ (count-if-not has-feature? negative-examples) .5)
(length positive-examples)))))
;
(define (weight has-feature?
positive-examples negative-examples
test-element)
(if (has-feature? test-element)
(weight-if-present has-feature? positive-examples
negative-examples)
(weight-if-absent  has-feature? positive-examples
negative-examples)))
;
(define (naive-bayesian-classifier feature-list positive-examples
negative-examples)
(lambda (test-element)
(sum (lambda (feature)
(weight feature
positive-examples negative-examples
test-element))
feature-list)))
```

.

Obviously this can radically improved for performance. Also note that your version of Scheme might take the arguments to fold-left in a different order. In Statistics and the War on Spam, Dave Madigan gives a really good introduction to naive Bayesian classifiers. The implementation above is based on that paper. The paper also describes why you add .5 to all the counts. To take the example directly from the paper:

```;;; Message sets with stop words removed
;;; and word stemming applied.
(define *ham-messages*
'((brown fox quick)
(rabbit run run run)))
;
(define *spam-messages*
'((quick rabbit run run)
(rabbit rest)))
;
;;; A list of six procedures, each tests for
;;; the presence of a single word in a message.
(define *features*
(map (lambda (word)
(lambda (message)
(member word message)))
'(brown fox quick rabbit rest run)))
;
(define *example-classifier*
(naive-bayesian-classifier
*features*
*ham-messages*
*spam-messages*))
;
;;; Classify a new message.
(*example-classifier* '(quick rabbit rest))
;Value: -11.426675035687316
```

.

The value returned differs from the one in the paper for two reasons. I use positive values to indicate `ham' and negative for `spam', rather than vice-versa. I also compute the logarithm in base 10 rather than the natural log and multiply the result by 10 (decibels rule!). This makes it much easier to eyeball the results. Negative is bad, and every 10 dB is another order of magnitude. With a value of -11, I can see that this is spam with a better than 90% probability. (10 dB = 90%, 20 dB = 99%, 30 dB = 99.9%, 40 dB = 99.99%, etc.) Stupid blogger removes blank lines in PRE sections, so I inserted semicolons. It also has problems transitioning back out of PRE, so I have periods in there as well. I don't like to blog anyway, and this just adds another reason to avoid it.

Jens Axel Søgaard said...

Nice post.

With respect to the line break problem, try the following:

In Blogger choose the "Preferences" tab, then the subtab "Formating".

Then check that "Convert Linebreak" is set to "no".

Stephen said...

Hi,

This item is in the atom feed for the PLT Scheme blog: http://blog.plt-scheme.org/feeds/posts/default
(but not on the PLT Scheme blog itself http://blog.plt-scheme.org/)

The link from the PLT Scheme blog feed gives a 404; http://blog.plt-scheme.org/2007/08/naive-bayesian-classifier.html

Cheers,

Stephen

Joe Marshall said...

Yeah, I pressed the wrong button and accidentally posted it to the PLT Scheme blog. Fortunately, it wasn't totally irrelevant, and I fixed it in about 15 seconds, but it seems I left some evidence behind.