(define (10log10 x) (* 10.0 (/ (log x) (log 10.0)))) ; (define (sum f list) (fold-left (lambda (sum element) (+ sum (f element))) 0 list)) ; (define (count-if predicate list) (sum (lambda (element) (if (predicate element) 1 0)) list)) ; (define (count-if-not predicate list) (sum (lambda (element) (if (predicate element) 0 1)) list)) ; (define (weight-if-present has-feature? positive-examples negative-examples) (10log10 (/ (* (+ (count-if has-feature? positive-examples) .5) (length negative-examples)) (* (+ (count-if has-feature? negative-examples) .5) (length positive-examples))))) ; (define (weight-if-absent has-feature? positive-examples negative-examples) (10log10 (/ (* (+ (count-if-not has-feature? positive-examples) .5) (length negative-examples)) (* (+ (count-if-not has-feature? negative-examples) .5) (length positive-examples))))) ; (define (weight has-feature? positive-examples negative-examples test-element) (if (has-feature? test-element) (weight-if-present has-feature? positive-examples negative-examples) (weight-if-absent has-feature? positive-examples negative-examples))) ; (define (naive-bayesian-classifier feature-list positive-examples negative-examples) (lambda (test-element) (sum (lambda (feature) (weight feature positive-examples negative-examples test-element)) feature-list)))
.
Obviously this can radically improved for performance. Also note that your version of Scheme might take the arguments to fold-left in a different order. In Statistics and the War on Spam, Dave Madigan gives a really good introduction to naive Bayesian classifiers. The implementation above is based on that paper. The paper also describes why you add .5 to all the counts. To take the example directly from the paper:;;; Message sets with stop words removed ;;; and word stemming applied. (define *ham-messages* '((brown fox quick) (rabbit run run run))) ; (define *spam-messages* '((quick rabbit run run) (rabbit rest))) ; ;;; A list of six procedures, each tests for ;;; the presence of a single word in a message. (define *features* (map (lambda (word) (lambda (message) (member word message))) '(brown fox quick rabbit rest run))) ; (define *example-classifier* (naive-bayesian-classifier *features* *ham-messages* *spam-messages*)) ; ;;; Classify a new message. (*example-classifier* '(quick rabbit rest)) ;Value: -11.426675035687316
.
The value returned differs from the one in the paper for two reasons. I use positive values to indicate `ham' and negative for `spam', rather than vice-versa. I also compute the logarithm in base 10 rather than the natural log and multiply the result by 10 (decibels rule!). This makes it much easier to eyeball the results. Negative is bad, and every 10 dB is another order of magnitude. With a value of -11, I can see that this is spam with a better than 90% probability. (10 dB = 90%, 20 dB = 99%, 30 dB = 99.9%, 40 dB = 99.99%, etc.) Stupid blogger removes blank lines in PRE sections, so I inserted semicolons. It also has problems transitioning back out of PRE, so I have periods in there as well. I don't like to blog anyway, and this just adds another reason to avoid it.
Nice post.
ReplyDeleteWith respect to the line break problem, try the following:
In Blogger choose the "Preferences" tab, then the subtab "Formating".
Then check that "Convert Linebreak" is set to "no".
Hi,
ReplyDeleteThis item is in the atom feed for the PLT Scheme blog: http://blog.plt-scheme.org/feeds/posts/default
(but not on the PLT Scheme blog itself http://blog.plt-scheme.org/)
The link from the PLT Scheme blog feed gives a 404; http://blog.plt-scheme.org/2007/08/naive-bayesian-classifier.html
Cheers,
Stephen
Yeah, I pressed the wrong button and accidentally posted it to the PLT Scheme blog. Fortunately, it wasn't totally irrelevant, and I fixed it in about 15 seconds, but it seems I left some evidence behind.
ReplyDelete