Tuesday, December 11, 2007

How many strings can you intern?

Assuming a 32-bit OS and processor. Suppose we define INTERN to take a series of bytes with an upper limit on the number of bytes (say 128 bytes) and return an unsigned integer in [0, 2^32). Another function SYMBOL-NAME takes the unsigned integer and returns the associated bytes (or stuffs them in a buffer, or whatever). How many strings could we reasonably intern? Could we get to 2^32 without going to disk?
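To make the interface concrete, here is one possible shape for it (a sketch; INTERN and SYMBOL-NAME are the operations described above, everything else is invented). The address-space arithmetic suggests the answer to the last question is no: 2^32 interned strings at even one byte apiece, plus any per-string index overhead, exceed the 2^32 bytes of a 32-bit address space.

using System;
using System.Collections.Generic;

class InternTable
{
    readonly List<byte[]> names = new List<byte[]>();   // id -> bytes
    readonly Dictionary<string, uint> ids =             // bytes -> id
        new Dictionary<string, uint>();

    public uint Intern(byte[] bytes)
    {
        if (bytes.Length > 128)
            throw new ArgumentException("too many bytes to intern");
        // Base64 gives us a hashable stand-in for the byte vector.
        string key = Convert.ToBase64String(bytes);
        uint id;
        if (!ids.TryGetValue(key, out id))
        {
            id = (uint)names.Count;
            names.Add(bytes);
            ids.Add(key, id);
        }
        return id;
    }

    public byte[] SymbolName(uint id)
    {
        return names[(int)id];
    }
}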

You're a regular when...

your picture appears on the web page of the pub you hang out at.

That's me and the missus hanging out.

Drop in if you want to talk Lisp or Scheme. I'm there nearly every evening.

Monday, December 10, 2007

I saw that Google is putting up $20M as a prize for landing a rover on the moon (see http://www.googlelunarxprize.org/). I was wondering if that would be remotely worth the effort.

You can just hire a company to put a satellite in orbit. The Falcon 9 is supposed to be “the lowest cost per pound to orbit” and it costs $35M to put a small satellite into geosynchronous transfer orbit. Of course the Falcon 9 doesn't actually exist yet, but if it did, you'd only start out $15M in the hole.

Part of the problem with a rocket is that you spend a lot of energy lifting the fuel and oxygen to the altitude where you plan to use them. I was daydreaming about a way to avoid that. If you use a ramjet or scramjet for the lower altitudes, you can use the existing oxygen in the air. This cuts the amount of lifting you have to do by half. Now if you could just figure out a way to get the fuel to where you need it without having to carry it with you.

My initial infeasible idea was to have a long hose.

But then I was thinking that you don't really need a hose to keep the fuel confined. If you squirt a droplet or two of fuel into the path of the oncoming jet, it can be sucked into the intake just like the oxygen. It shouldn't be that hard to time it right. Of course it wouldn't be possible to build a tower tall enough to make this useful, but maybe if you put the tower on the side of a mountain you could get a few miles of thrust before you have to switch to the internal fuel supply.

Monday, November 26, 2007

Origin of CAR and CDR

“It was soon noticed that extraction of a subexpression involved composing the extraction of the address part with cwr and that continuing along the list involved composing the extraction of the decrement part with cwr. Therefore, the compounds car, standing for ``Contents of the Address part of Register number'', and its analogs cdr, cpr, and ctr were defined. The motivation for implementing car and cdr separately was strengthened by the vulgar fact that the IBM 704 had instructions (connected with indexing) that made these operations easy to implement.”
--- John McCarthy, History of Lisp

The IBM 704 was not an early Lisp machine.

Greenspun's Tenth Rule of Programming

“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.”

This quote is attributed to Philip Greenspun. Lately it seems someone got confused and attributed it to someone else, and now I'm seeing the misattribution over and over.

Thursday, November 22, 2007

In the Kingdom of Nouns

It is an open secret in the Kingdom of Nouns that the King himself has a half-wit, good-for-nothing brother. His name was expunged from the official rolls long ago, and everyone calls him null.

null is a true oddity in the Kingdom of Nouns. He is a second-class citizen. This is no doubt due to the fact that he is completely and utterly incapable of even the most trivial of actions. Whenever he is asked to perform a task, he responds by taking Exception and throwing a tantrum. Every other citizen in the Kingdom is expected to at least know his Class and how to present himself to the public. Not null, though.

The residents of the Kingdom take pride in order, so you might imagine that null's behavior would make him an outcast in the Kingdom of Nouns. In fact, he has been disinherited by every single family in the kingdom!

null desperately wanted something to do. People would assign him mundane tasks like standing in for a citizen that was shortly expected to arrive or standing at the end of a line as a sort of marker. Occasionally when a request came from the castle, null would return to deliver the news that no one was able to satisfy the request.

Not everyone was impressed by null's `ability' (so to speak) to take up space. Sometimes, you just need a member of the right family or someone that is capable of performing a well-defined task. Requirements like these were spelled out explicitly in the contracts that people made: (TaskManager applicant) or (IMowLawns teenager). But because null was the king's brother, he had a special dispensation to ignore these restrictions. He'd walk right on in and pretend to be the manager or the teenager. Of course as soon as you asked him to do something, like start the mower or even tell you his name, he'd take Exception and throw another tantrum.

Some people just tried to pretend that null didn't exist and hope that the king or somebody would catch him as he threw his tantrum. Others took the time to proofread their contracts and add extra clauses everywhere to exclude null or deal with him in some ad-hoc way.

Unfortunately, there is no happy ending to this story. null continues to vex and irritate the citizens of the Kingdom of Nouns, and no relief is in sight.

Tuesday, November 20, 2007

Consider this problem. You are running a high school and you suspect some gang activity, so you want to be on the lookout for people wearing gang `uniforms'. But you don't know what any of the uniforms might be. So you watch the students and take note of what they are wearing: some have puffy jackets, some have hoodies, some have t-shirts, some have baggy pants, some have hats that they wear backwards, etc. etc. etc. Over the course of the day, you see a lot of different clothing, a lot of different combinations, and all in different colors. You note, however, that you did see five students wearing purple puffy jackets, baggy pants, and red baseball caps. What is the likelihood that this is a gang uniform? Obviously, it will depend on the size of the student population and the prevalence of certain kinds of clothing, but in the extreme case where you see, say, fifteen students with the *exact* same outfits matching in twenty-seven different random qualities, you can be pretty sure that is a gang. On the other hand, if you see two students that happen to wear the same color puffy jacket and otherwise have nothing in common, you can write that one off as coincidence. So given the number of students and the relative prevalence of various clothing elements, how many points of coincidence do you need to be fairly certain you have found a gang uniform?
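A back-of-the-envelope way to attack this (a sketch; it assumes clothing choices are independent, which is the shakiest part, and every number in it is invented): the chance that a random student wears one specific combination is the product of the individual prevalences, so with N students the expected number of coincidental matches is N times that product, and a Poisson tail estimates how surprising k matches are.

using System;

class GangUniform
{
    // P(X >= k) for a Poisson distribution with the given mean.
    static double PoissonTail(double mean, int k)
    {
        double term = Math.Exp(-mean);   // P(X = 0)
        double below = 0.0;
        for (int i = 0; i < k; i++)
        {
            below += term;
            term *= mean / (i + 1);      // P(X = i+1) from P(X = i)
        }
        return 1.0 - below;
    }

    static void Main()
    {
        int students = 1000;
        // Invented prevalences: purple puffy jacket, baggy pants, red cap.
        double[] prevalence = { 0.05, 0.30, 0.10 };
        double p = 1.0;
        foreach (double q in prevalence)
            p *= q;
        double expected = students * p;               // 1.5 matches by chance
        Console.WriteLine(PoissonTail(expected, 5));  // ~0.019: suspicious
    }
}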

Tuesday, October 9, 2007

Signs of life

The CLOS implementation is showing signs of life. I have gotten to the bootstrap step where we replace the bootstrap version of MAKE-INSTANCE with the generic version. On the next call to MAKE-INSTANCE, a lot of stuff happens that eventually lands us in the yet-to-be-implemented primary method of MAKE-INSTANCE.

At this point, I need to clean up a few of the kludges and finish implementing a few of the important functions (for example, my wimpy sort routine only works on lists of a single element so far).

Monday, October 8, 2007

Function Cells in C#

I found this interesting syntactic hack to mimic function cells in C#. Suppose you know the signature and name of a function at compile time, but the value will be computed at runtime and may change over time. First, define a delegate type for the signature of the function:

delegate object FooFunction (string s, int i);

In the class where the function Foo will be defined, we create a static variable to hold the delegate:

static FooFunction currentFooFunction;

and a Property to read it:

public static FooFunction Foo
{
    get
    {
        return currentFooFunction;
    }
}

You can now write code like this:

object bar = Foo("test", 4);

which appears to be an ordinary static function call, but in actuality is a call that indirects through the currentFooFunction.

The basic trick here is to have a Property that returns a delegate. The Property code can perform an arbitrary computation to return the delegate, and the compiler knows that a delegate returned from a Property call can be used in a syntactic function position.
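Putting the pieces together (a sketch; SetFoo and the anonymous-delegate bodies are my own additions to show the cell changing over time):

using System;

delegate object FooFunction(string s, int i);

class FunctionCellDemo
{
    static FooFunction currentFooFunction =
        delegate(string s, int i) { return s + "/" + i; };

    public static FooFunction Foo
    {
        get { return currentFooFunction; }
    }

    // The "function cell" is writable: smash in a new definition at runtime.
    public static void SetFoo(FooFunction f)
    {
        currentFooFunction = f;
    }

    static void Main()
    {
        Console.WriteLine(Foo("test", 4));   // test/4
        SetFoo(delegate(string s, int i) { return s.Length + i; });
        Console.WriteLine(Foo("test", 4));   // 8
    }
}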

Friday, October 5, 2007

That is my final answer

The IL version had a problem that I didn't notice before. Generic functions need to be closed over themselves. This allows the implementation to store a trampoline function when the generic is created and/or modified. The trampoline function recomputes what code is run when the generic is run, and then smashes itself out of the code path. Subsequent runs of the generic function don't go through the trampoline.

The problem is that the method object held by the .NET delegate is a read-only object. It cannot be overwritten. No problem, we'll just indirect through the field we can write. But we need to pass in the delegate object itself as an argument so we can modify it. In order to do that the delegate method needs access to the this pointer in the delegate. But the point of a delegate is to provide the delegated method with a this pointer. That is the Target field in the delegate. But the target object held by the .NET delegate is also read-only. It isn't possible to construct a delegate that delegates to itself in .NET (at least not without major black magic). We can create a delegate that delegates to a static method within the delegate class itself, but then we no longer have access to the instance object.

The solution is another level of indirection. We create a ManifestInstance class that has mutable state we need and closes over itself. The delegate simply defers to the ManifestInstance. This is the way delegates are intended to be used, so there is no need to use IL to add extra fields or anything. We can do this with standard, simple C#.
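In outline, the arrangement is something like this (a sketch; the real code surely differs in detail):

using System;

public delegate object StandardObject(params object[] arguments);

delegate object FuncallHandler(StandardObject self, object[] arguments);

class ManifestInstance
{
    public StandardObject Class;        // the CLOS class of this instance
    public object[] Slots;              // mutable slot storage
    public StandardObject Self;         // the delegate that fronts for us
    public FuncallHandler OnFuncall =   // smashable: this is the trampoline slot
        delegate(StandardObject self, object[] args)
        {
            throw new NotSupportedException("instance is not funcallable");
        };

    object Funcall(params object[] args)
    {
        return OnFuncall(Self, args);   // the handler gets the delegate itself
    }

    public static StandardObject Create()
    {
        ManifestInstance inst = new ManifestInstance();
        inst.Self = new StandardObject(inst.Funcall);  // closes over inst
        return inst.Self;
    }
}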

But I still want my StandardObjects to be printed with some sort of useful debugging info rather than {Method = {System.Object uninitializedFuncallableInstanceMethod(System.Object[])}}. The StandardObject delegate needs a DebuggerTypeProxy. A DebuggerTypeProxy is a helper object that tells the debugger how to render objects when the standard mechanisms aren't sufficient. But for reasons I don't understand, you can't associate a DebuggerTypeProxy with a .NET delegate! Argh!!!!

But I found a trick. You can associate a DebuggerDisplay attribute with a delegate. This is a string attribute that is interpreted in the context of the object being debugged. You can put things like "Object {Name}" in the DebuggerDisplay attribute and it will print the literal string `Object' followed by the Name field of the object. For our StandardObject delegate, we tell it to print the Target field. This sort of works: "{Target}", but then the debugger uses the ToString method of the ManifestInstance. That makes it hard to visually distinguish the ManifestInstance from the delegate that refers to it. Instead, we want to get a computed property of the ManifestInstance. Unfortunately, this doesn't work: "{Target.ObjectDebuggerDisplay}", but this does: "{((ManifestInstance)Target).ObjectDebuggerDisplay}". Now when the debugger sees a StandardObject, it calls a method defined in ManifestInstance that has access to the object's state.
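So the declaration ends up looking something like this (ObjectDebuggerDisplay being a property on ManifestInstance that formats the useful debugging info):

using System.Diagnostics;

[DebuggerDisplay("{((ManifestInstance)Target).ObjectDebuggerDisplay}")]
public delegate object StandardObject(params object[] arguments);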

Since all this can be done in simple C# code, there is no reason to hack IL for StandardObjects. Back to the hacking....

Thursday, October 4, 2007

Trying again

Having standard-object implemented in a separate assembly is getting unwieldy. I think I'm going to return to the manifest-instance technique.

The Visual Studio debugger can use certain annotations on your code to make debugging more pleasant. For example, I have annotated my Keyword and Symbol classes so that the watch window can print them correctly (not quite exactly, yet --- no escaping). When I hit a breakpoint, I can get a display like this:

Name   Value                              Type
foo    :NAME                              object{Lisp.Keyword}
bar    CLOS::STANDARD-GENERIC-FUNCTION    object{Lisp.Symbol}


Right now, however, my StandardObject delegates are printing like this:

Name      Value                                                                                 Type
generic   {Method = {System.Object uninitializedFuncallableInstanceMethod(System.Object[])}}    object{Lisp.StandardObject}


That's not so useful. I can add the debugging annotations to the StandardObject, but since it is in a different assembly, it creates a circular dependency. If I go back to the somewhat more clunky, but portable representation, I'll find it a lot easier to debug.

Later that same day....
Argh. You can't add a debugger type proxy to a delegate in C#. Back to IL.

Monday, October 1, 2007

Standard Object

I'm a fan of CLOS. When I wrote the .NET interface to Larceny, I based it on Eli Barzilay's Swindle, which is based on Tiny CLOS. Naturally, if I'm writing scripting code in Lisp, I want to have the same sort of interface to the underlying object layer.

CLOS has a couple of interesting requirements that can be tricky to implement. The first is the ability to construct funcallable-standard-instances, that is, first-class procedure objects that can have additional data associated with them. The second is the ability to call change-class on an instance to transform it in place to an instance of another class. (Why would I want to do that?! Normally, a user of CLOS wouldn't want to or need to do something so horrendous to an object, but if you are using a REPL and change a class definition by reloading a file, you want the system to update existing instances to the new schema. Rather than forcing the user to exit the REPL, recompile, reload, and then reconstruct his state, we tweak the objects behind the scenes and keep going.) If we combine these features we can potentially turn any instance into a procedure. (No, we don't want to design features that deliberately use this ability, we want to allow the user to type defclass..., say `Oops! I meant defgeneric...,' and do the fixup on the fly.)

The CLR has first-class functions called delegates. A delegate is basically an object that contains an object and a reference to a method implemented by the object. When the delegate is invoked, the method within the object is invoked and the answer is returned. If we consider the method to be a piece of code and the object to be an environment, then we can see that a CLR delegate is simply a closure. If we use delegates to implement funcallable-standard-instances, then our CLOS generic functions can be used anywhere a delegate can (modulo type co- and contravariance). A non-funcallable standard-instance can simply be a delegate that refers to a method that raises an exception.
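A toy illustration of that correspondence (the names here are invented):

using System;

delegate int Thunk();

class Counter
{
    int n = 0;                          // the environment
    public int Next() { return ++n; }   // the code
}

class ClosureDemo
{
    static void Main()
    {
        Thunk next = new Counter().Next;   // delegate = code + environment
        Console.WriteLine(next());         // 1
        Console.WriteLine(next());         // 2
    }
}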

We need to associate some additional data with our standard-instances. First, an instance has an associated class object. While it is true that our delegates have an associated .NET type, it is always going to be the System.Delegate type and the .NET types are insufficient for our purposes. Second, an instance has an associated set of `slots'. We need the ability to both refer to these elements and to replace them in order to implement change-class. There are a couple of strategies that will work here. One is to hold the additional data in a weak hash table keyed by the delegate. This technique was used in the original version of Swindle, but the hash lookup will incur a fair amount of overhead. A portable technique is to define a helper class with the necessary fields:

class ManifestInstance
{
    StandardObject closClass;     // `class' is a reserved word in C#
    object[] slots;
    FuncallHandler onFuncall;     // a delegate type: what to do when funcalled
}


A StandardObject would be a delegate to a trampoline function in ManifestInstance. This technique works, but it seems kind of a kludge. We'd like to have our own special subclass of delegate that simply has two additional fields. (We'll have an additional benefit: the debugger will show us the class and slots as immediate subvalues in the delegate. If we indirect through a ManifestInstance, we have to manually chase down the pointer in the debugger.) We can create subclasses of delegates by using the delegate declaration in C#, but we can't add any new fields.

We can, however, do this trick in IL and assemble it into a signed assembly that can be placed in the GAC. We can then just add a reference to it in VC#.

I've added some code for this in http://jrm-code-project.googlecode.com/svn/trunk/Jrm.StandardObject/. I'm still tweaking it a bit, but it seems to be working.

Monday, September 17, 2007

I'm trying to figure out how the CLR type system will work with Lisp. My script does something simple: it calls .write on the document object. It is currently erroring out because `.write' is undefined. Fair enough; I didn't define it. But it ought to be defined when I try to invoke it on the document object. Hmm... maybe that's the key....

This just keeps getting better and better! Now I can't debug because something is timing out the script and aborting the process before I can look at what is going on.

Weekend update

Over the weekend I changed the script project to launch the script within a browser rather than from the command line. This allows me to debug from the IDE directly. I spent a fair amount of time trying to get the browser to stop complaining that my script language was an untrusted ActiveX control, but I had no luck. I implemented IObjectSafety, but then the browser stopped trying to run my script altogether. I added a reader to the Lisp engine so now the script can be turned into list structure. I'm adding a minimal evaluator so I can test out function calling. I haven't quite figured out how to integrate the package system with everything, but there is something minimal in place.

Thursday, September 13, 2007

Past that hurdle

The problem wasn't necessarily an out-of-memory issue; it was a mismatch in how native COM objects are marshaled to managed CLR objects. I wrote a custom marshaler that emits debugging messages and used it to figure out which argument was having problems. A little fussing around and I got it working just fine. At this point, the scripting host starts the scripting engine, calls `AddNamedItem' a couple of times, calls `ParseScriptText' with the source code of the script, and tells the engine to go. I need to suck some metadata out of the named objects so that I can use them. That seems a little clunky, and I'm surprised that it isn't built into the interop code. A little poking around in how VisualBasic and JScript do it led me to believe that there isn't a good canonical mechanism. It is amazing this stuff works at all, and frankly impressive that it works as well as it does. The code is on my code site at http://jrm-code-project.googlecode.com/svn/trunk/LScript/
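The marshaler itself is mostly boilerplate. Something along these lines (a sketch from memory, not the actual code; the logging is the only interesting part):

using System;
using System.Runtime.InteropServices;

class SpyMarshaler : ICustomMarshaler
{
    static readonly SpyMarshaler instance = new SpyMarshaler();

    // The interop layer finds the marshaler through this conventional hook.
    public static ICustomMarshaler GetInstance(string cookie)
    {
        return instance;
    }

    public object MarshalNativeToManaged(IntPtr pNativeData)
    {
        Console.Error.WriteLine("native->managed: 0x{0:x}", pNativeData.ToInt64());
        return Marshal.GetObjectForIUnknown(pNativeData);
    }

    public IntPtr MarshalManagedToNative(object managedObj)
    {
        Console.Error.WriteLine("managed->native: {0}", managedObj);
        return Marshal.GetIUnknownForObject(managedObj);
    }

    public void CleanUpNativeData(IntPtr pNativeData) { Marshal.Release(pNativeData); }
    public void CleanUpManagedData(object managedObj) { }
    public int GetNativeDataSize() { return -1; }
}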

Tuesday, September 11, 2007

Got it!

Apparently the scripting hosts were sending the strings over in ANSI format rather than Unicode. I'm not quite sure why this causes an out of memory error (maybe something was trying to copy the string and misread the initial characters as a size parameter?). Changing the MarshalAs attribute in the declaration of IActiveScript solved the problem.
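For the record, the shape of the fix looks something like this (a sketch of the relevant fragment; the surrounding methods are elided, and in COM interop the vtable order of the declarations matters):

using System;
using System.Runtime.InteropServices;

[ComImport]
[Guid("BB1A2AE1-A4F9-11CF-8F20-00805F2CD064")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
interface IActiveScript
{
    // ... earlier methods elided ...
    void AddNamedItem(
        [MarshalAs(UnmanagedType.LPWStr)] string name,  // was being marshaled as ANSI
        uint flags);
    // ... remaining methods elided ...
}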

Monday, September 10, 2007

A Clue

I downloaded ComSpy and found that when the script host (WScript) is about to call IActiveScript::Method9 (which is AddNamedItem), it gets an HRESULT of 0x8007000E ("Not enough storage is available to complete this operation.") Looking at the VM size of the WScript process, it gets up to 30MB by the time it tries to call that method. The CLR is clearly taking up quite a bit of room, but how do I persuade WScript to let it have more? There should be plenty of room.

More fooling around with scripting

Over the weekend I wrote a scripting engine for Windows in C#. It doesn't do anything yet; it just implements the scripting API. The only problem is that it doesn't completely work. It doesn't completely fail, either. That's weird, because I'd expect an all-or-nothing response.

When I run the script under the cscript host, it creates an instance of the engine and calls the SetScriptSite and InitNew methods. Then it craps out with the error `CScript Error: Can't find script engine "LScript" for script.' I suspect the actual error message is a red herring. I bet the engine initialization code is located in a try/catch/finally block and the message comes from that block. What is weird is that since the SetScriptSite and InitNew methods are being invoked, it is clear that a whole lot of the machinery is working. Clearly the CLR runtime is being loaded, the engine assembly is loaded, the engine class is created, and the COM-callable wrappers for both the IActiveScript and IActiveScriptParse interfaces are being correctly created and delivered to the script host. So why does it fail?

I ran the server from two different hosts: CScript and Internet Explorer. Both seem to be able to call the first few methods. A hint is that the next method each should call is AddNamedItem. Neither host seems to invoke it. I can't see what cscript is trying to do (the stack trace is truncated), so I wrote my own scripting host in C++. Unfortunately, it seems to work with no problem. I also tried Hugh Brown's NullScript engine, and that seems to work just fine, too. For a while I thought I was on to something because I noticed that a generated typelib was missing some entries, but it made no difference when I corrected the problem. In any case, I put the code up at http://jrm-code-project.googlecode.com/svn/trunk/LScript/ and I'll probably play with it a bit more before I give up completely.

Saturday, September 8, 2007

Playing around

I decided to try making a scripting engine stub in C#. I got it partway working. The various scripting hosts can start the stub and invoke a few of the methods in it, but for some reason, they all fail at some point or another. It is a bit mysterious because I don't have a debugger that can handle this, so I'm relying on print statements. But one scripting host just stops calling the engine after initializing it. There doesn't seem to be an error, it just hangs up. I suspect it is a failed query for an interface, but I'm not sure how to go about finding out which interface.

Friday, September 7, 2007

I wonder

Why can't I script my windows box with Lisp or Scheme instead of VBScript or JScript?

Saturday, August 18, 2007

How I do love blogging

So I tried changing the blog settings to not convert line breaks and now all the paragraphs run together. I tried putting in extra blank lines between paragraphs and they disappear, but the first paragraph has extra large spacing between the lines. All these posts look fantastic in the preview mode, but as soon as I hit publish, the web gremlins of ugliness sprinkle their magic style directives on my text.

Another Couple of Graphs

I mentioned in an earlier post that I tried using a cumulative probability graph without much success. It turns out that this sort of graph comes in handy for answering particular kinds of questions. For example, I wanted to illustrate the various response times of one class of machines, so I generated this graph. (The actual graph has numbers along the X axis. Imagine that the tic marks represent `centons' in the middle, `microns' to the left and `arns' to the right.)

The percent axis tells you what percent of response times are faster than the graphed point. That is, to find the median response time, go to the 50% mark, read across, and then down. You'll find that the median is a tad over 1 centon. Half of the responses happen faster; half take longer. The 92nd percentile is at 1 `arn'. It is pretty unlikely you'd have to wait this long for an answer, but not impossible. Very few responses seem to come in at a `micron'. It turns out that many measurable properties have these S-shaped curves. That isn't surprising, because this curve is simply the integral of the underlying distribution (which is often Gaussian). But plotting like this avoids the bucketing issues I was having earlier. If you plot a graph like this and find kinks and non-linearities in it, you have discovered some outliers. This graph shows one.
That hook-shaped kink in the upper right is caused by a particular bug which prevented some of the machines from responding until a certain timeout had expired. The timeout value is excessively long and this graph makes that obvious.
Empirically, my observed samples have a log-normal distribution. This is quite interesting because log-normal distributions behave quite differently from Gaussian distributions. For example, the average value falls quite far from the median. The log-normal distribution is also highly asymmetric. You need to use the geometric mean and geometric standard deviation rather than the more common arithmetic mean and standard deviation.
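If you want to compute these yourself: the geometric statistics are just the ordinary arithmetic ones applied to the logs of the samples. A quick sketch (the class and method names are arbitrary):

using System;

class GeoStats
{
    // geometric mean = exp(arithmetic mean of the logs)
    public static double GeometricMean(double[] samples)
    {
        double sum = 0.0;
        foreach (double x in samples)
            sum += Math.Log(x);
        return Math.Exp(sum / samples.Length);
    }

    // geometric standard deviation = exp(standard deviation of the logs)
    public static double GeometricStdDev(double[] samples)
    {
        double mean = Math.Log(GeometricMean(samples));
        double sumsq = 0.0;
        foreach (double x in samples)
        {
            double d = Math.Log(x) - mean;
            sumsq += d * d;
        }
        return Math.Exp(Math.Sqrt(sumsq / samples.Length));
    }
}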

Thursday, August 9, 2007

What's missing from this picture?

A friend was conducting an interview the other day and asked the applicant to write a program that traversed a tree in depth-first order. The applicant started by defining a class for the nodes in the tree:
class Node {
    int value;
    Node parent;
}


Well, it is certainly space efficient. Optimized for going up the tree, too.

Tuesday, August 7, 2007

Naive Bayesian Classifier

In analyzing my data I wanted to classify it with a naive Bayesian classifier. I wasn't sure I had the math right, so I wrote a tiny abstract classifier to test with. The code is pretty cool:

(define (10log10 x)
  (* 10.0 (/ (log x) (log 10.0))))
;
(define (sum f list)
  (fold-left (lambda (sum element)
               (+ sum (f element)))
             0
             list))
;
(define (count-if predicate list)
  (sum (lambda (element)
         (if (predicate element)
             1
             0))
       list))
;
(define (count-if-not predicate list)
  (sum (lambda (element)
         (if (predicate element)
             0
             1))
       list))
;
(define (weight-if-present has-feature? positive-examples
                                        negative-examples)
  (10log10
   (/ (* (+ (count-if has-feature? positive-examples) .5)
         (length negative-examples))
      (* (+ (count-if has-feature? negative-examples) .5)
         (length positive-examples)))))
;
(define (weight-if-absent has-feature? positive-examples 
                                       negative-examples)
  (10log10
   (/ (* (+ (count-if-not has-feature? positive-examples) .5)
         (length negative-examples))
      (* (+ (count-if-not has-feature? negative-examples) .5)
         (length positive-examples)))))
;
(define (weight has-feature? 
                positive-examples negative-examples 
                test-element)
  (if (has-feature? test-element)
      (weight-if-present has-feature? positive-examples 
                                      negative-examples)
      (weight-if-absent  has-feature? positive-examples 
                                      negative-examples)))
;
(define (naive-bayesian-classifier feature-list positive-examples 
                                                negative-examples)
  (lambda (test-element)
    (sum (lambda (feature)
           (weight feature 
                   positive-examples negative-examples 
                   test-element))
         feature-list)))


Obviously this can be radically improved for performance. Also note that your version of Scheme might take the arguments to fold-left in a different order. In Statistics and the War on Spam, Dave Madigan gives a really good introduction to naive Bayesian classifiers. The implementation above is based on that paper. The paper also describes why you add .5 to all the counts. To take the example directly from the paper:

;;; Message sets with stop words removed
;;; and word stemming applied.
(define *ham-messages* 
  '((brown fox quick)
    (rabbit run run run)))
;
(define *spam-messages* 
  '((quick rabbit run run)
    (rabbit rest)))
;
;;; A list of six procedures, each tests for
;;; the presence of a single word in a message.
(define *features*
  (map (lambda (word)
         (lambda (message)
           (member word message)))
       '(brown fox quick rabbit rest run)))
;
(define *example-classifier*
  (naive-bayesian-classifier 
   *features*
   *ham-messages*
   *spam-messages*))
;
;;; Classify a new message.
(*example-classifier* '(quick rabbit rest))
;Value: -11.426675035687316


The value returned differs from the one in the paper for two reasons. I use positive values to indicate `ham' and negative values to indicate `spam', rather than vice versa. I also compute the logarithm in base 10 rather than the natural log, and multiply the result by 10 (decibels rule!). This makes it much easier to eyeball the results: negative is bad, and every 10 dB is another order of magnitude. With a value of -11, I can see that this is spam with better than 90% probability. (10 dB = 90%, 20 dB = 99%, 30 dB = 99.9%, 40 dB = 99.99%, etc.) Stupid Blogger removes blank lines in PRE sections, so I inserted semicolons. I don't like to blog anyway, and this just adds another reason to avoid it.

Wednesday, July 25, 2007

From Histograms to Scatter Plots

The histograms I generated helped people understand what the sample plots were showing, but they didn't really help me understand what was ultimately going on. For instance, it is clear that some of the histograms were bimodal or multimodal, but it wasn't clear what caused the different modes. I poked around at some of the data to see what caused some of the weirdness, and I found that some of the identifiable items were caused by misbehaving or misconfigured machines, or by machines that were being used for experiments. I didn't want to do this by hand, though, because it was rather time-intensive. I wanted to write a program that could automatically determine whether a machine was working correctly. It was clear that the machine with the load of 180 wasn't working right, but some of the other problem machines had much more subtle issues.

I tried creating some scatter plots of things that ought to be roughly correlated, but I was disappointed to see that nothing obvious stood out. Since I was using a computer, though, I decided to simply create scatter plots of every variable against every other variable. This gave me a couple hundred plots. Most of them were uninteresting. If two variables are exactly correlated, the plot is simply a line. If they are completely uncorrelated, the plot is a blob. But if the variables are loosely correlated, or if the correlation depends upon the correct functioning of the machine, you end up with a scatter plot with unusual and distinctive features.

One of these caught my eye. In it, we're plotting the load average (in dB) against a shared resource. The points outside the big blob are of interest. Where there is a cluster of points, like at (40, 7), I found that it was a single machine that was misconfigured in some particular way. The correctly working machines produced this part of the plot: there is an obvious multi-modal distribution here that would have been difficult to see by plotting just one or the other variable.

Where's the Lisp?

The first few graphs I made just by futzing around with Emacs and gnuplot. When I had to do more processing --- smoothing histograms and such --- I decided to use Scheme to manipulate the data. When I started thinking I had to convolve the data with Gaussians, I decided that I should use MIT Scheme because it has a pretty good compiler. The script that probes the machines was written in Scsh, and I made sure to emit the samples in a trivially readable format. Scheme has been great for experimenting with this. I've written a fair amount of ad-hoc code for generating the data for gnuplot, and most of the resulting graphs are uninformative. When I find some presentation that does show something of interest, I just replace the specific literals with variables and wrap a lambda around it. I only mention this because I was surprised at how easy it was to transition from a small collection of random fragments of code to a simple toolkit by mixing and matching higher-order abstractions and by noticing common patterns. It is completely undisciplined programming, but very powerful.

More Graphs

Although the graph of loads in decibels has some interesting features, I couldn't figure out how to correlate the bands in the graph to some meaningful fact about the machines. I tried plotting the load average against other quantities I had measured, but there was no obvious reason for what I was seeing. In desperation, I tried a few off-the-wall ideas. One was to sort the samples lexicographically by operating system version. This is the resulting graph:

Now I was getting somewhere. It seems that the load averages are strongly affected by the underlying operating system. On one level this isn't surprising, but since the various OSes are supposed to be very similar, I expected much less variation. Apparently OS selection is much more important than I thought.

I found a problem with this graph, though. When I showed it to people, they wanted to know what the X axis was (it isn't much of anything in this graph, except that each data point has a different value for X). Lexicographic ordering of the OS isn't really related to the linear axis other than for grouping purposes. The important feature is that different OSes have different bands of density in the sample space. What I really needed was to separate the graphs and plot the density as a histogram. This turns out to be a lot easier said than done.

Here are the load averages for one particular OS (the one that occupies the range from 7000 to 9500 in the earlier graph). You can just make out two density bands. What we want is a simple histogram of the density as it changes when we go from the bottom to the top of the graph. No problem. But something is wrong here. There are some bumps in the graph where you would expect, but the sizes are wrong. The peak for the loads at -5dB is much bigger than the one at 0dB, but the density of points at -5dB doesn't seem much higher than that at 0dB. And shouldn't there be a blip up at 23dB?

Since we are using the logarithm of the load average, the buckets for the histogram are non-uniform (in fact, they are exponentially decreasing in size). We can fix this by multiplying by a normalization factor that increases exponentially. This is more realistic, but the exponential multiplication gives some weird artifacts as we approach the right-hand side. Since we know that load averages won't normally exceed 13dB, we really want to just look at the middle of the graph. The problem we are encountering here is that the graph is far too jagged as we approach the right-hand side. This is because the buckets are getting very fine-grained as the load gets higher. We want to smooth the histogram, but we want the smoothing to encompass more and more buckets as we reach the higher loads.

I spent a couple of days trying to figure out how to convolve this graph with a continuously varying Gaussian curve. I actually got somewhere with that, but it was really hard and very slow (you have to do a lot of numeric integration). Raph Levien suggested punting on this approach and just plotting the accumulated loads. I tried this, but for this particular problem it doesn't give the information I wanted. (I mention it because the idea turns out to be applicable elsewhere.)

Over the weekend I had an insight that in retrospect seems obvious. I guess I was so stuck in the approach I was working on that I didn't see it. (The seductive thing about my approach was that I was making progress; it was just getting to the point of diminishing returns.) The buckets are hard to deal with because they vary in size. They vary in size because I took the logarithm of the load. If I simply perform the accumulation and smoothing before taking the logarithm, all the problems go away. (Well, most of them. I still need to vary the smoothing, but now I can vary it linearly rather than exponentially, and I can use addition rather than convolution.)

Here is what the graph looks like when you do it right. In order to check that I was doing it right, I generated some fake data with varying features (uniform distributions, drop-outs, Gaussians, etc.) and made sure the resulting histogram was what I expected. I'm pretty confident that this histogram is a reasonably accurate plot of the distribution of load averages for the system in question. (more later...)
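The final approach is simple enough to sketch (this is a reconstruction of the idea, not the code I actually used; the bucket count and window-growth constants are arbitrary):

using System;

class LoadHistogram
{
    // Accumulate in linear space, smooth with a window that widens
    // linearly with the load, and only take the log when plotting.
    public static double[] Smoothed(double[] loads, int buckets, double maxLoad)
    {
        double[] counts = new double[buckets];
        foreach (double load in loads)
        {
            int i = (int)(load / maxLoad * (buckets - 1));
            if (i >= 0 && i < buckets)
                counts[i] += 1.0;
        }

        double[] smoothed = new double[buckets];
        for (int i = 0; i < buckets; i++)
        {
            int halfWidth = 1 + i / 20;   // window widens linearly with load
            int lo = Math.Max(0, i - halfWidth);
            int hi = Math.Min(buckets - 1, i + halfWidth);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++)
                sum += counts[j];
            smoothed[i] = sum / (hi - lo + 1);
        }
        // Plot smoothed[i] against 10 * log10 of the bucket's center load.
        return smoothed;
    }
}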

Some Graphs, Maybe

Arthur Gleckler suggested that I put up some graphs to illustrate the data analysis I was talking about earlier. I've decided to give it a try. Here are the raw load averages collected from a whole bunch of machines. What do you know? It worked!

This is the first graph I made from the data I've collected. I wasn't really expecting much, but there are a few things to notice: first, there is a really unhappy machine with a load average in the 170 range. Second, my scanning program does not poll at regular intervals. The unhappy machine turned out to be missing an important library. Unfortunately, the software reacted to this problem by spawning a process that itself depended on the library. When I checked, the machine had several thousand failing processes.

Other than those two facts, the graph isn't very informative. Everything is squeezed down into the bottom. But watch what happens when I convert the load average to decibels. Now things start to get interesting. In the middle of the graph are two bands of higher density. This indicates a bimodal distribution of loads. It would be good to find out what causes that. Another thing to notice is that if we ignore the misconfigured machine, the graph has a fairly sharp upper edge at about 13dB. We can say that if the machine load ever gets above 13dB, it is clearly in trouble. At the bottom of the graph, the load averages fall into discrete lines. This is an artifact of the way loads are reported: they are rounded to the nearest hundredth, and when we show them in log scale, the rounding becomes obvious at the bottom of the graph. (more soon...)

Thursday, July 19, 2007

An Unorthodox Idea: Measure Computer Performance in dB

I promised some unorthodox ideas, so here's one. When I was at Northeastern University I would frequently suggest that people report the logarithms of the benchmarks they ran. I mentioned that a side benefit was that it would make you feel like a "real" engineer. Will Clinger said that a real engineer would be using decibels. I of course immediately converted my latest benchmarks to dB. I was just taking things to the absurd limit, but I noticed something remarkable: the dB values were exceptionally easy to understand.

Human senses tend to work logarithmically. This allows people to sense a very wide range of sounds, light intensities, vibrations, etc. As it happens, the decibel is a particularly convenient unit for measuring things that people sense. For a large variety of phenomena, a 1 dB change is the `just noticeable difference'. A 3 dB change is almost exactly a factor of 2, and every 10 dB is another order of magnitude. It's pretty easy to get the hang of it. To give an example, let's convert the benchmark times from the previous post to dB:

C gcc                    7.6
D Digital Mars           8.2
Clean                    8.3
Lisp SBCL #2             8.6
Oberon-2 OO2C            8.7
Pascal Free Pascal #3    8.8
D Digital Mars #2        8.9
OCaml                    9.1
Eiffel SmartEiffel       9.1
Ada 95 GNAT              9.9
C++ g++                 10.0
Nice                    11.4
Java 6 -server          11.7
Scala #2                11.7
CAL                     12.3
BASIC FreeBASIC #2      12.3
SML MLton               12.5
Haskell GHC #2          12.6
C# Mono                 12.8
Fortran G95             13.6
Forth bigForth          13.9
Haskell GHC             18.4
Smalltalk VisualWorks   19.3
Erlang HiPE             19.9
Erlang HiPE #2          19.9
Scheme MzScheme         21.5
Scala                   24.8
Haskell GHC #3          26.5
Lua #3                  27.7
Pike                    27.8
Python                  28.1
Mozart/Oz #2            28.7
Perl #2                 29.6
PHP                     30.7
Tcl #2                  31.6
Ruby                    32.5


So what can we see? SBCL is just a tad slower than C gcc, but it is a tad faster than C++ g++. Scheme MzScheme is an order of magnitude slower than C++, but Perl is yet another order of magnitude slower. Between MzScheme and Scala you lose a factor of 2. There are other things you can do with dB, for example, you can measure compiler performance. A Scheme compiler that improves performance by, say, 12dB would move the Scheme runtime up near the Ada one. You might decide that a compiler tweak that improves performance by less than 1dB probably isn't worth it. Try converting some of your own performance numbers to dB and see what you think.
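If you want to try it, the conversion is one line: a running time of t seconds is 10 * log10(t) dB relative to a one-second baseline. A sketch:

using System;

class Decibels
{
    // A running time in seconds, expressed in dB relative to 1 second.
    public static double FromSeconds(double seconds)
    {
        return 10.0 * Math.Log10(seconds);
    }
    // FromSeconds(5.82) => 7.65, which rounds to the C gcc entry above.
}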

Reporting Performance

I've seen a lot of different ways of reporting performance benchmark results. The most popular way seems to be simply listing the elapsed running time. Sometimes there is a bar chart or graph. Every now and then the relative speed is reported against some baseline. There are some oddballs that report benchmark times in Hertz. All of these methods suffer from the same problem: the relative performance is not scale-free.

Suppose you ran a benchmark on four different machines. Machine A takes 1.2 seconds, machine B takes 2.4 seconds, machine C takes 60.1 seconds, and machine D takes 80.7 seconds. Machine A is clearly the winner, coming in twice as fast as the next entry. But although machine C beats out machine D by a full 20 seconds, it isn't twice as fast as D. B would have to double its speed to catch up with A, but D only needs to run 25% faster to catch up with C. If you plot these results on a graph or bar chart, you'd see that the gap between D and C is much larger than the gap between B and A, but large gaps are to be expected when the time scale is larger.

This problem is easy to fix: simply take the logarithm of the run time. In the example above, the log times for A, B, C, and D are 0.18, 0.88, 4.10, and 4.39 respectively. Now A and B differ by 0.70 while C and D differ by 0.29. It is obvious that C is closer to D than B is to A.

To give a real-life example, I grabbed the results of the fannkuch benchmark from the Computer Language Shootout. First, the timings as reported in the shootout:

C gcc                   5.82
D Digital Mars          6.57
Clean                   6.78
Lisp SBCL #2            7.20
Oberon-2 OO2C           7.39
Pascal Free Pascal #3   7.60
D Digital Mars #2       7.80
OCaml                   8.06
Eiffel SmartEiffel      8.22
Ada 95 GNAT             9.78
C++ g++                 9.95
Nice                   13.89
Java 6 -server         14.63
Scala #2               14.67
CAL                    16.93
BASIC FreeBASIC #2     17.15
SML MLton              17.93
Haskell GHC #2         18.32
C# Mono                18.85
Fortran G95            23.02
Forth bigForth         24.46
Haskell GHC            69.09
Smalltalk VisualWorks  84.80
Erlang HiPE            97.60
Erlang HiPE #2         98.30
Scheme MzScheme       139.75
Scala                 299.77
Haskell GHC #3        441.82
Lua #3                582.46
Pike                  598.58
Python                641.36
Mozart/Oz #2          739.06
Perl #2               906.29
PHP                  1165.02
Tcl #2               1456.69
Ruby                 1786.76


Now the timings in log scale:

C gcc                  1.76
D Digital Mars         1.88
Clean                  1.91
Lisp SBCL #2           1.97
Oberon-2 OO2C          2.00
Pascal Free Pascal #3  2.03
D Digital Mars #2      2.05
OCaml                  2.09
Eiffel SmartEiffel     2.11
Ada 95 GNAT            2.28
C++ g++                2.30
Nice                   2.63
Java 6 -server         2.68
Scala #2               2.69
CAL                    2.83
BASIC FreeBASIC #2     2.84
SML MLton              2.89
Haskell GHC #2         2.91
C# Mono                2.94
Fortran G95            3.14
Forth bigForth         3.20
Haskell GHC            4.22
Smalltalk VisualWorks  4.44
Erlang HiPE            4.58
Erlang HiPE #2         4.59
Scheme MzScheme        4.94
Scala                  5.70
Haskell GHC #3         6.09
Lua #3                 6.37
Pike                   6.39
Python                 6.46
Mozart/Oz #2           6.61
Perl #2                6.81
PHP                    7.06
Tcl #2                 7.28
Ruby                   7.49


There are a couple of features that weren't obvious at first. There is a noticeable gap between C++ g++ and Nice, a huge gap between Forth bigForth and Haskell GHC, and another between Scheme MzScheme and Scala. Lua #3 is pretty close to Pike, even though they differ by 16 seconds of real time, but Nice and g++ are further apart even though they differ by less than 4 seconds of real time. I have a further refinement in the next post.

Wednesday, July 18, 2007

Well, I wrote some clustering code, but it isn't staggering into life yet. I've been distracted by a new problem at work: making sense out of tens of thousands of data points.

We have a number of computers that the QA group uses. I had the idea that we should probe them and collect some statistics. With enough statistics, maybe we'll see some trends or something. I wrote a small script that goes to each QA machine, pings it, and if it is alive, logs in and collects a bunch of basic info like physical memory in use, swap space in use, load average, number of threads, etc. The results are stored in a file with the machine name and a timestamp for when the sample was taken. Over the past couple of weeks I've collected over 100 thousand samples of about 13 different values each. (Parts of the sample are pretty static: the total size of physical memory doesn't change, and the operating system is only changed rarely, but these are interesting values to have at hand.)

The problem now is to extract some information from this pile of raw data. This isn't as easy or obvious as I had thought it would be. My first experiment was to examine the load averages. I figured that maybe we could tell the difference between machines that were in use and those that were idle, or between correctly running machines and broken ones. Maybe we'd see a correlation between the amount of RAM and the load. Little did I know.... (more later)

Tuesday, July 3, 2007

Playing with Bayesian filters

I wrote a simple Bayesian filter as a prelude to the clustering stuff I've been thinking about. Nothing too fancy here, but I was playing with some series code as well. I've been using the Lingspam corpus to measure against.

Tuesday, June 26, 2007

How Floating Point Works

A floating-point number is neither a Real number nor an Interval --- it is an exact rational number. A floating-point number is not uncertain or imprecise; it exactly and precisely has the value of a rational number. However, not every rational number is represented. Floating-point numbers are drawn from a carefully selected set of rational numbers. The particulars vary among the different floating-point formats. The popular IEEE 754 standard uses these:

Single precision: the significand is an integer in the range [8388608, 16777215], which is multiplied by a power of 2 ranging from

2^-149 = 1/713623846352979940529142984724747568191373312

up to

2^104 = 20282409603651670423947251286016

Double precision: the significand is an integer in the range [4503599627370496, 9007199254740991], which is multiplied by a power of 2 ranging from

2^-1074 = 1/202402253307310618352495346718917307049556649764142118356901358027430339567995346891960383701437124495187077864316811911389808737385793476867013399940738509921517424276566361364466907742093216341239767678472745068562007483424692698618103355649159556340810056512358769552333414615230502532186327508646006263307707741093494784

up to

2^971 = 19958403095347198116563727130368385660674512604354575415025472424372118918689640657849579654926357010893424468441924952439724379883935936607391717982848314203200056729510856765175377214443629871826533567445439239933308104551208703888888552684480441575071209068757560416423584952303440099278848

Since the power of 2 can be negative, it is possible to represent fractions. For instance, the fraction 3/8 is represented as

12582912 * 2^-25 = 12582912/33554432 = 3/8

in single precision floating point. In double-precision, 3/8 is represented as

6755399441055744 * 2^-54 = 6755399441055744/18014398509481984 = 3/8
Not every fraction can be represented. For example, the fractions 1/10 and 2/3 cannot be represented. Only those fractions whose denominator is a power of 2 can be represented. It is possible to represent integers. For instance, the integer 123456 is represented as

15802368 * 2^-7 = 15802368/128 = 123456
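You can check decompositions like these mechanically. Here's a sketch for positive, normalized single-precision values (denormals, zeros, infinities, and NaNs would all need extra cases):

using System;

class Decompose
{
    // Print a positive, normalized float as (integer significand) * 2^exponent.
    public static void Show(float f)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(f), 0);
        int significand = (bits & 0x7FFFFF) | 0x800000;   // restore the hidden bit
        int exponent = ((bits >> 23) & 0xFF) - 127 - 23;  // unbias, scale out 2^23
        Console.WriteLine("{0} * 2^{1}", significand, exponent);
    }
    // Show(0.375f)  prints "12582912 * 2^-25"
    // Show(123456f) prints "15802368 * 2^-7"
}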
Not every integer can be represented. For example, the integer 16777217 cannot be represented in single precision.

--------

When floating-point numbers appear in programs as literal values, or when they are read by a program from some data source, they are usually expressed as decimal numbers. For example, someone might enter "38.21" into a form. Some decimal numbers, like 0.125, can be represented as a floating-point number, but most fractional decimal numbers have no representation. If a literal number cannot be represented, then the system usually silently substitutes a nearby representable number. For example, if the expression 0.1 appears as a literal in the code, the actual floating-point value used may be

13421773 * 2^-27 = 13421773/134217728 > 0.1
As another example, 38.21 becomes

10016522 * 2^-18 = 10016522/262144 < 38.21
Most implementations substitute the nearest representable number, but this has not always been the case. As it turns out, much of the error in floating-point code comes from the silent substitution performed on input.

--------

Floating-point numbers are usually printed as decimal numbers. All binary floating-point numbers have an exact decimal equivalent, but it usually has too many digits to be practical. For example, the single-precision floating-point number

13421773 * 2^-25 = 13421773/33554432 = 0.4000000059604644775390625
Furthermore, the extra digits may be misconstrued as precision that doesn't exist. The next largest float above that:

13421774 * 2^-25 = 6710887/16777216 = 0.400000035762786865234375
doesn't share any digits beyond the first 7. There is no standard way to print the decimal expansion of a floating-point number, and this can lead to some bizarre behavior. However, recent floating-point systems print decimal numbers with only as many digits as necessary to ensure that the number will be exactly reconstructed if it is read back in. This is intended to work correctly even if the implementation must substitute a nearby representable number on input. This allows one to `round-trip' floating-point numbers through their decimal representation without error. (NOTE: not all floating-point systems use this method of parsimonious printing. One example is Python.)

There is a curious side effect of this method of printing: a literal floating-point number will often print as itself (in decimal) even if it was substituted in binary. For example, if the expression 0.1 appears as a literal in the code and is substituted by

13421773 * 2^-27 = 13421773/134217728
this substitute number will be printed as "0.1" in decimal even though it is slightly larger than 0.1. As mentioned above, Python does not use parsimonious printing, so it is often the case in Python that a literal floating-point number will not print as itself. For example, typing .4 at the Python prompt will return "0.40000000000000002". The round-trip property is nice to have, and it is nice that floating-point fractions don't seem to change, but it is important to know this fact: the printed decimal representation of a floating-point number may be slightly larger or smaller than the internal binary representation.

--------

Arithmetic

Simple arithmetic (add, subtract, multiply, or divide) on floating-point numbers works like this:

1. Perform the operation exactly. Treat the inputs as exact rationals and calculate the exact rational result.

2a. If the exact rational result can be represented, use it.

2b. If the exact rational result cannot be represented, use a nearby exact rational that can be represented. (Round.)

This is important, and you may find it surprising: floating-point operations will produce exact results, without rounding, if the result can be represented.

There are a number of interesting cases where we can be assured that exact arithmetic will be used. One is addition, multiplication, and subtraction of integers within the range of the significand. For single-precision floating point, this is the range [-16777215, 16777215]. For double precision, [-9007199254740991, 9007199254740991]. Even integer division (modulo and remainder) will produce exact results if the inputs are integers in the correct range. Multiplication or division by powers of two will also produce exact results in any range (provided they don't overflow or underflow).

In many cases floating-point arithmetic will be exact, and the only sources of error will be the substitution performed when converting the input from decimal to binary and the way the result is printed. However, in most cases other than integer arithmetic, the result of a computation will not be representable in the floating-point format. In these cases, rule 2b says that the floating-point system may substitute a nearby representable number. This number, like all floating-point numbers, is an exact rational, but it is wrong. It will differ from the correct answer by a tiny amount. For example, in single precision, adding .1 to 0.375:

13421773 * 2^-27 = 13421773/134217728 ~= 0.1

12582912 * 2^-25 = 12582912/33554432 = 0.375
gives us a result that is slightly smaller than the true answer of 19/40:

15938355 * 2^-25 = 15938355/33554432 < 19/40
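All of this is easy to watch from code. A small sketch (float here is IEEE single precision; the printed digits may vary slightly by runtime):

using System;

class FloatFacts
{
    static void Main()
    {
        // Exact: both operands and the result are representable,
        // so no rounding occurs.
        float a = 0.375f, b = 0.5f;
        Console.WriteLine(a + b == 0.875f);        // True

        // Inexact: 0.1 is silently substituted on input, and the
        // sum is rounded, landing slightly below 19/40.
        float c = 0.1f;
        Console.WriteLine((double)(c + 0.375f));   // 0.474999994039536

        // 16777217 is not representable in single precision.
        Console.WriteLine(16777217f == 16777216f); // True
    }
}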

--------

Further topics:

1. Denormalized numbers and gradual underflow. These give you a `soft landing' when you use numbers really close to 0.
2. Infinities.
3. Negative 0.0.
4. NaNs.
5. Traps and flags.

--------

Floating-point myths

Myth: Floating-point numbers are an approximation.
Fact: Floating-point numbers are exact, rational numbers.

Myth: Floating-point math is always only an approximation, and all results are off by a small amount.
Fact: Many floating-point operations produce exact, correct answers.

Myth: Floating-point math is mysterious.
Fact: It is the same old rational arithmetic you've used for ages.

--------

Bibliography

David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys 23(1), pp. 5--48, 1991. citeseer.ist.psu.edu/goldberg91what.html

All the papers at http://www.cs.berkeley.edu/~wkahan/

Joe Darcy's talk at http://blogs.sun.com/darcy/resource/Wecpskafpa-ACCU.pdf

Sunday, June 24, 2007

Why floating-point numbers are not intervals

To understand floating point arithmetic, we have to describe how floating-point numbers map to mathematical numbers and how floating point operations map to mathematical operations. Many people believe that floating point arithmetic is some variation of interval arithmetic. This view can be described in this picture:

                    float operation
     floating  ------------------------>  floating
      point                                 point
        |                                     ^
        |                                     |
        |                                     |
        |                                     |
        |                                     |
        V         interval operation          |
     interval ------------------------->  interval

Interval view

In this view, there is a mapping from floating-point numbers to intervals. For each floating-point operation, there is an operation on the intervals that the floats map to. The resulting intervals are then mapped to the corresponding floating-point result.

An alternative view is this:

                    float operation
     floating  ------------------------>  floating
      point                                 point
        |                                     ^
        |                                     |
        |                                  possible
        |                                  rounding
        |                                     |
        V      exact rational operation       |
     rational ------------------------->  rational

Rational view

In this view, each floating-point number represents an exact rational number. Operations on floating-point numbers are mapped to the corresponding exact rational operation. If necessary, the resulting exact rational number is mapped to an appropriate floating-point number by rounding. Since some rationals correspond exactly to floating-point values, rounding may not be necessary.

Are these views equivalent?

For our purposes, an interval is a set of real numbers we can describe by two endpoints, the lower bound and the upper bound. Either of the endpoints may be part of the interval or not. Given an arbitrary real number X, any pair of real numbers, one of which is below X and one of which is above it, will define an interval that contains X. There is no one-to-one mapping from rational numbers to intervals: a rational number such as 5/8 is contained by any interval with a lower bound less than 5/8 and an upper bound greater than 5/8. For example, the intervals [0, 30), (1/2, 1], and (9/16, 11/16) all contain 5/8. Also, any non-degenerate interval (that is, one whose endpoints are distinct) contains an infinite number of rationals.

These views cannot be semantically equivalent, but are they *computationally* equivalent? A computation performed under the interval model may yield the same floating-point result as one performed under the rational model. (Obviously, we would reject a model that didn't give the result `1 + 1 = 2'.) But do all computations performed by one model yield the same answer as the other? If not, then we can at least distinguish which model an implementation uses, even if we cannot decide which model is better.

To determine computational equivalence, we need to be precise about what operations are being performed. To aid in this, I have defined a tiny floating-point format that has the essential characteristics of binary floating point as described by IEEE 754. This format has 2 bits of significand and 3 bits of exponent. The significand can take on the values 0-3 and the exponent can take on the values 0-7. We can convert a tiny float to an exact rational number with this formula:

  (significand + 5) * 2^(exponent - 5)

There are 32 distinct tiny floating-point values; the smallest is 5/32 and the largest is 32. I use the notation of writing the significand, the letter T, then the exponent. For example, the tiny float 2T3 is equal to (2 + 5) * 2^(3 - 5) = 1 3/4. Since there are so few of them, we can enumerate them all:

0T0 = 5/32     0T2 = 5/8     0T4 = 5/2     0T6 = 10
1T0 = 3/16     1T2 = 3/4     1T4 = 3       1T6 = 12
2T0 = 7/32     2T2 = 7/8     2T4 = 7/2     2T6 = 14
3T0 = 1/4      3T2 = 1       3T4 = 4       3T6 = 16
0T1 = 5/16     0T3 = 5/4     0T5 = 5       0T7 = 20
1T1 = 3/8      1T3 = 3/2     1T5 = 6       1T7 = 24
2T1 = 7/16     2T3 = 7/4     2T5 = 7       2T7 = 28
3T1 = 1/2      3T3 = 2       3T5 = 8       3T7 = 32

Table of tiny floats

We can also enumerate the intervals on the real number line that map to tiny floats:

0T0 = [5/32  11/64]   0T2 = [ 9/16 11/16]
1T0 = (11/64 13/64)   1T2 = (11/16 13/16)
2T0 = [13/64 15/64]   2T2 = [13/16 15/16]
3T0 = (15/64  9/32)   3T2 = (15/16  9/8 )
0T1 = [ 9/32 11/32]   0T3 = [ 9/8  11/8 ]
1T1 = (11/32 13/32)   1T3 = (11/8  13/8 )
2T1 = [13/32 15/32]   2T3 = [13/8  15/8 ]
3T1 = (15/32  9/16)   3T3 = (15/8   9/4 )

0T4 = [ 9/4 11/4]     0T6 = [ 9 11] 
1T4 = (11/4 13/4)     1T6 = (11 13) 
2T4 = [13/4 15/4]     2T6 = [13 15] 
3T4 = (15/4  9/2)     3T6 = (15 18) 
0T5 = [ 9/2 11/2]     0T7 = [18 22] 
1T5 = (11/2 13/2)     1T7 = (22 26) 
2T5 = [13/2 15/2]     2T7 = [26 30] 
3T5 = (15/2  9)       3T7 = (30 32) 

Table of real intervals

Arithmetic on intervals is relatively easy to define. For closed intervals, we can define addition and subtraction like this:

  [a b] + [c d] = [a+c b+d]
  [a b] - [c d] = [a-d b-c]

Multiplication and division are a bit trickier:

  [a b] * [c d] = [min(ac, ad, bc, bd) max(ac, ad, bc, bd)]
  [a b] / [c d] = [min(a/c, a/d, b/c, b/d) max(a/c, a/d, b/c, b/d)]

The formulae for open and semi-open intervals are similar. A quick check: 1 is in the interval (15/16 9/8).

  (15/16 9/8) + (15/16 9/8) = (15/8 9/4)

2 is in the interval (15/8 9/4), so 1 + 1 = 2. So far, so good, but we will soon discover that interval arithmetic has problems.

The first hint of a problem is that the intervals are not of uniform size. The intervals at the low end of the line are much smaller than the intervals at the high end. Operations that produce answers much smaller than the inputs will have answers that span several intervals. For example, consider 20 - 14:

  20 is in the interval [18 22]
  14 is in the interval [13 15]
  [18 22] - [13 15] = [3 9]

We have 7 intervals that fall in that range, but none is large enough to cover the result.

The second hint of a problem comes when we multiply. It is true that 1 + 1 = 2, but does 1 * 2 = 2?

  1 is in the interval (15/16 9/8)
  2 is in the interval (15/8 9/4)
  (15/16 9/8) * (15/8 9/4) = (225/128 81/32)

This is clearly not the same as (15/8 9/4). It seems that we have lost the multiplicative identity. An adherent of the interval model will have to come up with a way to map these arbitrary result intervals back into the floating-point numbers.
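If you want to play with this yourself, here is a minimal sketch in Common Lisp of the tiny-float conversion and the closed-interval arithmetic above. The names (TINY->RATIONAL, INTERVAL-ADD, and so on) are my own invention, everything is done with exact rationals, and the open/semi-open endpoint bookkeeping is deliberately ignored (division is analogous to multiplication):

;; Exact rational value of a tiny float:
;; (tiny->rational 2 3) => 7/4, i.e. 2T3 = 1 3/4.
(defun tiny->rational (significand exponent)
  (* (+ significand 5) (expt 2 (- exponent 5))))

;; An interval is a dotted pair (lower . upper) of exact rationals.
(defun make-interval (lo hi) (cons lo hi))
(defun lower (i) (car i))
(defun upper (i) (cdr i))

(defun interval-add (a b)
  (make-interval (+ (lower a) (lower b)) (+ (upper a) (upper b))))

(defun interval-sub (a b)
  (make-interval (- (lower a) (upper b)) (- (upper a) (lower b))))

(defun interval-mul (a b)
  (let ((products (list (* (lower a) (lower b)) (* (lower a) (upper b))
                        (* (upper a) (lower b)) (* (upper a) (upper b)))))
    (make-interval (apply #'min products) (apply #'max products))))

;; 20 - 14 spans several tiny-float intervals:
(interval-sub (make-interval 18 22) (make-interval 13 15))
;; => (3 . 9)

;; 1 * 2 does not land back in the interval for 2:
(interval-mul (make-interval 15/16 9/8) (make-interval 15/8 9/4))
;; => (225/128 . 81/32)

In an upcoming post, I'll describe the rational model in more detail.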

Monday, June 18, 2007

If you have studied information theory, you might have noticed the question-bits procedure in the last post. It is the binary case of the Shannon entropy. This makes sense: the entropy is the amount of information we can get out of the answer to a yes-or-no question when we have already been told the cluster.

This also supplies us with a good idea of what makes a good cluster: the entropy after clustering should be as small as possible. That is, if the cluster is very homogeneous, then there aren't very many good yes-or-no questions that can narrow down your guess. Everything in the cluster has the same answer to almost all the questions. This would be a good clustering. If the cluster is very heterogeneous, then you'd expect a yes-or-no question to divide the cluster further. This would be a poor clustering.

As it turns out, other people have been down this path: Barbara, Couto, and Li wrote a paper describing `COOLCAT: An entropy-based algorithm for categorical clustering'. So I guess I gotta write some code to try this out....
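For a start, here is a minimal sketch of the general (non-binary) case, reusing the log2 helper from the question-bits post below. I'm assuming the cluster is summarized as a list of outcome probabilities summing to 1:

(defun entropy (probabilities)
  "Shannon entropy, in bits, of a discrete distribution given as a
   list of probabilities summing to 1.  Zero-probability outcomes
   contribute nothing."
  (- (reduce #'+ (mapcar (lambda (p)
                           (if (zerop p) 0 (* p (log2 p))))
                         probabilities))))

(entropy '(.5 .5))            ; => 1.0, same as (question-bits .5)
(entropy '(.25 .75))          ; => ~0.81, same as (question-bits .25)
(entropy '(.25 .25 .25 .25))  ; => 2.0 bits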

Friday, June 15, 2007

A Short Digression into a Question of Bits

Anyone experienced with 20 Questions knows that you should ask questions that tend to cut the remaining possibilities in about half. Computer hackers know that you can distinguish a maximum of 2^20 (1048576) distinct objects with twenty yes-or-no questions. Given a set of n distinct objects, you need about log2(n) bits to enumerate them. Intuitively, a yes-or-no question that divides your sample space in half gives you 1 bit of information.

But what if your question doesn't divide the space evenly? Again, it is intuitive that if your question has the answer `yes' for every object, then your question doesn't help narrow things down at all. This is true if your question has the answer `no' for every object as well. So if we were to plot the amount of information we get from asking a question based upon the percentage of `yes' answers, we'd have a curve that starts and ends at zero and has a peak of 1 bit at exactly the half-way point.

Ok, so suppose your yes-or-no question is `yes' for 1/4 of the objects. If the answer comes out yes, you have gotten the effect of two `binary' questions, which would be 2 bits of information. So the amount of information you get from a question depends on how much you trim the object space. Mathematically, this is -log2(x), where x is the ratio of the trimmed space to the original space. For example, -log2(1/2) = 1: 1 bit of information if you trim the space in half. -log2(1/4) = 2 bits if you trim the space to 1/4 its original size. As a check, -log2(1/1048576) = 20 bits --- if we go from 1048576 possibilities down to a single object, we have gained 20 bits of information.

But the reason we don't ask questions that trim several bits off the space of objects is that they are usually answered in the negative. If our yes-or-no question is `yes' for 1/4 of the objects, and the answer comes out `no', we only gain -log2(3/4) = 0.415 bits. (Ok, fractional bits are weird.) So when we ask a yes-or-no question we need to take into account both possible answers. Well, since the answer is `no' 3/4 of the time, we expect to get 0.415 bits 3/4 of the time, and we expect to get 2 bits the other 1/4 of the time. The average is 0.81 bits. We can write a formula for computing this sort of quantity:

(defun log2 (n)  "Log base 2" (/ (log n) (log 2)))
(defun question-bits (p)
  "Given the probability that a question is true, return
   the number of bits of information you can expect by
   asking the question."
  (let ((p~ (- 1.0 p)))
    (+ (* p  (- (log2 p)))
       (* p~ (- (log2 p~))))))

(question-bits .5) =>  1.0   ; a fifty-fifty question gives us one bit
(question-bits .25) =>  .81  ; a 1/4 question gives us 8/10 of a bit
(question-bits .75) =>  .81  ; same answer (of course)
(question-bits .1100278644383595) => .5
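(question-bits .1)  =>  .469 ; a 1/10 question nets less than half a bit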

If we could only come up with questions that were true about 1/10th of the time, we'd have to play 40 questions. Back to clustering next....

Thursday, June 14, 2007

Pondering about clusters

I don't know much about unsupervised machine learning, so I'm trying to figure it out for myself. This will probably be laughably naive to someone who knows this stuff. Go ahead and laugh.

I got on a Bayesian clustering kick yesterday and read a few papers. `Using Unlabeled Data to Improve Text Classification' by Kamal Paul Nigam seemed relevant. His thesis was that you could use a little bit of labeled data and a lot of unlabeled data to train a text classifier. I was wondering if this works in the limit of no labeled data at all. This looked like a good starting point, and I started thinking of the generative model of email and spam. I spent a few hours pondering this when I realized that I'm approaching this incorrectly. I need to understand what a `cluster' is in the first place before I can figure out how to get the machine to recognize one.

There seem to be a number of different theories about how to cluster data. It is hard to decide which is `best'. They all seem to have a certain amount of `magic', too, usually in the form of impenetrable math. I thought I knew some math when I went to college, but some of these things look pretty bizarre.

It isn't obvious to me what constitutes good clustering. At one limit, we have one cluster for each and every email, holding exactly that email. The clusters are *very* precise, but not very helpful. At the other limit, we have one cluster for the entire corpus of email. It isn't very precise or helpful. Somewhere in the middle we have a set of clusters, each of which most likely contains several messages. There has to be some measure that assigns a high value to this middle ground and a low value to the limiting cases.

If you're like me, you're already thinking about drawing circles around groups of little dots on some graph paper, and how to decide the right size of circle, where to put it, etc. But I'm getting ahead of myself. What exactly *is* a cluster? Why can't I just take every fifth email, put it in a bag, and call that a cluster?

The point of clustering is to capture some similarity between objects. If I am thinking of a particular email and I ask you a question about it, you'd have no idea at all what the answer might be. But if I were to tell you that the email I'm thinking of is from my `spam' cluster, then you'd be able to answer a few questions about it. For instance, you might infer that the likelihood that the email solicits money is *much* higher than if you didn't know the clustering. So if you know that an object belongs to a cluster, you know something about the object itself. A cluster must have a certain amount of information associated with it.

This is starting to make a little sense. If I were to randomly assign email to clusters, then the fact that an email was in cluster 5 would be fairly meaningless, because we'd expect cluster 5 to be more or less the same as cluster 4 and have the same statistical characteristics as the entire corpus of email. So the measure of a good clustering has to be related to the information content of that cluster. To be continued....

Wednesday, June 13, 2007

Buckets 'o Spam!

I have assets exceeding twenty million dollars in a Nigerian bank. I don't need to make a withdrawal because I can refinance my home with no money down and get cash back in the process. But this is nothing compared to my 37 inch, um, wallet. And I can stand at attention for hours, much to the delight of the lonely but attractive women who have decided to email me. Oh. You, too, eh?

A few years back I decided I should be proactive and try to figure out what to do about spam. I read Paul Graham's `Plan for Spam' and decided to make my own spam filter in Emacs. It took a while, mainly because I really wanted to understand the principles behind the Bayesian approach. Ultimately, I got my filter working, but it just didn't perform as well as I had hoped. I was considering more complex approaches, but the university installed SpamAssassin, it did a pretty good job, and I wasn't getting paid to design spam filters.

But now and again I want to pursue an idea that sounded promising to me. Nigerian 419 spam really doesn't have much in common with the discount drug ads, and neither shares many features with the cheap mortgage spams. I did a bit of research and found that the bulk of spam falls into about 10 major categories. A narrow Bayesian filter could easily pick out one of these categories, but you need a fairly broad filter to handle them all. And a broad Bayesian filter is more prone to false positives. It occurred to me that a set of Bayesian filters, each tuned to a particular category of spam, might perform much better than a single filter attempting to cover spam in general.

The problem is training such a filter. You need an already classified corpus of examples in order to compute the statistics for a Bayesian filter. But I *hate* this stuff! I'm not going to wade through thousands of spams and try to manually classify them into the various sordid categories. And I'd have to retrain the filters each time a new scam comes along.

So rather than filtering my mail, I want to categorize it. I want a program to look over my incoming mail and simply group it into buckets based on similarity. The categorizer won't know that a set of emails are spam. It doesn't need to. It will simply place all the similar email into the same bucket. When I go to read my mail, I'll take a peek in each bucket. ``V1AGR@'' --- nope, not interested. ``Continuation-passing-style'' --- ok, I'll read the email in this bucket. I'll have the added benefit of auto-generated folders for the email I *want* to read.

But I don't want to have a pre-defined set of categories. I want the machine to perform `unsupervised clustering'. This is a much harder task than simple Bayesian filtering. I've been thinking about this recently, and I figured I'd share my confusion. My current line of inquiry is looking into Bayesian clustering. Some of the literature is promising, but a friend with more experience in the field said that nothing she saw looked very good. More to come, I hope.

My Erdös Number

4

Thursday, June 7, 2007

Domain-specific Languages in Lisp

On 6/7/07, Grant Rettke wrote:
>
> That said, when I think of a DSL think about letting folks write
> "programs" like:
>
> "trade 100 shares(x) when (time < 20:00) and timingisright()"
>
> When I think about syntax transformation in Lisp I think primarily
> about language features.

In order to talk about domain-specific languages you need a definition of what a language is. Semi-formally, a computer language is a system of syntax and semantics that lets you describe a computational process. It is characterized by these features:

  1. Primitive constructs, provided ab-initio in the language.
  2. Means of combination to create complex elements from the primitives.
  3. Means of abstraction to control the resulting complexity.

So a domain-specific language would have primitives that are specific to the domain in question, means of combination that may model the natural combinations in the domain, and means of abstraction that model the natural abstractions in the domain.

To bring this back down to earth, let's consider your `trading language':
"trade 100 shares(x) when (time < 20:00) and timingisright()"

One of the first things I notice about this is that it has some special syntax. There are three approaches to designing programming language syntax. The first is to develop a good understanding of programming language grammars and parsers and then carefully construct a grammar that can be parsed by an LALR(1) or LL(k) parser. The second approach is to `wing it' and make up some ad-hoc syntax involving curly braces, semicolons, and other random punctuation. The third approach is to `punt' and use the parser at hand.

I'm guessing that you took approach number two. That's fine because you aren't actually proposing a language, but rather you are creating a topic of discussion.

I am continually amazed at the fact that many so-called language designers opt for option 2. This leads to strange and bizarre syntax that can be very hard to parse and has ambiguities, `holes' (constructs that *ought* to be expressible but the logical syntax does something different), and `nonsense' (constructs that *are* parseable, but have no logical meaning). Languages such as C++ and Python have these sorts of problems.

Few people choose option 1 for a domain-specific language. It hardly seems worth the effort for a `tiny' language.

Unfortunately, few people choose option 3. Part of the reason is that the parser for the implementation language is not usually separable from the compiler.

For Lisp and Scheme hackers, though, option 3 is a no-brainer. You call `read' and you get back something very close to the AST for your domain specific language. The `drawback' is that your DSL will have a fully parenthesized prefix syntax. Of course Lisp and Scheme hackers don't consider this a drawback at all.

So let's change your DSL slightly to make life easier for us:

 (when (and (< time 20:00)
            (timing-is-right))
     (trade (make-shares 100 x)))

When implementing a DSL, you have several strategies: you could write an interpreter for it, you could compile it to machine code, or you could compile it to a different high-level language. Compiling to machine code is unattractive because it is hard to debug, and you have to be concerned with linking, binary formats, stack layouts, etc. etc.

Interpretation is a popular choice, but there is an interesting drawback. Almost all DSLs have generic language features. You probably want integers and strings and vectors. You probably want subroutines and variables. A good chunk of your interpreter will be implementing these generic features and only a small part will be doing the special DSL stuff.

Compiling to a different high-level language has a number of advantages. It is easier to read and debug the compiled code, and you can make the `primitives' in your DSL rather high-level constructs in your target language.

Lisp has a leg up on this process. You can compile your DSL into Lisp and dynamically link it to the running Lisp image. You can `steal' the bulk of the generic part of your DSL from the existing Lisp: DSL variables become Lisp variables. DSL expressions become Lisp expressions where possible. Your `compiler' is mostly the identity function, and a handful of macros cover the rest.
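Here's a minimal sketch of that strategy in Common Lisp. TRADE, MAKE-SHARES, TIMING-IS-RIGHT, TIME-OF-DAY, and CURRENT-HHMM are hypothetical names invented for this example; WHEN, AND, and < come straight from Lisp. (In a real DSL you'd intern the DSL names in their own package --- a bare TIME would collide with CL:TIME.) The one genuinely foreign token, 20:00, is spelled here as the HHMM integer 2000 to avoid writing a reader macro:

(defstruct (shares (:constructor make-shares (count ticker)))
  count ticker)

(define-symbol-macro time-of-day (current-hhmm))

(defun current-hhmm ()
  "Current wall-clock time as an HHMM integer, e.g. 1:30pm => 1330."
  (multiple-value-bind (sec min hour) (get-decoded-time)
    (declare (ignore sec))
    (+ (* hour 100) min)))

(defun timing-is-right ()
  "Stub predicate; a real one would consult market data."
  t)

(defun trade (shares)
  (format t "~&Trading ~D shares of ~A.~%"
          (shares-count shares)
          (shares-ticker shares)))

;; The DSL program is now just a Lisp program:
(let ((x 'acme))
  (when (and (< time-of-day 2000)
             (timing-is-right))
    (trade (make-shares 100 x))))

Notice how little `compiling' happened: the DSL's conditionals, variables, and expressions are Lisp's own, and only the domain primitives needed definitions.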

You can use the means of combination and means of abstraction as provided by Lisp. This saves you a lot of work in designing the language, and it saves the user a lot of work in learning your language (*if* he already knows lisp, that is).

The real `lightweight' DSLs in Lisp look just like Lisp. They *are* Lisp. The `middleweight' DSLs do a bit more heavy processing in the macro expansion phase, but are *mostly* lisp. `Heavyweight' DSLs, where the language semantics are quite different from lisp, can nonetheless benefit from Lisp syntax. They'll *look* like Lisp, but act a bit differently (FrTime is a good example).

You might even say that nearly all Lisp programs are `flyweight' DSLs.

Wednesday, June 6, 2007

Plotting discrete events

A while back I got hit by a spam worm (Sobig) that sent me tens of thousands of emails. I wanted to plot the amount of email I got as a function of time so I could see how the attack evolved.

The problem with plotting discrete events (the arrival of a spam) is that the closer you zoom in on the problem, the flatter the plot looks. For instance, if I plotted the number of spams arriving with microsecond accuracy, you'd find that any microsecond interval contains one or zero emails. During the spam attack I was getting several emails a second, but after the attack was over I could go for hours without getting a single message.

I played around with a number of ways to visualize the rate of email arrival before I hit on this method: along the X axis you plot time; along the Y axis you plot the log of the amount of time since the last message (the log of the interarrival time). This makes a very nice plot which shows rapid-fire events as dark clusters near the X axis and periods of calm as sparse dots above the axis. The scale doesn't flatten out as you zoom in, either: the points plotted remain at the same height as you stretch the X axis. This allows you to `zoom in' on the rapid-fire events without causing the graph to `flatten out'.

I haven't seen this particular kind of plot used very often, but I have stumbled across it every now and then. For instance, this graph uses the technique. I don't know what it is plotting, but you can see there is definitely more of it on the right-hand side. I'd be interested in hearing if this kind of plot works for others.
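If you want to try it, here is a minimal sketch in Common Lisp. INTERARRIVAL-POINTS is a name I made up; it assumes the arrival times are a strictly ascending list of seconds, and the resulting pairs can be fed to whatever plotting program you like (the base of the log only rescales the Y axis):

(defun interarrival-points (timestamps)
  "Given a strictly ascending list of arrival times (in seconds),
   return a list of (time . log-interarrival) pairs to plot."
  (mapcar (lambda (prev this)
            (cons this (log (- this prev))))
          timestamps
          (cdr timestamps)))

(interarrival-points '(0 1 1.5 1.75 2 100 7200))
;; => ((1 . 0.0) (1.5 . -0.69) (1.75 . -1.39)    ; values approximate
;;     (2 . -1.39) (100 . 4.58) (7200 . 8.87))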

Saturday, June 2, 2007

Why I am a knee-jerk anti-loopist.

I've made these points before in different forums, but talking about LOOP has become popular again, so I'll just post a blog entry and point people at it. Here are the reasons I hate LOOP:

  1. It is hard to predict what the resulting code will look like after it is expanded. You may *think* you know what it is doing, but you will probably be fuzzy about the details. This leads to bugs.

  2. LOOP makes it extremely difficult to write meta-level code that can manipulate loop expressions. A MAP expression is just a function call, but LOOP requires a serious state machine to parse. You can macro-expand the LOOP away, but the resulting expansion is almost as hard to understand as the original.

  3. LOOP operates at a low level of abstraction. One very frequent use of iteration is when you wish to apply an operation to a collection of objects. You shouldn't have to specify how the collection is traversed, and you certainly shouldn't have to specify a serial traversal algorithm if it doesn't matter. LOOP intertwines the operation you wish to perform with the minutiae of traversal --- finding an initial element, stepping to the next element, stopping when done. I don't care about these details. I don't care if the program LOOPs, uses recursion, farms out the work to other processors, or whatever, so long as it produces the answer I want. (See the toy example after this list.)

  4. LOOP encourages users to think linearly, sequentially, and iteratively. They start trying to flatten everything into a single-pass, first-to-last, state-bashing mess. Suddenly they forget about recursion. I've seen this happen over and over again. It literally damages the thought process. (This is why I'm suspicious of named-let style loops.)
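To illustrate point 3 with a toy example of my own (NUMBERS is any list of numbers):

;; LOOP: the traversal -- binding X, stepping, collecting -- is explicit.
(loop for x in numbers
      collect (* x x))

;; MAPCAR: just the operation and the collection.
(mapcar (lambda (x) (* x x)) numbers)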

Wednesday, May 23, 2007

Added the computer history web site to my All Lisp Search. I'm also checking to see if I can release some old lisp code I wrote.

Monday, May 21, 2007

Why does comp.lang.lisp attract so many kooks, crackpots, cranks and loonies?

Wednesday, May 2, 2007

Put Evil up on the code site. Evil is my C++ to Scheme compiler. It was intended to be a hack to help me translate a pile of nasty C++ code for the Common Larceny project. It is definitely more of a hack than a production compiler, but it did what I wanted.

Thursday, April 26, 2007

I always keep my computer up to date with the latest software, patches, drivers, etc. On Patch Tuesday I get the latest upgrades and install them. There were some new video drivers for my ATI card this week. I was happy to install them because the old ones were at least a year old.

But I noticed last night that the video on the laptop LCD display was kind of blurry. I investigated and found that although the display is physically 1400 x 1050, the video was set to 1280 x 1024. Not only that, but there was no longer the option to set it to the correct resolution. It was obvious what was wrong, but it was not obvious how to fix it. I searched around and couldn't find anything that discussed this particular problem.

Eventually I decided that I'd just restore to the save point of a day or two ago. After all, that is what the save points are for, right? In theory, anyway. I tried to do the restore, but each time, after it rebooted the computer several times, it popped up a window telling me that it couldn't restore to the save point and that it had left the machine alone. (At least it only wasted my time.) What is the point of saving periodically if you can't actually use the save?

After some more poking around, I finally figured out that I could roll back the device driver update from the device properties. That worked. This is ridiculous, though.

Monday, April 23, 2007

Stupid software

Stupid blog lost a post. How am I supposed to rant? It's bad enough that I feel like I'm talking to myself here, but the damn computer isn't even listening.

Schematics

Schematics for the K-Machine are now available.

Wasteland

I was going to write an essay about tail recursion, and I wanted to demonstrate my point with some actual code. I was going to do it in C# because I happen to have some profiling tools that will help me get some numbers. I wanted a small, portable benchmark-like program that anyone could run and reproduce my results. I suppose I should have known. The world of object-oriented programming is overwhelmingly imperative. Most of the functions I found return `void'. This is the level of sophistication I expect from assembly code! The bulk of the code is still hacking and thrashing at memory locations. Of course these are named and encapsulated memory locations, so I guess that makes it OK.

Wednesday, April 18, 2007

Just spent the day trying to recover from a software update. It seems that `Spyware Doctor' is a new part of `Google Pack' and although it does detect spyware, it is a real dog and resource hog. It also had a habit of crashing on shutdown. What a piece of crap.

Tuesday, April 17, 2007

I'm kinda tired of digging through old LMI stuff, so I'm going to let that rest a while. I was thinking what else I should put up, but work has been extra busy and I'm not feeling inspired today.

Friday, April 6, 2007

Added SYS directory to jrm-code-project. Subversion is a piece of crap. It lost the updates to a few of the files when I added them, and then I couldn't get my local machine to sync with the server to reload the files. I had to delete and replace the damn files that shouldn't have been lost in the first place. Hasn't anybody heard of transactions? Hasn't anybody heard of re-synchronizing? Hasn't anybody heard of sanity checking? Good grief. ChangeSafe never had these sorts of problems, and it had special commands for recovering from them in case they ever happened.
Added some more code to Bits and Pieces.

Monday, April 2, 2007

Redid my Google Pages. Created an All Lisp custom search engine.