Wednesday, October 28, 2009

update

I've tweaked and optimized things in my interpreter so that the median time for sboyer.scm is now at 2.1 seconds (down from the baseline of 6.56 seconds). It wouldn't be too hard to push it below 2 seconds with some more specialization of the conditionals, but that's not where I want to go.
Wednesday, October 21, 2009
Now that flat environments are working fairly well, the bottleneck has shifted. The top three items on the ‘top of stack’ histogram are the primitive procedures
CAR, PAIR?, and NULL?. Let's look at the code for PrimitiveCombination1:
public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    // Evaluate the argument.
    Control unev0 = this.arg0;
    Environment env = environment;
    object ev0;
    while (unev0.EvalStep (out ev0, ref unev0, ref env)) { };
    if (ev0 == Interpreter.UnwindStack) {
        ((UnwinderState) env).AddFrame (new PrimitiveCombination1Frame0 (this, environment));
        answer = Interpreter.UnwindStack;
        environment = env;
        return false;
    }

    // Call the primitive.
    if (this.method (out answer, ev0)) {
        TailCallInterpreter tci = answer as TailCallInterpreter;
        if (tci != null) {
            answer = null; // dispose of the evidence
            // set up the interpreter for a tail call
            expression = tci.Expression;
            environment = tci.Environment;
            return true;
        }
        else
            throw new NotImplementedException ();
    }
    else
        return false;
}

The method for CAR is this:
public static bool PrimitiveCar (out object answer, object arg0)
{
    answer = ((Cons) arg0).Car;
    return false;
}

There's quite a bit of noise in that code, so here is the main code path:
bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    // Evaluate the argument.
    Control unev0 = this.arg0;
    Environment env = environment;
    object ev0;
    while (unev0.EvalStep (out ev0, ref unev0, ref env)) { };
    if (ev0 == Interpreter.UnwindStack) { ... }

    // Call the primitive.
    if (this.method (out answer, ev0)) { ... }
    else
        return false;
}

The while statement is the tail-recursion trampoline. The immediately following conditional is there to support first-class continuations. The method is expected to stuff its result in answer and return false, unless it needs to make a tail-recursive call, in which case it returns true.
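The trampoline convention is easy to see in miniature. Here is a minimal sketch in Python (not the interpreter's C#); the names and the toy countdown ‘expression’ are mine, not jrmscheme's:

```python
def eval_step(state):
    # One unit of work.  Returning True means "I have rewritten the
    # expression in place; loop again" (a tail call).  Returning False
    # means state["answer"] now holds the final result.
    if state["expr"] == 0:
        state["answer"] = "done"
        return False
    state["expr"] -= 1
    return True

def trampoline(state):
    # The while loop is the trampoline: every tail call becomes another
    # iteration instead of another stack frame, so even a 100000-deep
    # chain of tail calls uses constant stack space.
    while eval_step(state):
        pass
    return state["answer"]

print(trampoline({"expr": 100000, "answer": None}))  # prints: done
```

A directly recursive version of the same computation would blow the Python stack; the trampoline turns the recursion into iteration.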
This is pretty general, and it has to be if we are going to support primitives like call-with-current-continuation, but the number one primitive procedure is CAR, and we can handle that one a bit more efficiently. The first thing we need to do is inline the call to CAR:
bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    // Evaluate the argument.
    Control unev0 = this.arg0;
    Environment env = environment;
    object ev0;
    while (unev0.EvalStep (out ev0, ref unev0, ref env)) { };
    if (ev0 == Interpreter.UnwindStack) { ... }

    // Attempt to cast.
    Cons theCell = ev0 as Cons;
    if (theCell == null) {
        ... enter error handler ...
    }
    else {
        answer = theCell.Car;
        return false;
    }
}

This avoids several operations. We no longer push ev0 as an argument just to pop it off in the primitive, and we no longer return a flag for a conditional branch. This is about half of the work.

The other half is in evaluating the argument to CAR. The debug version shows that the argument to CAR is usually bound in an argument position in the enclosing lambda. If that is the case, then there is no need for the tail-recursion trampoline or the continuation handling code. We can just fetch the argument.
bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    // Attempt to cast.
    Cons theCell = environment.ArgumentValue (this.argumentOffset) as Cons;
    if (theCell == null) {
        ... enter error handler ...
    }
    else {
        answer = theCell.Car;
        return false;
    }
}

If the primitive cannot throw an error (for example, PAIR?), it is even simpler:
public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    answer = environment.ArgumentValue (this.offset) is Cons;
    return false;
}

This code takes almost a negligible amount of time.
Unfortunately, specializing on the primitive procedure and the argument type like this requires a lot of code. Each one-argument primitive can be specialized in at least six different ways, and each way is its own separate class. C# does not have macros, and templates don't quite work for this sort of thing. The alternative is some code-generation mechanism, but I've been too lazy to automate that (I've been expanding these things by hand). On the other hand, the unspecialized mechanism is not unreasonable if it isn't called too often, so by specializing only a handful of the top primitives we get a lot of performance improvement for very little work.
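To make the payoff concrete, here is a sketch in Python (not the interpreter's C#) of the difference between the generic path and a fused ‘CAR of an argument’ node. The class and function names are invented for illustration:

```python
class Cons:
    def __init__(self, car, cdr):
        self.car, self.cdr = car, cdr

# Generic path: fully evaluate the argument node, then make an
# indirect call through the primitive's method slot.
def prim_car(arg):
    return arg.car

def generic_primitive_combination1(method, evaluate_arg, env):
    ev0 = evaluate_arg(env)   # general argument evaluation
    return method(ev0)        # indirect call into the primitive

# Specialized path: the node knows it is CAR of an argument, so it
# fetches the binding directly and inlines the pair check.
def specialized_car_of_argument(env, offset):
    cell = env[offset]
    if not isinstance(cell, Cons):
        raise TypeError("car of a non-pair")
    return cell.car

env = [Cons(1, None)]
print(generic_primitive_combination1(prim_car, lambda e: e[0], env))  # 1
print(specialized_car_of_argument(env, 0))                            # 1
```

Both paths compute the same answer; the specialized one simply has nothing left to dispatch on at run time.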
MIT-Scheme has reflective operations for manipulating its own SCode. In order to maintain compatibility with these, the specialized primitives inherit from PrimitiveCombination1. The code for EvalStep is overridden, but we retain the rest of the class. This allows the Scheme level to reflect on this code as if it were unoptimized code. A good example of this is the pretty printer. When it encounters an optimized PrimitiveCarA (primitive CAR of an argument), it treats it just like a PrimitiveCombination1 with an operator of CAR and an argument of some Variable.

Just by optimizing CAR, CDR, PAIR?, and NULL?, the median time for sboyer drops to 2.62 seconds.
Tuesday, October 20, 2009
Allocating ValueCell objects at every function application takes a bit of time. It isn't always necessary, either. The point of creating a value cell is so that side-effects on variables have the appropriate sharing semantics. Most variables are not side effected.

It is easy to find the side effected variables by tree-walking the body of a lambda expression. If none of the lambda-bound variables are assigned to, then we can create more efficient environment structures at apply time.
StaticEnvironments have this structure:

class StaticEnvironment : LexicalEnvironment
{
    readonly ValueCell [] bindings;

    internal StaticEnvironment (Closure closure, object [] initialValues)
        : base (closure)
    {
        object [] formals = closure.Lambda.Formals;
        this.bindings = new ValueCell [initialValues.Length];
        for (int i = 0; i < initialValues.Length; i++)
            this.bindings [i] = new ValueCell (formals [i], initialValues [i]);
    }
    ...
}

We define SimpleEnvironments like this:

class SimpleEnvironment : LexicalEnvironment
{
    readonly object [] bindings;

    internal SimpleEnvironment (Closure closure, object [] initialValues)
        : base (closure)
    {
        this.bindings = initialValues;
    }
    ...
}

Earlier, I posted a table of frame sizes after a long run:
 [0]   99656436
 [1]  817178031
 [2]  219585322
 [3]   45556970
 [4]    6140170
 [5]    2857104
 [6]     702372
 [7]     448574
 [8]       3080
 [9]          1
[10]        568
[11]        156
[12]          3
[13]          2
[14]        177
[15]          6

More than 99% of the environment frames have three or fewer variables. Instead of holding the variable values in a vector, it is worthwhile to simply enumerate them as fields in the environment object itself. Here is SmallEnvironment1:
class SmallEnvironment1 : LexicalEnvironment
{
    readonly object binding0;

    internal SmallEnvironment1 (Closure closure, object binding0Value)
        : base (closure)
    {
        this.binding0 = binding0Value;
    }
    ...
}

There are similar classes for SmallEnvironment0, SmallEnvironment2, and SmallEnvironment3. Although these environments are quite specialized, they account for the vast majority of environments that are dynamically created. This leads to a good performance increase.
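The same trick can be sketched in Python (the blog's code is C#): one class per small frame size, with the values in fixed fields instead of an allocated binding vector. The names here are illustrative:

```python
class SmallEnvironment1:
    # Fixed fields instead of a binding vector; __slots__ keeps the
    # instance as a flat, dict-free record.
    __slots__ = ("parent", "binding0")
    def __init__(self, parent, v0):
        self.parent, self.binding0 = parent, v0
    def argument_value(self, offset):
        assert offset == 0
        return self.binding0

class SmallEnvironment2:
    __slots__ = ("parent", "binding0", "binding1")
    def __init__(self, parent, v0, v1):
        self.parent, self.binding0, self.binding1 = parent, v0, v1
    def argument_value(self, offset):
        # The offset selects a field directly; no array indexing.
        return self.binding0 if offset == 0 else self.binding1

env = SmallEnvironment2(None, "x-value", "y-value")
print(env.argument_value(1))  # prints: y-value
```

Since the histogram shows frames of size 0-3 dominate, a handful of such classes covers more than 99% of all applications.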
sboyer now takes a median time of 3.504 seconds. It now turns out that variable lookup is not the dominating factor in the performance. I'll discuss the next problem in the next post.
Monday, October 19, 2009
Oh yeah, those flat environments
I did finally get flat environments working the way I want, and then refactored the code to be simpler and clearer. The basic idea is that rather than chasing down environment frames looking for a binding, we keep the lexical variables in a vector. A closure now looks like this:
class Closure
{
    protected readonly Lambda closureLambda;
    protected readonly Environment closureEnvironment;
    protected readonly ValueCell [] staticBindings;
    ...
}

When we need the value of a lexical variable, we find it at a precomputed offset in the staticBindings. (They are ‘static’ because the location of the binding cell doesn't move.) When we create a closure, we need to copy some of the static bindings from the parent environment. For this we need a StaticMapping.
class StaticMapping
{
    int [] offsets;
    ...
}

The StaticMapping is stored in the StaticLambda from which we construct the StaticClosure. We copy the bindings when we construct the StaticClosure.
bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    answer = new StaticClosure (this, environment.BaseEnvironment, environment.GetValueCells (this.staticMapping));
    return false;
}

And the code for GetValueCells is this:
internal override ValueCell [] GetValueCells (StaticMapping mapping)
{
    int count = mapping.Size;
    ValueCell [] cells = new ValueCell [count];
    for (int index = 0; index < count; index++) {
        int o = mapping.GetOffset (index);
        if (o < 0)
            cells [index] = this.bindings [(-o) - 1];
        else
            cells [index] = this.Closure.StaticCell (o);
    }
    return cells;
}

The StaticMapping encodes argument bindings as negative numbers and static bindings as positive numbers. The appropriate cells are copied from the parent environment.

Going to flat environments makes a substantial improvement. Our baseline median time for the sboyer benchmark was 6.596 seconds. With flat environments, the median time is now 4.346 seconds.

Variable lookup is no longer the bottleneck in the interpreter. Procedure application is. It was worth the tradeoff, but let's see what procedure application involves:
bool Apply (out object answer, ref Control expression, ref Environment environment, object [] args)
{
    if (args.Length != this.arity)
        throw new NotImplementedException ();
    expression = this.closureLambda.Body;
    environment = new StaticEnvironment (this, args);
    answer = null; // keep the compiler happy
    return true;
}

internal StaticEnvironment (Closure closure, object [] initialValues)
    : base (closure)
{
    object [] formals = closure.Lambda.Formals;
    this.bindings = new ValueCell [initialValues.Length];
    for (int i = 0; i < initialValues.Length; i++)
        this.bindings [i] = new ValueCell (formals [i], initialValues [i]);
}

The big problem is that we are allocating ValueCells for the argument bindings. We'll deal with this next.
Thursday, October 15, 2009
Short solution
No takers? Oh well. Here's the solution for yesterday's short exercise.
The amount of remote data is small. We only have 10K records and only add a handful a day. This will all fit in memory just fine.
Now imagine that the cache is a bucket with a small hole in it. Over time, as the cache entries become stale, the cache slowly empties. We can calculate a long-term rate at which entries expire. (This isn't actually what happens, though. The entries expire en masse, but let's pretend.) If we continue to fill the bucket at the same rate as the bucket empties, it will always be full. Any slower and the bucket will empty. Any faster and it will overflow.
The remote database can deliver one entry in 150ms, but we don't want to saturate that connection (there are other clients and we presumably want to perform work other than cache refresh). So let's dedicate 2% of the client bandwidth to the cache. If we fetch no more than one entry every 50 * 150ms = 7.5 seconds, we'll remain under 2%. Of course this means that we cannot let the records expire at a rate faster than this. If our cache has 10K records and they expire at a rate of one record every 7.5 seconds, the cache will be empty in 75K seconds, or 20.8 hours. We set the expiration time on an entry at a tad more than that and we're all set. If 20.8 hours is unacceptably stale, we can shorten it by reserving more bandwidth for the cache. There is a limit, though. With a handful of clients each consuming 2%, there would be a small constant load on the server. If we increased each client to consume 10-12%, the server will be spending most of its time servicing client caches.
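The arithmetic above is worth checking; here is a quick back-of-the-envelope in Python using the numbers from the post:

```python
fetch_time = 0.150        # seconds per heavyweight remote fetch
bandwidth_share = 0.02    # fraction of client bandwidth spent on refresh
records = 10_000          # size of the remote data set

# Spending 2% of the bandwidth means one fetch every fetch_time / 0.02.
refresh_interval = fetch_time / bandwidth_share   # seconds between fetches

# Refilling one entry per interval, the whole cache cycles in:
drain_hours = records * refresh_interval / 3600

print(refresh_interval)        # prints: 7.5
print(round(drain_hours, 1))   # prints: 20.8
```

So the expiration time has to be a bit over 20.8 hours, exactly as stated; shrinking it means raising the 2% bandwidth share.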
Wednesday, October 14, 2009
Short exercise
I seem to have solved the bug in my flat environment code. I'll post on that in a day or two.
An interesting problem came up at work a couple of days ago, and it makes for a good engineering problem. Let me see if I can describe this in an interesting way.
A database application generates charts that describe certain things about the data. For instance, you might generate a chart of a moving average of field X for all entries where the username is ‘jmarshall’. This is pretty standard stuff for a simple database. There's a problem, however. For various reasons, some of the information we like to plot is stored elsewhere and we have to fetch it. This is a heavyweight request that takes 150ms. We have on the order of 10K records, and we typically examine several hundred to create a chart. If we need the remote data, we're looking at several hundred of these heavyweight requests. A big chart can take two and a half minutes to draw and the user gets very impatient.

A cache was added in order to make things remotely bearable. The database changes relatively slowly (a handful of records a day), and the charts are not designed for pinpoint accuracy, so there is no requirement that the cache be completely fresh or for it to provide a transaction consistent view of the data. A very simple cache that remembers the data for a few hours is acceptable. This alleviated a lot of the pain because now users could generate different variations of their charts and compare them in real time.

But there is still the problem with the ‘startup transient’. If no one has recently generated a chart, that first one you make will take forever as the cache is loaded. So the problem is getting rid of the startup transient.
I'll give the following hints:
- Assume memory is not an issue.
- Assume multiple writers to the database and the remote data. You will not get ‘update’ notifications.
- Several instances of the application may be running at once, but they cannot communicate with each other. (So an application must not saturate the database with requests.)
- Fetching part of a record or the additional info is no cheaper than fetching the whole thing, so a timestamp or checksum comparison will not be faster.
- The data does not have to be transactionally consistent.
- The data can be somewhat stale, but there should be a limit.
Monday, October 12, 2009
grr... bugs
As I was trying to write out the details of how the flat environments work, I had an idea for optimizing top-level references. While I was working on that, I discovered problems in the mapping tables that are carried around in the partial environments. Now I'm debugging those.
Kbob asks:
In your profiling, have you measured the distribution of frame sizes?

Indeed I have. Here is a histogram of frame sizes after a 10 billion evaluation run:

Frame Size      Count
    [0]      99656436
    [1]     817178031
    [2]     219585322
    [3]      45556970
    [4]       6140170
    [5]       2857104
    [6]        702372
    [7]        448574
    [8]          3080
    [9]             1
   [10]           568
   [11]           156
   [12]             3
   [13]             2
   [14]           177
   [15]             6
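As a check on the earlier claim that more than 99% of environment frames have three or fewer variables, here is the same histogram totalled in Python:

```python
# Frame-size histogram from the 10 billion evaluation run.
counts = {0: 99656436, 1: 817178031, 2: 219585322, 3: 45556970,
          4: 6140170, 5: 2857104, 6: 702372, 7: 448574, 8: 3080,
          9: 1, 10: 568, 11: 156, 12: 3, 13: 2, 14: 177, 15: 6}

total = sum(counts.values())
small = sum(c for size, c in counts.items() if size <= 3)

print(total)                              # prints: 1192128972
print(round(100 * small / total, 2))      # prints: 99.15
```

About 99.15% of the roughly 1.19 billion frames bind at most three variables, which is what motivates the SmallEnvironment0 through SmallEnvironment3 classes.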
Sunday, October 11, 2009
To make a short story a bit longer
Optimizing argument lookup helps quite a bit, but the non-argument variables take a lot of time. We waste time by walking the environment chain and scanning the lambda arguments of each frame. There are a couple of ways of speeding this up. The first thing is to notice that if there are no incrementals in any of the frames in the lookup path, then the location of a lexical variable is fixed relative to the current environment. You can use a lexical address of the form n-frames-back by offset in frame. This gives you a lot of bang for the buck, and is what the MIT-Scheme interpreter used to do. (Now that they have a compiler, they just punt and deep search each time.) jrmscheme did the same thing, but despite the optimizations, it simply takes time to walk up the environment chain. Not only is time a problem, the storage requirements are an issue. There are bindings that are no longer in use in the deeper frames, and these are being retained along with the bindings of interest. I decided to change to a ‘flat’ environment implementation.
A flat environment does not have a chain of frames. It has a single frame with two components. The binding vector that is common to all environment structures, and a vector of pointers to the lexically visible binding cells. The lexically visible cells are indirect because the bindings may be shared among several closures. For example, look at this code:
(let ((counter 0)
      (increment 1))
  ...
  (lambda () (set! counter (+ counter increment)))
  (lambda () counter)
  ...)

We have two lambda expressions that must be closed over the same variable counter. In a chained environment model, invoking either closure would extend the common shared environment where the counter and increment variables were bound. In a flat environment model, invoking the second closure would lead to an environment structure like this:

bindings: #()  ;; empty argument bindings
lexical:  #(<pointer to counter>)

while invoking the first closure would lead to this:

bindings: #()  ;; empty argument bindings
lexical:  #(<pointer to counter> <pointer to increment>)

This is fairly straightforward in principle, but the devil is in the details. It took me a fair amount of time to fiddle around with the code and come up with something that worked.
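The cell indirection is the important part. A small Python model (a Cell class standing in for jrmscheme's ValueCell; all names here are mine) shows why the lexical vectors copy pointers to cells, rather than values:

```python
class Cell:
    def __init__(self, value):
        self.value = value

# The let frame binds two cells.
counter = Cell(0)
increment = Cell(1)

# Each closure copies *pointers* to the cells it uses into its own
# flat lexical vector; the cells themselves are shared.
bump_lexical = [counter, increment]   # first lambda: needs both
read_lexical = [counter]              # second lambda: needs counter only

def bump():
    c, i = bump_lexical
    c.value = c.value + i.value       # (set! counter (+ counter increment))

def read():
    return read_lexical[0].value      # counter

bump()
bump()
print(read())  # prints: 2 -- the set! is visible through the shared cell
```

Had the lexical vectors copied the values instead of the cells, read() would still see 0 after the bumps, and the sharing semantics of set! would be broken.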
The first step was to add a new phase to the interpreter. A partial evaluation walk is done on the code before calling the actual eval. In the partial evaluation phase, the code tree is rebuilt and certain nodes in the tree are replaced. A partial environment is kept as the code is walked. When a variable is encountered, it is looked up in the partial environment. If it is found in the immediately enclosing frame, it is turned into an Argument variable with the appropriate offset. If it is not found at all, it is turned into a FreeVariable. If it is found in an intermediate frame, things get tricky. We have to consider the lambda expressions that are associated with the intervening frames and decide how the variable will be accessed.

When we partially evaluate a lambda expression, we first have to check whether we'll need a first-class environment for it. Fortunately, this is an easy test: we just have to walk the body of the expression and look for a call to the the-environment special form. If we need a first-class environment, then we simply punt on any optimization other than Argument variables because we'll just do a deep search every time we access a lexical variable. On the other hand, if we don't need a first-class environment, we construct a PartialStaticClosure for the lambda expression. When we partially evaluate the PartialStaticClosure, we construct a PartialStaticEnvironment that we use for partially evaluating the body of the lambda expression.

The entire process of partial evaluation is very similar to that of ‘normal’ evaluation, but there are a couple of differences. Partial evaluation produces a PartialResult that contains the rebuilt code tree. Conditionals are partially evaluated along both branches and the PartialResults are combined into a new conditional node. Partial closures are partially evaluated right away (instead of being returned) as if they had been applied to arguments, but of course the resulting PartialEnvironment doesn't contain runtime values, it only contains binding information.

A PartialClosure leads to the construction of a PartialEnvironment. Variable lookup in a PartialEnvironment either returns the argument offset, or an indication that a deep search is necessary. The rebuilt lambda expression becomes a StandardLambda; when it is actually evaluated, a StandardClosure is built. When the StandardClosure is applied to arguments, the bindings are placed in a StandardEnvironment. Variable lookup in a StandardEnvironment uses deep search, and in this way we retain the old chained environment model for first-class environments when necessary.

But if we don't need a first-class environment, we do things a bit differently. The lambda expression becomes a StaticLambda (the word static in this case meaning that the location of the bindings never changes). When a StaticLambda is partially evaluated, a PartialStaticClosure is created. This leads to the construction of a PartialStaticEnvironment where variable lookup happens differently. At regular eval time, a StaticLambda creates a StaticClosure, and when a StaticClosure is applied, it creates a StaticEnvironment.

Alas, I've arrived at work. More details later...
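The variable-classification walk described above can be miniaturized in Python. This sketch handles only the immediately-enclosing-frame case (an Argument) and the free case; it punts on the tricky intermediate-frame analysis, and all the names are mine, not jrmscheme's:

```python
def rewrite(expr, formals):
    # Variables bound by the immediately enclosing lambda become
    # ("argument", offset); anything else stays a deep-searched variable.
    if isinstance(expr, str):
        if expr in formals:
            return ("argument", formals.index(expr))
        return ("variable", expr)
    if expr[0] == "lambda":                  # ("lambda", params, body)
        _, params, body = expr
        # A nested lambda starts a fresh frame; the full walk would
        # also track the outer frames here instead of forgetting them.
        return ("lambda", params, rewrite(body, params))
    return tuple(rewrite(e, formals) for e in expr)   # a combination

print(rewrite(("lambda", ["x"], ("f", "x", "y")), []))
# prints: ('lambda', ['x'], (('variable', 'f'), ('argument', 0), ('variable', 'y')))
```

The rebuilt tree is what the real eval then runs: x is fetched by offset, while f and y fall back to the search path.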
Friday, October 9, 2009
To make a long story short
I was mulling over a bunch of ideas about benchmarking, experimentation, and the philosophy of science. You'll be happy to hear that I'll spare you the details. The end result is this: the sboyer benchmark, when given an argument of 1 (591777 rewrites), runs with a median time of 6.596 seconds on jrmscheme. No optimizations are turned on, but the code has been run through SF with (declare (usual-integrations)).
This is pretty lackluster, but there is no place to go but up!

So let's start with the optimizations. I have a lot of code that instruments the interpreter when I run in debug mode. According to the instrumentation, from the initial REPL prompt, through loading and running the benchmark, printing the results and halting at the next prompt, there are 102,320,483 evaluations. The breakdown is this:
Variable                  39028733
PrimitiveCombination1     20796045
Conditional               13248184
PrimitiveCombination2      6294095
Quotation                  6148442
Combination2               4155507
Combination1               3576004
Sequence2                  2572345
StandardLambda             2313932
Assignment                 1666782
Combination                1420518
Disjunction                1087864
PrimitiveCombination3         6873
Sequence3                     4451
Access                         388
PrimitiveCombination0          295
Definition                      22
Comment                          2
StandardExtendedLambda           1

Variable lookup is the number one item, and optimizing it will help tremendously.
If we keep track of the lexical depth of the variables (that is, how many environment frames we had to search), we find that the vast majority of the variables are found in the topmost frame.
Recall the code for variable lookup:
object DeepSearch (Symbol name)
{
    int offset = this.envClosure.FormalOffset (name);
    if (offset == -1) {
        if (this.incrementalDefinitions.ContainsKey (name))
            return this.incrementalDefinitions [name];
        else
            return this.envClosure.Environment.DeepSearch (name);
    }
    return this.bindings [offset];
}

Most of the time, the offset is a non-negative index into the binding array. But the offset is determined by the position of the name in the lambda expression that binds the variable, and that can be determined before we evaluate the code. So we'll introduce a new SCode type Argument that is the same as Variable except that it also contains the offset. When we construct a lambda expression, we'll walk the scode body and replace references to the lambda parameters with Argument structures. We only replace the references if they occur at the same binding level. If there is an intermediate lambda expression, we cannot replace the references. We can then add an ArgumentValue method to our environment objects:
object ArgumentValue (int offset)
{
    return this.bindings [offset];
}

The EvalStep of a Variable looks like this:
public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    answer = environment.DeepSearch (this.varname);
    return false;
}

The EvalStep for an Argument will be like this:

public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    answer = environment.ArgumentValue (this.offset);
    return false;
}

There is one more change. If the variable is not an Argument then we know one thing for sure: we won't find it in the bindings of the topmost environment. It may be added later as an incremental, but it certainly won't be in the formal parameters and it is pointless to search there. We'll save time by avoiding that search:

object NonArgumentValue (Symbol name)
{
    // Skip the formals of this frame.
    if (this.incrementalDefinitions.ContainsKey (name))
        return this.incrementalDefinitions [name];
    else
        return this.envClosure.Environment.DeepSearch (name);
}

object DeepSearch (Symbol name)
{
    int offset = this.envClosure.FormalOffset (name);
    if (offset == -1) {
        if (this.incrementalDefinitions.ContainsKey (name))
            return this.incrementalDefinitions [name];
        else
            return this.envClosure.Environment.DeepSearch (name);
    }
    return this.bindings [offset];
}

Basically we unrolled the loop by one and tossed out the dead code.
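The two lookup paths can be modelled in Python (invented names; the real code is C#): Argument nodes index the frame directly, and the non-argument path skips the current frame's formals before falling back to the deep search:

```python
class Frame:
    def __init__(self, parent, names, values, incrementals=None):
        self.parent = parent
        self.names = list(names)
        self.bindings = list(values)
        self.incrementals = dict(incrementals or {})

    def argument_value(self, offset):
        # Argument node: precomputed offset, no search at all.
        return self.bindings[offset]

    def deep_search(self, name):
        # Variable node: scan this frame's formals, then the
        # incrementals, then recurse up the parent chain.
        if name in self.names:
            return self.bindings[self.names.index(name)]
        if name in self.incrementals:
            return self.incrementals[name]
        return self.parent.deep_search(name)

    def non_argument_value(self, name):
        # The unrolled first iteration: a non-Argument variable cannot
        # be among this frame's formals, so skip straight past them.
        if name in self.incrementals:
            return self.incrementals[name]
        return self.parent.deep_search(name)

glob = Frame(None, ["y"], [10])
env = Frame(glob, ["x"], [1])
print(env.argument_value(0))        # prints: 1
print(env.non_argument_value("y"))  # prints: 10
```

Both calls return the same values a plain deep_search would; the point is what they avoid doing: the first does no search, and the second never scans the innermost frame's formals.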
These two tiny changes have a big effect on the running time. Our median time is now 5.488 seconds, or 0.832 times the original time.
We're by no means done with this sort of optimization, but I'll get to that soon...
Establishing a baseline
It turned out to be harder to establish a baseline than I expected. I decided to use the
One problem is to find a benchmark that runs long enough to be interesting, but not so long as to be tedious. When I'm on the shuttle bus running my laptop in power-saver mode with jrmscheme running under the debugger,
I know that computers these days exhibit variability in benchmark timings, but the amount of variation was higher than I hoped for. In two back-to-back runs I got timings of 18.81 seconds in the first, 14.03 in the second (no, it wasn't a `warm cache'. The third run was 17.1 seconds). This means that I have to perform a fair number of runs to characterize the performance.
This got me thinking, so I'm going to cut this post short. I'll have more to say later...
sboyer
benchmark because it a ‘classic’ and it does a bunch of lispy stuff. If I were writing a paper or something I'd use a whole bunch of benchmarks, but I'm just doing this casually.One problem is to find a benchmark that runs long enough to be interesting, but not so long as to be tedious. When I'm on the shuttle bus running my laptop in power-saver mode with jrmscheme running under the debugger,
sboyer
can take almost twelve minutes to complete. But when I'm at home running the same laptop in ‘performance’ mode with an optimized compile of jrmscheme outside the debugger, it takes about six seconds. Benchmarking under the debugger is pointless, of course, but when I'm running in debug mode I have a lot of extra code instrumenting the interpreter, and this is useful for seeing exactly which optimizations are doing what at the low level.

I know that computers these days exhibit variability in benchmark timings, but the amount of variation was higher than I hoped for. In two back-to-back runs I got timings of 18.81 seconds in the first and 14.03 in the second (no, it wasn't a ‘warm cache’; the third run was 17.1 seconds). This means that I have to perform a fair number of runs to characterize the performance.
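Given run-to-run swings of several seconds, a median over repeated runs is a sensible summary statistic. A minimal timing harness might look like this (illustrative Python; `median_time` is a hypothetical helper, not part of jrmscheme):

```python
import statistics
import time

def median_time(thunk, runs=7):
    # Run the benchmark several times and take the median, which is far
    # less sensitive to outlier runs than the mean.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        thunk()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# The three back-to-back timings quoted above illustrate the spread:
print(statistics.median([18.81, 14.03, 17.1]))   # 17.1
```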
This got me thinking, so I'm going to cut this post short. I'll have more to say later...
Thursday, October 8, 2009
Obvious idea
This is obvious, but interesting nonetheless.
The semantics of a programming language are a mathematical model of the language. The semantics relate expressions in the programming language to mathematical objects that (presumably) are well-understood. Denotational semantics attempt to find a mapping between programs and mathematical objects (the recursive partial functions). Operational semantics attempt to describe mathematically each step of the computation as it evolves.
But we can look at this another way: the denotational semantics translate the program to a mathematical structure, the operational semantics mimic the program via mathematical structures. In other words, we can consider mathematics to be a target language and say this:
- Denotational semantics ‘compiles’ a program into math.
- Operational semantics ‘interprets’ a program in math.
This means we can turn the above statements around:
- A compiler is a denotational semantics from the source language to the target language.
- An interpreter is an operational semantics from the interpreted language to the implementation language.
Some proposed additions to Scheme or Lisp (fexprs or ubiquitous first-class environments, for example) pretty much make it impossible to compile code. This is because the user can come along after the fact and change the meaning of the code in a way that the compiler could not possibly foresee (by, for example, a fexpr rewriting the source code or someone injecting a new binding in an environment). If you cannot compile the code, you cannot develop a non-trivial denotational semantics for it, either. (The semantics would trivially be the limit as n goes to infinity of S(n), where S(n) is n steps of the operational semantics, provided the limit exists. In other words, the only way to determine the meaning of the program is to run it and see what happens.)
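The compile/interpret duality can be made concrete on a toy arithmetic language (illustrative Python, not from the post): the ‘denotational’ route translates an expression into a host-language function once, while the ‘operational’ route rewrites the expression itself one step at a time.

```python
# Toy expression language: numbers, or ("+", e1, e2) as nested tuples.

def denote(expr):
    # Denotational flavor: 'compile' the expression into a host-language
    # function once; running the program is running the function.
    if isinstance(expr, (int, float)):
        return lambda: expr
    op, e1, e2 = expr
    f1, f2 = denote(e1), denote(e2)
    return lambda: f1() + f2()

def step(expr):
    # Operational flavor: perform one rewrite step on the expression.
    op, e1, e2 = expr
    if isinstance(e1, tuple):
        return (op, step(e1), e2)
    if isinstance(e2, tuple):
        return (op, e1, step(e2))
    return e1 + e2

def run(expr):
    # 'Interpret' by stepping until the expression is a value.
    while isinstance(expr, tuple):
        expr = step(expr)
    return expr

prog = ("+", 1, ("+", 2, 3))
print(denote(prog)())   # 6, via 'compilation' to a host function
print(run(prog))        # 6, via small-step rewriting
```

Both routes agree on the answer; they differ in whether the translation to the host language happens once up front or is interleaved with execution.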
Incremental definitions
Kbob asked me:
Incremental definitions. Just to be clear, you mean symbols defined inside a lambda?
(lambda (a b)
  (define c ...)
  ...)
c is an incremental definition?
That's a really good question. In Scheme, internal definitions are transformed into a letrec form, so the example above would be turned into:
(lambda (a b)
  (letrec ((c <compute value for c>))
    <body>))

=>

(lambda (a b)
  (let ((c <unbound>))
    (let ((temp <compute value for c>))
      (set! c temp))
    <body>))

=>

(lambda (a b)
  ((lambda (c)
     ((lambda (temp) (set! c temp)) <compute value for c>)
     <body>)
   <unbound>))

So internal defines will end up as lambda variables.
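The rewrite is mechanical. Here is a sketch that performs it on S-expressions represented as nested Python lists (`rewrite_internal_define` is a hypothetical helper handling only the single-define case shown above):

```python
# Rewrite (lambda (a b) (define c <expr>) <body>) into the nested-lambda
# form.  S-expressions are nested lists; symbols are strings.
UNBOUND = ["quote", "unbound"]   # stand-in for the <unbound> marker

def rewrite_internal_define(lam):
    assert lam[0] == "lambda" and lam[2][0] == "define"
    formals, (_, name, expr), body = lam[1], lam[2], lam[3]
    # ((lambda (name) ((lambda (temp) (set! name temp)) expr) body) <unbound>)
    return ["lambda", formals,
            [["lambda", [name],
              [["lambda", ["temp"], ["set!", name, "temp"]], expr],
              body],
             UNBOUND]]

print(rewrite_internal_define(
    ["lambda", ["a", "b"], ["define", "c", 42], ["+", "a", "c"]]))
```

A real expander also handles multiple internal defines (one letrec binding per define) and defines appearing after other body forms, but the shape of the output is the same.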
We only create environments when we evaluate lambda expressions, and all the necessary variables should be in the parameter list. The only way to add a binding to an environment that wasn't in the parameter list when the environment was created is to evaluate code that wasn't there when the environment was created. There is really only one way to do this, and that is to use eval.

Although eval itself is not commonly used, the read-eval-print loop and load are. Both of these need to use eval (or an equivalent).

There are five different strategies for dealing with incrementals:
1. Disallow eval - Use a “closed-world” model in which code cannot be evaluated and loaded at runtime. There can be no incrementals in this model. A REPL would have to be implemented as a meta-circular evaluator.
2. Restrict eval - Do not permit define expressions to be evaluated. A REPL and load would be a problem, but this could be an option in a limited debugger.
3. Restrict access to environments - There are certain distinguished standard environments that can be used with eval. These can be specially constructed to support incremental definitions. If there is no mechanism for gaining access to the environments created by applying a closure, then normal environments would not need incrementals.
4. Support the-environment - Early versions of Scheme had the special form the-environment that would return the current environment to the user as a first-class object. The returned environment (and all the intermediate environments up to the global environment) would have to support incremental definitions, but otherwise they would not be necessary. Fortunately, it is simple to examine the code at eval time to see if there is a call to the-environment within it. If there is not, then there is no need for incrementals.
5. Go wild - Have a primitive procedure that can extract the environment from an arbitrary closure object and allow this environment to be passed to eval. All environments must support incremental definitions because there is no way to predict if they would be necessary.
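Option 4 relies on a simple check: scan the code at eval time for a call to the-environment. That scan is just a tree walk (illustrative sketch over list-structured S-expressions; it is naive in that it also fires on quoted occurrences, which merely errs on the safe side):

```python
def uses_the_environment(expr):
    # Walk an S-expression (nested lists, strings for symbols) looking
    # for the-environment.  If it never appears, none of the environments
    # this code creates need to support incremental definitions.
    if expr == "the-environment":
        return True
    if isinstance(expr, list):
        return any(uses_the_environment(sub) for sub in expr)
    return False

print(uses_the_environment(["lambda", ["x"], ["f", ["the-environment"]]]))  # True
print(uses_the_environment(["lambda", ["x"], ["+", "x", 1]]))               # False
```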
Back in the day, MIT Scheme chose option 5. The primitive procedure closure-environment would extract the environment object from a closure, and you could call eval with that object. The special form the-environment was also supported. Unfortunately, this means that all environments must be constructed in such a way that they can be manipulated by the interpreter. Furthermore, it means that all variable lookup must be done by deep searching the environment chain.

By the time the MIT Scheme compiler was written, however, it was realized that arbitrary evaluation in any environment had more disadvantages than advantages, so the MIT Scheme compiler uses option 4. If you write code that uses the-environment, the compiler will invoke the interpreter on that code. (This is so the compiler doesn't have to know anything about the interpreter implementation except the entry point. You don't want to have to maintain two separate compatible copies of the environment code.) If you don't use the-environment, the compiler is free to do what it wants. The closures created by the compiler cannot be destructured with closure-environment, but the compiler does emit debugging information to allow you to inspect what is left of the environment (if anything) once the compiler has optimized the code. The MIT-Scheme debugger uses option 2 to somewhat simulate the effect of evaluating code within the debugger.

One of the fun things about Lisp and Scheme is exploring the basement. MIT-Scheme has ‘subprimitives’ that directly manipulate the underlying memory. If you don't know what you're doing, you can easily corrupt memory and crash the system, but the system uses these to bootstrap itself. In the cold load sequence for MIT Scheme there is this interesting function:
(define (*make-environment parent names . values)
  ((ucode-primitive system-list-to-vector)
   (ucode-type environment)
   (cons ((ucode-primitive system-pair-cons)
          (ucode-type procedure)
          ((ucode-primitive system-pair-cons)
           (ucode-type lambda)
           unspecific
           names)
          parent)
         values)))

This creates a first-class environment structure with names and values by constructing tagged pointers to raw data. It is constructed to appear as if it were created by invoking a lambda expression with an unspecific body. This is used to construct the initial top-level environments for the REPL. In packag.scm, you'll find this:
(define null-environment
  ((ucode-primitive object-set-type)
   ((ucode-primitive object-type) #f)
   (fix:xor ((ucode-primitive object-datum) #F) 1)))

This creates a magic object that is recognized as the root of an environment chain.
My version of MIT-Scheme (call it jrm-scheme) interprets the MIT-Scheme SCode without modification, so it boots and runs with the code above. By default, I have to build environment structure that is compatible with the MIT-Scheme interpreter because the Scheme code sometimes examines the structure reflectively. But the point wasn't to make a slavish re-implementation; it was to explore the implementation possibilities under real-world constraints. So the next few posts are going to discuss how I implemented environments.
Wednesday, October 7, 2009
Details
Here's how I'm handling environments and variables in my version of MIT-Scheme. You can skip this post if you don't care about implementation details.
If you remember your beginning Scheme course, you know that an environment consists of a series of ‘frames’ where each frame contains a table of variable bindings and a pointer to the parent frame. When you look up a variable, you search for the binding starting at the nearest frame and work your way up to the root frame (the global environment, or some top-level environment). At least that's the model. Reality is a bit more complicated.
When you apply a closure to some arguments, you create a new environment frame that contains the bindings of the lambda-expression associated with the closure. The frame for a standard environment has a pointer to the closure that was invoked and a vector of value cells for the arguments.
class StandardEnvironmentFrame
{
    StandardClosure closure;
    ValueCell [] bindings;
}

class StandardClosure
{
    StandardLambda lambdaExpression;
    StandardEnvironmentFrame environment;
}

class StandardLambda
{
    Symbol [] parameterList;
    SCode body;
}

Notice the StandardEnvironmentFrame is missing both the table of bindings and the pointer to the parent frame. You can get the parent frame by dereferencing the closure and dereferencing its frame. We're only storing the binding cells in the environment, but you can figure out what names are associated with them by looking at the parameter list of the lambda expression (again, via the closure). Why the roundabout way? Two reasons: it is parsimonious in storage, and there is a legacy reason. Suppose you are evaluating a function call expression. If you are using a stack machine, the natural strategy is to evaluate each subexpression in turn and push the value on the stack. If you do this in the right order, you'll have the closure and the argument values in a contiguous chunk of stack. This is (almost) exactly the layout of the environment frame.

Variable lookup is simple, but tedious. First we scan the parameter list of the lambda. If the variable name is found, the offset in the lambda list will match the offset in the bindings vector, so we just go fetch it. Otherwise, we search the next frame. This is almost the correct code:
object DeepSearch (Symbol name)
{
    int offset = this.envClosure.FormalOffset (name);
    if (offset == -1) {
        return this.envClosure.Environment.DeepSearch (name);
    }
    return this.bindings [offset];
}

It's not quite correct because we didn't take incremental definitions into account. The environment of the closure could be a first-class environment. If this is the case, the user could evaluate a new definition in the environment. The StandardFrame needs a place to store these extra definitions.
class StandardEnvironmentFrame
{
    StandardClosure closure;
    ValueCell [] bindings;
    Dictionary<Symbol, ValueCell> incrementalDefinitions;
}

And we have to modify DeepSearch to check the incrementals.
object DeepSearch (Symbol name)
{
    int offset = this.envClosure.FormalOffset (name);
    if (offset == -1) {
        if (this.incrementalDefinitions.ContainsKey (name))
            return this.incrementalDefinitions [name];
        else
            return this.envClosure.Environment.DeepSearch (name);
    }
    return this.bindings [offset];
}

As I've mentioned before, the speed of variable lookup is one of the most important factors that determines interpreter performance. The performance of the naive code above is pretty bad. In debugging mode, I sample the top of stack every few microseconds and record what the interpreter is doing. The number one item on the top of the stack is variable lookup.
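The pointer-chasing involved can be mirrored in Python to make it explicit (hypothetical names; the real code is the C# above). Note how the frame reaches its parent through the closure, and how names live in the lambda's parameter list rather than in the frame itself:

```python
# Illustrative Python mirror of the C# structures above.
class Lambda:
    def __init__(self, parameters):
        self.parameters = parameters       # formal parameter names

class Closure:
    def __init__(self, lam, environment):
        self.lam = lam
        self.environment = environment     # frame the closure was closed over

    def formal_offset(self, name):
        try:
            return self.lam.parameters.index(name)
        except ValueError:
            return -1

class Frame:
    def __init__(self, closure, bindings):
        self.closure = closure
        self.bindings = bindings           # parallel to the lambda's parameters
        self.incremental = {}              # definitions evaluated after creation

    def deep_search(self, name):
        offset = self.closure.formal_offset(name)
        if offset == -1:
            if name in self.incremental:
                return self.incremental[name]
            return self.closure.environment.deep_search(name)
        return self.bindings[offset]

class Root:
    # Stand-in for the global environment at the end of the chain.
    def __init__(self, table): self.table = table
    def deep_search(self, name): return self.table[name]

root = Root({"x": 10})
frame = Frame(Closure(Lambda(["a"]), root), [1])
frame.incremental["y"] = 2                 # simulate evaluating (define y 2) here
print(frame.deep_search("a"), frame.deep_search("y"), frame.deep_search("x"))
```

Every miss costs a linear scan of the parameter list plus a dictionary probe before the search even moves up the chain, which is why this lookup dominates the profile.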
At this point it is worth running a couple of benchmarks to establish baseline performance.