Monday, April 13, 2009

Taking a header

It took a while to iron out the contract between the stack closure implementation and the garbage collector. Timing was a critical factor, and memory integrity bugs have a special ability to survive one iteration of garbage collection. The bug confuses the garbage collector on the first iteration and causes it to corrupt memory as it is collecting, but the collector doesn't notice it at this point. On the next garbage collection it encounters the memory corruption and craps out. The problem is that the source of the corruption is long, long gone.

When Ken Sinclair left LMI to go to Symbolics, I inherited the garbage collector. As I mentioned in an earlier post, the garbage collector tended to get blamed because it discovered memory corruption caused by other modules. It was simply the case that if you ran the collector often enough, you'd eventually trigger a collection when someone wasn't expecting it and cause a crash. My microcode tracer cause the GC to run a lot, so I saw most of the errors.

Here's a good example. In uc-lambda-array.lisp you'll find this microcode:
((m-a) m-array-pointer)
((m-b) q-pointer m-array-header) ;Don't put headers in accumulators!
((m-d) m-array-rank)
(m-e) m-array-origin)
((m-s) m-array-length)
This subroutine copies the pointer, header, rank, origin, and length of an array to the m-a, m-b, m-d, m-e, and m-s registers. (The popj-after-next is the microcode subroutine return. You always put it one instruction ahead of when the return actually happens because of pipelining.) The point of this copying is to cache the information needed by the most recent array operation. If the next array operation uses the same array, it runs much faster because it doesn't have to go to main memory.

Note that second line with the comment. An array header is the marker that appears before the array contents. It is used for various things, but one of the most important is that it contains a count of contiguous words of memory that follow. When the garbage collector is scavenging memory to relocate pointers, it checks the array header to see if the array contains ‘boxed’ pointers, or ‘unboxed’ immediate values. If the array is unboxed, the scavenger skips the block of memory containing the array contents.

Enter the law of unintended consequences.

If you browse the microcode, you'll see comments about ‘sequence breaks’. A sequence break is point where the microcode is allowed to transfer control to a different stack-group. The Lisp machine simulated multitasking through time-division multiplexing. When a sequence break occurred, the current state of the micro-machine was saved in a main memory data structure. The accumulators (the m registers with single letter names) are stored in the data structure. If you were very unlucky, you could take a sequence break right after performing an array operation on an unboxed array. The array header would then be stored in the data structure. This wasn't very likely to happen because the m-b accumulator is one of the most frequently used registers and the array header would usually be overwritten within microseconds.

If you didn't turn on the garbage collector, this rare occurrence would be irrelevant. If you did turn on the garbage collector, it would be mostly irrelevant unless you happened to be very unlucky indeed. If you had a sequence break immediately after performing an array operation on an unboxed array, and the garbage collector decided at that point to scavenge the page that contained the saved m-b accumulator, then it would skip over some of the following pointers.

You might be surprised to learn that even that doesn't usually lead to memory corruption. You had to be really, really unlucky. If control is transferred back to this stack-group, the registers are re-loaded from memory. The memory system enforced a read barrier to support incremental garbage collection, so the pointers in the registers would have been adjusted at that point. The problem only occurs when you
  1. Perform an array operation on an unboxed array.
  2. Immediately swap out the stack.
  3. While the stack is swapped out, start and complete a garbage collection. Some registers (the ones stored in memory locations after where m-b is store) may contain bad pointers at this point.
  4. Start another garbage collection before re-assigning the registers.
This long chain of events is typical of the bugs I encountered. They were quite hard to reproduce. This bug had an easy fix. When we strip the data type off the array-header, the garbage collector no longer recognizes the object as a special case and simply steps over it to the next word in memory.

There were at most a couple dozen of these sort of bugs in the microcode. They were a real bear to find, but over the course of several months I eliminated a lot of them. Within the year you could expect to run the garbage collector without crashing the machine.

No comments: