Wednesday, January 27, 2016

Latest amusement

Here's a procedure that CDRs down a list until it hits the end:
(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))
        tail)))
The inner loop of the procedure compiles to this:
loop-3:
 (cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))

 (mov q (r 0) (@r 4))         ;; Fetch 'tail'

 (mov q (r 1) (r 0))          ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))         ;; Check for pair?, jump if not
 (jne (@pcr label-14))

 (and q (r 0) (r 5))          ;; Mask the tag

 (mov q (r 1) (@ro 0 8))      ;; Fetch the CDR

 (mov q (@r 4) (r 1))         ;; Assign 'tail'

 (jmp (@pcr loop-3))          ;; Do it again.
On my laptop, each iteration of this loop takes a few nanoseconds.

I noticed that there seems to be a lot more going on than CDRing down a list. The interrupt check is responsible for a lot of the overhead. MIT/GNU Scheme polls for interrupts. The compiler inserts an interrupt check at every procedure entry and backwards branch. This ensures that only a small, bounded number of instructions can run between interrupt polls. The interrupt poll and the heap check are simultaneous because of a hack. To interrupt Scheme, you save the "Top of Heap" pointer and set it to zero. This causes the heap check to fail as if there were an out of memory condition. The out of memory handler first checks to see if this was actually an interrupt masquerading as an out of memory, and restores the heap pointer if so.

The two instructions,

(cmp q (r 7) (@r 6))         ;; Interrupt check
 (jge (@pcr interrupt-12))
perform the check. The first compares register 7 with the contents of memory that register 6 points at. The convention for the compiler is that register 7 contains the free pointer and register 6 holds the address of MEMTOP. The interrupt check does a memory cycle.

The reason you poll for interrupts is that you need the virtual machine to be in a coherent state because the interrupt could attempt to run arbitrary code. The garbage collector in particular has to be able to parse the machine state
at an interrupt. Interrupts are polled at apply time. Before application, the interrupts are checked. If an interrupt is to be processed, a continuation is pushed that will do the original application, and an interrupt handler is applied instead.

In MIT-Scheme, the virtual machine is a stack machine, and at apply time the entire state of the virtual machine is pointed to by the stack pointer. This is a good state to be in when you handle interrupts or garbage collect. But this means that arguments are passed on the stack. This instruction:

(mov q (r 0) (@r 4))
loads register 0 with the contents of memory at register 4. (The compiler convention is that register 4 is the stack pointer.) In other words, it is fetching the value of the argument tail from the stack. This is the second memory cycle.

Between interrupt polls, the compiled code can freely manipulate Scheme objects without worrying about being embarrassed by the garbage collector. But at apply time, when a possible interrupt poll could happen, the compiled code must put the virtual machine back into a coherent state. In particular, modified values are stored back on the stack.I This instruction just before the jump puts the value of tail back on the stack before jumping to the top of the loop.

(mov q (@r 4) (r 1)) 
That's three memory cycles in addition to the actual fetching of the CDR in this instruction:
(mov q (r 1) (@ro 0 8))
Of course these are quite inexpensive memory cycles because the top of stack is cached, but there is at least the time time it takes to validate the cache entry.

The interrupt poll occurs every time around the loop, so we copy the arguments back and forth between the stack and registers on each iteration. If we unroll the loop we can avoid some of the copying:

(define (last-cdr list)
  (let loop ((tail list))
    (if (pair? tail)
        (loop (cdr tail))
        tail)))

(define (last-cdr2 list)
  (let loop ((tail list))
    (if (not (pair? tail))
        tail
        (let ((tail (cdr tail)))
          (if (not (pair? tail))
              tail
              (loop (cdr tail)))))))
The loop in last-cdr2 compiles to this:
loop-6:
 (cmp q (r 7) (@r 6))           ;; Interrupt check
 (jge (@pcr interrupt-15))  

 (mov q (r 0) (@r 4))           ;; Fetch 'tail'

 (mov q (r 1) (r 0))            ;; Extract the tag
 (shr q (r 1) (&u #x3a))

 (cmp b (r 1) (&u 1))           ;; Check for pair?
 (jne (@pcr label-17))

 (and q (r 0) (r 5))            ;; Mask the tag

 (mov q (r 1) (@ro 0 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 1))           ;; Assign 'tail'

 (mov q (r 0) (r 1))
 (shr q (r 0) (&u #x3a))        ;; Extract the tag

 (cmp b (r 0) (&u 1))           ;; Check for pair?
 (jne (@pcr label-19))

 (and q (r 1) (r 5))            ;; Mask the tag

 (mov q (r 0) (@ro 1 8))        ;; Fetch the CDR

 (mov q (@r 4) (r 0))           ;; Assign 'tail'

 (jmp (@pcr loop-6))            ;; Do it again
If I count correctly, there are six memory cycles per iteration, two of which are CDRs. We also avoid a single jmp instruction per iteration.

Here's the timing on my machine:

(test-last-cdr)
3.643 nanoseconds per cdr
3.643 nanoseconds per cdr
3.711 nanoseconds per cdr
3.643 nanoseconds per cdr
3.576 nanoseconds per cdr

(test-last-cdr2)
2.766 nanoseconds per cdr
2.699 nanoseconds per cdr
2.834 nanoseconds per cdr
2.699 nanoseconds per cdr