Abstract Heresies: Differential Benchmarking

I wanted to find out the fastest way to return multiple values in the Microsoft CLR. In order to measure this, I needed to have two pieces of code that `do the same thing' except for the way values are returned. For a baseline, I used the Takeuchi function:

        static int Tak (int x, int y, int z)
        {
            return (!(y < x))
                ? z
                : Tak (Tak (x - 1, y, z),
                       Tak (y - 1, z, x),
                       Tak (z - 1, x, y));
        }

Tak (18, 12, 6) makes 63609 function calls and 47706 subtractions.

This version uses an out variable to return the value:

        void TakOut (out int answer, int x, int y, int z)
        {
            if (!(y < x)) {
                answer = z;
                return;
            }
            int x1;
            int y1;
            int z1;
            TakOut (out x1, x - 1, y, z);
            TakOut (out y1, y - 1, z, x);
            TakOut (out z1, z - 1, x, y);
            TakOut (out answer, x1, y1, z1);
        }

But the version using an out variable has an additional argument. I wanted to separate the cost of passing an additional argument from the total cost of using an out parameter, so I wrote another version of Tak that took an extra dummy argument:

        static int Tak1 (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1 (z, Tak1 (y, x - 1, y, z),
                       Tak1 (x, y - 1, z, x),
                       Tak1 (dummy, z - 1, x, y));
        }

These run pretty quick, so I put them in a loop that runs them ten times and measures the total time. I ran this loop a thousand times and recorded the results.

I didn't expect to find Tak1 to be substantially faster than Tak, so I did some investigating. Look at the graph for Tak:

You can clearly see a change in performance (there are two vertical segments). I had forgotten that my machine was in power save mode and was dynamically tuning the CPU clock speed. Somewhere in there the machine decided to hit the gas. I put the machine in performance mode to max out the clock and got a much more expected result.

In this graph, Tak is the normal Tak function. Tak1a and Tak1b are defined as follows:

        static int Tak1a (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1a (dummy, Tak1a (dummy, x - 1, y, z),
                            Tak1a (dummy, y - 1, z, x),
                            Tak1a (dummy, z - 1, x, y));
        }

        static int Tak1b (int dummy, int x, int y, int z)
        {
            return (!(y < x))
            ? z
            : Tak1b (z, Tak1b (y, x - 1, y, z),
                        Tak1b (x, y - 1, z, x),
                        Tak1b (dummy, z - 1, x, y));
        }

We can see from the graph that it takes an additional 8-10 ns to process the additional argument, depending on whether we change the value or keep it the same. I'll speculate that this is because the processor can notice the re-use of the argument and avoid a register rename or something along those lines.

So what about passing an out variable?

From this graph we can see that returning an answer in an out variable makes the code a factor of 2 slower. Using a ref variable is slightly worse, but that may be because we have to initialize the ref variable before we use it. In either case, that's quite a hit.

It isn't likely that returning an object from the heap will be fast, but we ought to actually mesaure it. This is easily done by simply declaring the return value to be of type object. The CLR will be forced to box the object on the heap.

Surprisingly, this isn't all that much slower than using an out or ref variable. What about returning a struct on the stack?

It seems that this is quicker than using an out variable.

If you look closely at these benchmarks you will see that the baseline performance of Tak varies a bit from run to run. This could be because of a number of reasons. It doesn't matter all that much because I was interested in the difference between the various mechanisms of returning values.

All the code above was run with these settings:

# LENOVO ThinkPad T61
# 2 processors
#     GenuineIntel
#     Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
#     x64 Family 6 Model 15 Stepping 11
#     2401 MHz
# Working set:  9969664
# Overflow checking is disabled.
# Compiled in RELEASE mode.
# Debugger is not attached.
# Stopwatch is high resolution.  14318180 ticks per second.
# 
# 2008-Aug-18 20:50:11
# Microsoft Windows NT 6.0.6001 Service Pack 1
# CLR 2.0.50727.3053
# 
# See http://eval.apply.googlepages.com/ for further information.
# 
# TakTest, Version=1.0.3152.23099, Culture=neutral, PublicKeyToken=null
# Debugging Flags: IgnoreSymbolStoreSequencePoints
# JIT Optimizer enabled
# JIT Tracking disabled

I've put the code online if you want to play with it.

Abstract Heresies

Monday, August 18, 2008

Differential Benchmarking

No comments: