Thursday, July 19, 2007

Reporting Performance

I've seen a lot of different ways of reporting performance benchmark results. The most popular way seems to be simply listing the elapsed running time. Sometimes there is a bar chart or graph. Every now and then the relative speed is reported against some baseline. There are some oddballs that report benchmark times in Hertz. All of these methods suffer from the same problem: the relative performance is not scale-free. Suppose you ran a benchmark on four different machines. Machine A takes 1.2 seconds, machine B takes 2.4 seconds, machine C takes 60.1 seconds, and machine D takes 80.7 seconds. Machine A is clearly winner coming in twice as fast as the next entry. But although machine C beats out machine D by a full 20 seconds, it isn't twice as fast as D. B would have to double to catch up with A, but D only needs to run 25% faster to catch up with C. If you plot these results on a graph or bar chart, you'd see that gap between D and C is much larger than the gap between B and A, but large gaps are to be expected when the time scale is larger. This problem is easy to fix. Simply take the logarithm of the run time. In the example above, the log times for A, B, C, and D are 0.18, 0.88, 4.10, and 4.39 respectively. Now A and B differ by .7 while C and D differ by .29 It is obvious that C is closer to D than B is to A. To give a real-life example, I grabbed the results of the fannkuch benchmark from the Computer Language Shootout. First, the timings as reported in the shootout:

C gcc                   5.82
D Digital Mars          6.57
Clean                   6.78
Lisp SBCL #2            7.20
Oberon-2 OO2C           7.39
Pascal Free Pascal #3   7.60
D Digital Mars #2       7.80
OCaml                   8.06
Eiffel SmartEiffel      8.22
Ada 95 GNAT             9.78
C++ g++                 9.95
Nice                   13.89
Java 6 -server         14.63
Scala #2               14.67
CAL                    16.93
BASIC FreeBASIC #2     17.15
SML MLton              17.93
Haskell GHC #2         18.32
C# Mono                18.85
Fortran G95            23.02
Forth bigForth         24.46
Haskell GHC            69.09
Smalltalk VisualWorks  84.80
Erlang HiPE            97.60
Erlang HiPE #2         98.30
Scheme MzScheme       139.75
Scala                 299.77
Haskell GHC #3        441.82
Lua #3                582.46
Pike                  598.58
Python                641.36
Mozart/Oz #2          739.06
Perl #2               906.29
PHP                  1165.02
Tcl #2               1456.69
Ruby                 1786.76


Now the timings in log scale:

C gcc                  1.76
D Digital Mars         1.88
Clean                  1.91
Lisp SBCL #2           1.97
Oberon-2 OO2C          2.00
Pascal Free Pascal #3  2.03
D Digital Mars #2      2.05
OCaml                  2.09
Eiffel SmartEiffel     2.11
Ada 95 GNAT            2.28
C++ g++                2.30
Nice                   2.63
Java 6 -server         2.68
Scala #2               2.69
CAL                    2.83
BASIC FreeBASIC #2     2.84
SML MLton              2.89
Haskell GHC #2         2.91
C# Mono                2.94
Fortran G95            3.14
Forth bigForth         3.20
Haskell GHC            4.22
Smalltalk VisualWorks  4.44
Erlang HiPE            4.58
Erlang HiPE #2         4.59
Scheme MzScheme        4.94
Scala                  5.70
Haskell GHC #3         6.09
Lua #3                 6.37
Pike                   6.39
Python                 6.46
Mozart/Oz #2           6.61
Perl #2                6.81
PHP                    7.06
Tcl #2                 7.28
Ruby                   7.49


There are a couple of features that weren't obvious in at first. There is a noticable gap between C++ g++ and Nice, a huge gap between Forth bigForth and Haskell GHC #2, and another between Scheme MzScheme and Scala. Lua #3 is pretty close to Pike, even though they differ by 16 seconds real-time, but Nice and g++ are further apart even though they differ by less than 4 seconds of real time. I have a further refinement in the next post.