Monday, September 19, 2011

Thoughts about GC

Suppose I have a program that, when run, allocates a total of 10 GiB (10×230 bytes). If it happens that I have more than 10 GiB of heap, then I don't need to worry about running out. If my heap is smaller than 10 GiB, I'm going to have a problem.

It is often possible to re-use storage by overwriting the previously written information. If the result computed by a program (including the side effects as part of the ‘result’) is completely invariant with respect to the value of a location in memory, then that memory location could be re-used. Depending upon the program and the runtime system, re-use can be the responsibility of the programmer, or there can be an automatic means of storage management (e.g. a garbage collector). Regardless of how re-usable storage is identified and made available to the program, we can use the pigeonhole principle to determine the minimum number of times that we must re-use some part of memory. For example, if my program uses 10×230 and I have 10×230−1 bytes of heap, then at least one byte somewhere will have to be re-used.

So let us suppose that our program which allocates 10 GiB is to be run with a 2 GiB heap and our runtime system provides a garbage collector so we don't have to manage our storage manually. Suppose further that our garbage collector is not continuous, but has some sort of identifiable ‘cycle’ during which it locates re-usable memory. On each cycle the garbage collector can reclaim at most 2 GiB, so we know the runtime system must perform a minimum of at least five garbage collection cycles. It may perform many, many more than five, but it cannot perform fewer because it would have to recycle more than 100% of the heap at some point.

Another way to state this is that the number of garbage collections times the size of the heap must be equal to or greater than the total amount of storage allocated by a program. Here is a graph that shows the minimum number of collections needed for a program that allocates 50 bytes of storage.
The discrete jumps in the graph are because we aren't considering partial allocations (less than 1 byte). When working with realistic heap sizes, we can approximate this graph by dividing the total allocation by the heap size. The green line in this chart plots 50/x.

More to come...

No comments: