Friday, January 7, 2011

Schema evolution finessed

Schema evolution is the bête noire of the object-oriented database world. It is probably the most labor-intensive and error-prone activity that an end-user is likely to attempt. The root problem is that the metadata about persistent objects is implicitly replicated in several ways.

The big advantage of object-oriented databases is that they unify the query language with the ‘business logic’ language. You access and navigate persistent objects in the exact same way you access and navigate the ordinary data structures in the language. The obvious benefit is that the mechanisms of persistence and querying are hidden behind the object abstraction. But this means that the definitions of the data structures are performing double duty. Data structure definitions are used by the compiler to generate the appropriate primitive operations in the code, but they are also needed by the database component to generate the persistent representation of objects. If the data definition changes, objects that conform to the old definition will no longer work.

Programmers generally expect that a recompile is all that is needed to change how a program handles data structures. But if the data definition is used by the database for object storage and retrieval then changing the data definition may change the persistent representation and make it impossible to retrieve data. (Or worse, it could silently retrieve the data incorrectly.)

It turns out that if persistent objects are immutable then there is a way to finesse the issue of schema evolution.

In CLOS, objects are created and initialized when the programmer invokes make-instance. There is a well-defined series of operations that take place and the resulting object is returned to the caller. The important thing here is the intent of the programmer. The arguments given to make-instance contain the specific information that the programmer has determined is needed to correctly initialize the object. The returned instance will depend on the argument values and any default values that are supplied in the class definition. From an information-theoretic view, the argument list to make-instance and the resulting instance ought to contain the same information. So rather than attempting to persist the resulting instance, we instead persist the argument list to make-instance.

The immediate objection to this is that there is no way to know what on earth make-instance constructed! It could be the case that make-instance decides to return NIL. There could be a very complex initialization protocol that defaults certain slot values and computes other slot values from the initargs. This doesn't matter. The programmer is working at the abstraction level where the entire process is kicked off by the invocation of make-instance, and thus his intent is simply to create an instance of an object that is parameterized by the given initargs.

When we retrieve an object from the database, we don't attempt to retrieve the representation of the constructed object. Instead, we retrieve the arguments that were handed to make-instance and simply invoke make-instance again to re-create the object.

Now suppose that we have a database where we are modeling automobiles. Let us assume that we have already stored thousands of automobile instances and that each instance contains, say, the number of doors and the color of the automobile. At some point we decide that we no longer care about the number of doors, but we do want to know the make and model of the automobile. We change the definition of the automobile class in the program. When we invoke make-instance, we now have a different set of initargs. However, the legacy objects in the database are reconstructed by calling make-instance with the old argument lists. This is easily dealt with. The older argument lists will have initargs describing the number of doors and the color of the car. We arrange for make-instance to ignore any initargs that are unused, and we require the programmer to supply ‘reasonable defaults’ for initargs that are not supplied. Thus when retrieving an older object, the count of the number of doors will be ignored, and a default value will be supplied for the make and model.
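The automobile example might look like this in a Python analogue. Again this is a sketch under assumptions: `old_records` stands in for argument lists already in the database, the `"unknown"` defaults are placeholders for whatever ‘reasonable defaults’ the programmer chooses, and the `**ignored` parameter plays the role of arranging for unused initargs to be ignored.

```python
# Argument lists as they were stored under the OLD class definition.
old_records = [
    {"doors": 2, "color": "red"},
    {"doors": 4, "color": "blue"},
]

class Automobile:
    """NEW definition: the door count is gone, make and model are added."""

    def __init__(self, color="unknown", make="unknown", model="unknown", **ignored):
        # **ignored swallows initargs the class no longer uses (e.g. doors).
        self.color = color
        self.make = make    # defaulted when an old record is replayed
        self.model = model  # defaulted when an old record is replayed

def retrieve(initargs):
    """Replay a stored argument list against the current class definition."""
    return Automobile(**initargs)

legacy = [retrieve(r) for r in old_records]
new = retrieve({"color": "green", "make": "Saab", "model": "900"})
```

The old records are never rewritten or migrated. They are reinterpreted by the current constructor each time they are retrieved.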

Note that we no longer require a schema to describe the layout of objects in the database.

There are drawbacks to this strategy. First, we are making the assumption that all the relevant information for object creation is supplied in the call to make-instance. A perverse programmer can easily bypass this assumption. However, it doesn't seem unreasonable to require that programmers either use the object creation protocol in the intended manner or specifically arrange for auxiliary helper objects to be created that hold any missing relevant information. This is meant as a tool for programmers, not as a fool-proof system. (Incidentally, the full-blown system I wrote allows the programmer to declare certain slots as transient. This allows a programmer to cache information needed for the runtime without forcing the information to be persisted.) Second, we give up the ability to side-effect persistent objects. (As I have noted in previous posts, we can mimic side effects reasonably easily.)
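The transient-slot idea can be illustrated roughly as follows. How the full system actually declares transient slots is not shown in this post, so this Python sketch only demonstrates the effect: the `_label_cache` attribute, and the `save` and `load` helpers, are hypothetical names chosen for the example.

```python
class Automobile:
    def __init__(self, color="unknown", make="unknown", model="unknown"):
        self.color = color
        self.make = make
        self.model = model
        # Transient: runtime-only cache, not part of the stored argument list.
        self._label_cache = None

    def label(self):
        """Compute the label once and cache it for the life of this object."""
        if self._label_cache is None:
            self._label_cache = f"{self.color} {self.make} {self.model}"
        return self._label_cache

def save(initargs):
    """Hypothetical: what gets persisted is a copy of the argument list only."""
    return dict(initargs)

def load(record):
    """Re-create the object; transient state starts out empty again."""
    return Automobile(**record)

record = save({"color": "red", "make": "Saab", "model": "900"})
car = load(record)
car.label()              # fills the cache on this in-memory object only
reloaded = load(record)  # a fresh object with an empty cache
```

Because only the argument list is stored, the cache can be filled and discarded freely at runtime without touching the database.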

These disadvantages are minor compared to the difficulties of schema evolution.

I'm happy to answer questions about this.

3 comments:

grant rettke said...

Has this ever been tried before?

Joe Marshall said...

I don't know if it has been tried before.

Faré said...

I used similar MOP-based tricks to have generic object printers, copiers, etc.

Problem is that indeed you need to do something about those slots without keyword initializers, slots that are computed from other slots (e.g. hash values), slots that are initialized from a counter (e.g. OID), slots that are actually plumbing from a larger structure (e.g. indexes back into other objects), etc.

The approach still works, but is actually very low-level, and calls for a higher-level interface of some sort, lest your whole system keep a very low-level feel.